DataFrame from CSV Files

pandas
Published

December 6, 2024

Importing CSV Data into a DataFrame

The primary function for importing CSV data is pandas.read_csv(). This function offers a wide range of parameters for customizing the import process, handling different delimiters, and managing missing data.

Let’s start with a simple example:

import pandas as pd

data = pd.read_csv("data.csv")

print(data.head())

This code snippet assumes you have a CSV file named “data.csv” in the same directory as your Python script. pd.read_csv() automatically infers the data types of each column. data.head() displays the first five rows for a quick preview.

Handling Different Delimiters

CSV files don’t always use commas as delimiters. read_csv() allows you to specify a different delimiter using the sep or delimiter argument:

data_tab = pd.read_csv("data_tab.csv", sep="\t")
print(data_tab.head())

This example reads a file “data_tab.csv” where columns are separated by tabs.

Specifying Data Types

For better performance and control, you can explicitly define the data types of your columns using the dtype argument:

data_types = {'column1': int, 'column2': str, 'column3': float}
data_typed = pd.read_csv("data.csv", dtype=data_types)
print(data_typed.dtypes)

This ensures that column1 is treated as integers, column2 as strings, and column3 as floats.

Handling Missing Data

Missing values in CSV files are often represented by empty strings or special values like “NA” or “NULL”. read_csv() provides options for handling these:

data_missing = pd.read_csv("data_missing.csv", na_values=['NA', ''])
print(data_missing.isnull().sum()) #Check for missing values after import

This example treats “NA” and empty strings as missing values (NaN).

Selecting Columns and Rows

Once your data is in a DataFrame, selecting specific columns or rows is straightforward:

column_data = data['column_name']
print(column_data)


#Select multiple columns
multiple_columns = data[['column_name1','column_name2']]
print(multiple_columns)

filtered_data = data[data['column_name'] > 10]
print(filtered_data)

This demonstrates selecting a column, multiple columns and filtering rows based on a condition. This is just a starting point. Pandas offers a vast array of functionalities for data cleaning, transformation, and analysis.

Working with Large CSV Files

For extremely large CSV files that might not fit in memory, consider using the chunksize argument in read_csv() to process the data in smaller, manageable chunks:

chunksize = 1000  # Process 1000 rows at a time
for chunk in pd.read_csv("large_file.csv", chunksize=chunksize):
    # Process each chunk individually
    # ... your data processing logic here ...
    print(chunk.head())

This iterates through the file in chunks, allowing for more memory-efficient processing.