Importing CSV Data into a DataFrame
The primary function for importing CSV data is pandas.read_csv(). This function offers a wide range of parameters for customizing the import process, handling different delimiters, and managing missing data.
Let’s start with a simple example:
import pandas as pd
data = pd.read_csv("data.csv")
print(data.head())
This code snippet assumes you have a CSV file named “data.csv” in the same directory as your Python script. pd.read_csv() automatically infers the data types of each column. data.head() displays the first five rows for a quick preview.
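To see what read_csv() inferred, you can inspect the DataFrame’s dtypes attribute. The sketch below uses io.StringIO as an in-memory stand-in for the “data.csv” file, with made-up column names:

```python
import io

import pandas as pd

# io.StringIO stands in for an on-disk "data.csv"; the columns are invented
csv_text = "name,age,score\nAlice,30,9.5\nBob,25,8.0\n"
data = pd.read_csv(io.StringIO(csv_text))

# read_csv infers one dtype per column: object, int64, float64 here
print(data.dtypes)
print(data.head())
```

Passing a real file path works identically; StringIO is used only so the example is self-contained.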
Handling Different Delimiters
CSV files don’t always use commas as delimiters. read_csv() allows you to specify a different delimiter using the sep or delimiter argument:
data_tab = pd.read_csv("data_tab.csv", sep="\t")
print(data_tab.head())
This example reads a file “data_tab.csv” where columns are separated by tabs.
Specifying Data Types
For better performance and control, you can explicitly define the data types of your columns using the dtype argument:
data_types = {'column1': int, 'column2': str, 'column3': float}
data_typed = pd.read_csv("data.csv", dtype=data_types)
print(data_typed.dtypes)
This ensures that column1
is treated as integers, column2
as strings, and column3
as floats.
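One caveat: a plain int dtype fails if the column contains missing values, because NaN cannot be stored in a NumPy integer column. Pandas’ nullable 'Int64' dtype (capital I) handles this case. A minimal sketch, again using io.StringIO as a stand-in for the file:

```python
import io

import pandas as pd

# The second row has an empty column1; dtype=int would raise an error here
csv_text = "column1,column2,column3\n1,a,1.5\n,b,2.5\n"
data_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'column1': 'Int64', 'column2': str, 'column3': float},
)

# column1 keeps integer semantics while representing the gap as <NA>
print(data_typed.dtypes)
```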
Handling Missing Data
Missing values in CSV files are often represented by empty strings or special values like “NA” or “NULL”. read_csv() provides options for handling these:
data_missing = pd.read_csv("data_missing.csv", na_values=['NA', ''])
print(data_missing.isnull().sum())  # Check for missing values after import
This example treats “NA” and empty strings as missing values (NaN).
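After flagging missing values, a common next step is replacing them with a default via fillna(). A sketch with invented data, using io.StringIO in place of “data_missing.csv”:

```python
import io

import pandas as pd

# "NA" and the empty field both become NaN on import
csv_text = "city,temp\nOslo,NA\nRome,21\nLima,\n"
data_missing = pd.read_csv(io.StringIO(csv_text), na_values=['NA', ''])
print(data_missing.isnull().sum())  # two missing values in 'temp'

# Fill the gaps in 'temp' with a placeholder value of 0
filled = data_missing.fillna({'temp': 0})
```

Whether 0, a column mean, or dropping the rows (dropna()) is appropriate depends on your data.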
Selecting Columns and Rows
Once your data is in a DataFrame, selecting specific columns or rows is straightforward:
# Select a single column
column_data = data['column_name']
print(column_data)

# Select multiple columns
multiple_columns = data[['column_name1', 'column_name2']]
print(multiple_columns)

# Filter rows based on a condition
filtered_data = data[data['column_name'] > 10]
print(filtered_data)
This demonstrates selecting a column, selecting multiple columns, and filtering rows based on a condition. This is just a starting point: pandas offers a vast array of functionality for data cleaning, transformation, and analysis.
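For more explicit selection, pandas also provides the label-based .loc and position-based .iloc indexers, which let you pick rows and columns in one step. A short sketch with invented column names and data:

```python
import io

import pandas as pd

csv_text = "column_name,other\n5,a\n15,b\n25,c\n"
data = pd.read_csv(io.StringIO(csv_text))

# .loc: rows where the condition holds, restricted to one column
subset = data.loc[data['column_name'] > 10, 'other']

# .iloc: first two rows, first column, selected by position
first_rows = data.iloc[:2, 0]

print(subset)
print(first_rows)
```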
Working with Large CSV Files
For extremely large CSV files that might not fit in memory, consider using the chunksize argument in read_csv() to process the data in smaller, manageable chunks:
chunksize = 1000  # Process 1000 rows at a time
for chunk in pd.read_csv("large_file.csv", chunksize=chunksize):
    # Process each chunk individually
    # ... your data processing logic here ...
    print(chunk.head())
This iterates through the file in chunks, allowing for more memory-efficient processing.
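A common pattern is to reduce each chunk to a small result and combine those results, so the full file never needs to fit in memory at once. The sketch below sums one column across chunks; the data is generated in memory via io.StringIO purely for illustration:

```python
import io

import pandas as pd

# Stand-in for a large file: a single 'value' column holding 0..9
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# Accumulate a running total chunk by chunk instead of loading everything
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()

print(total)  # 45, the sum of 0..9
```

The same shape works for counts, group-wise aggregates, or writing filtered chunks out to a new file.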