Filtering Data in Pandas

pandas
Published

December 25, 2023

Boolean Indexing: The Foundation of Pandas Filtering

Boolean indexing is the core mechanism behind Pandas filtering. It involves creating a boolean Series (a Series containing only True and False values) that acts as a mask, selecting only the rows where the mask is True.

Let’s illustrate with an example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}

df = pd.DataFrame(data)

filtered_df = df[df['Age'] > 25]
print(filtered_df)

This code first creates a DataFrame. Then, df['Age'] > 25 creates a boolean Series where True indicates ages greater than 25. This Series is used to select rows from the DataFrame, resulting in a new DataFrame containing only those individuals older than 25.

Combining Multiple Conditions

You can combine multiple conditions using logical operators like & (and), | (or), and ~ (not).

filtered_df = df[(df['Age'] > 25) & (df['City'] == 'London')]
print(filtered_df)

filtered_df = df[(df['Age'] > 25) | (df['City'] == 'Paris')]
print(filtered_df)

filtered_df = df[~(df['City'] == 'New York')]
print(filtered_df)

Remember to use parentheses to group conditions correctly, ensuring the logical operations are applied as intended.

The .query() Method: A More Readable Approach

For more complex filtering conditions, the .query() method offers a more readable syntax:

filtered_df = df.query('Age > 25 and City == "London"')
print(filtered_df)

This achieves the same result as the previous AND condition example but with improved readability, especially when dealing with many conditions. Note that the column names are used directly within the query string.

Filtering with .isin()

The .isin() method is useful when you want to check if values are present in a specific list:

cities_to_include = ['New York', 'London']
filtered_df = df[df['City'].isin(cities_to_include)]
print(filtered_df)

This efficiently filters based on whether the ‘City’ column values are contained within cities_to_include.

Filtering with str methods (for string data)

Pandas provides convenient string methods for filtering text data. For example:

#Filter for names containing "a"
filtered_df = df[df['Name'].str.contains('a')]
print(filtered_df)

#Filter for names starting with "A"
filtered_df = df[df['Name'].str.startswith('A')]
print(filtered_df)

These string methods provide powerful tools for complex text-based filtering. Remember that these methods are applied to string columns and not numeric columns.

Handling Missing Data During Filtering

Missing data (NaN) can affect filtering results. Be mindful of how you handle NaN values. You may need to use the .dropna() method to remove rows with missing data before or after filtering, depending on your requirements. You can also use the .notna() method to include only rows with non-missing values.