Boolean Indexing: The Foundation of Pandas Filtering
Boolean indexing is the core mechanism behind Pandas filtering. It involves creating a boolean Series (a Series containing only True
and False
values) that acts as a mask, selecting only the rows where the mask is True
.
Let’s illustrate with an example:
import pandas as pd
= {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
data 'Age': [25, 30, 22, 28],
'City': ['New York', 'London', 'Paris', 'Tokyo']}
= pd.DataFrame(data)
df
= df[df['Age'] > 25]
filtered_df print(filtered_df)
This code first creates a DataFrame. Then, df['Age'] > 25
creates a boolean Series where True
indicates ages greater than 25. This Series is used to select rows from the DataFrame, resulting in a new DataFrame containing only those individuals older than 25.
Combining Multiple Conditions
You can combine multiple conditions using logical operators like &
(and), |
(or), and ~
(not).
= df[(df['Age'] > 25) & (df['City'] == 'London')]
filtered_df print(filtered_df)
= df[(df['Age'] > 25) | (df['City'] == 'Paris')]
filtered_df print(filtered_df)
= df[~(df['City'] == 'New York')]
filtered_df print(filtered_df)
Remember to use parentheses to group conditions correctly, ensuring the logical operations are applied as intended.
The .query()
Method: A More Readable Approach
For more complex filtering conditions, the .query()
method offers a more readable syntax:
= df.query('Age > 25 and City == "London"')
filtered_df print(filtered_df)
This achieves the same result as the previous AND condition example but with improved readability, especially when dealing with many conditions. Note that the column names are used directly within the query string.
Filtering with .isin()
The .isin()
method is useful when you want to check if values are present in a specific list:
= ['New York', 'London']
cities_to_include = df[df['City'].isin(cities_to_include)]
filtered_df print(filtered_df)
This efficiently filters based on whether the ‘City’ column values are contained within cities_to_include
.
Filtering with str
methods (for string data)
Pandas provides convenient string methods for filtering text data. For example:
#Filter for names containing "a"
= df[df['Name'].str.contains('a')]
filtered_df print(filtered_df)
#Filter for names starting with "A"
= df[df['Name'].str.startswith('A')]
filtered_df print(filtered_df)
These string methods provide powerful tools for complex text-based filtering. Remember that these methods are applied to string columns and not numeric columns.
Handling Missing Data During Filtering
Missing data (NaN) can affect filtering results. Be mindful of how you handle NaN values. You may need to use the .dropna()
method to remove rows with missing data before or after filtering, depending on your requirements. You can also use the .notna()
method to include only rows with non-missing values.