Understanding Boolean Indexing
Boolean indexing leverages boolean masks – arrays of True
and False
values – to filter data. These masks are the same length as the index of your DataFrame. Where a mask value is True
, the corresponding row (or column) is selected; where it’s False
, it’s excluded.
The power lies in creating these masks using conditional expressions applied directly to your DataFrame columns.
Basic Boolean Indexing
Let’s start with a simple example. Imagine you have a DataFrame of customer information:
import pandas as pd
= {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
data 'Age': [25, 30, 22, 28],
'City': ['New York', 'London', 'Paris', 'Tokyo']}
= pd.DataFrame(data)
df print(df)
This will output:
Name Age City
0 Alice 25 New York
1 Bob 30 London
2 Charlie 22 Paris
3 David 28 Tokyo
Now, let’s select only the customers older than 25:
= df['Age'] > 25
older_than_25 print(older_than_25) # This shows the boolean mask
= df[older_than_25]
selected_customers print(selected_customers)
This will first print the boolean mask (True
, True
, False
, True
) and then filter the DataFrame, showing only Bob and David’s information.
Combining Conditions with Logical Operators
Boolean indexing shines when you need to apply multiple conditions. Pandas supports the standard logical operators: &
(and), |
(or), and ~
(not).
To find customers older than 25 and living in London:
= (df['Age'] > 25) & (df['City'] == 'London')
london_and_older print(df[london_and_older])
This will only return Bob’s information, fulfilling both conditions.
Using .isin()
for Multiple Values
The .isin()
method provides a concise way to check for membership in a list or array:
= ['London', 'Paris']
cities_of_interest = df[df['City'].isin(cities_of_interest)]
customers_in_cities print(customers_in_cities)
This neatly selects customers from London and Paris.
Indexing with .loc
and .iloc
for enhanced control
While direct boolean indexing is powerful, combining it with .loc
and .iloc
offers even finer control:
= df['Age'] > 25
older_than_25 = df.loc[older_than_25, ['Name', 'City']]
selected_data print(selected_data)
= df.iloc[[0, 2], [0, 2]]
selected_data_iloc print(selected_data_iloc)
This illustrates how to selectively extract specific columns along with row selection using boolean masks and integer-based indexing, showcasing the flexibility offered by combining approaches.
Beyond Basic Comparisons: Applying Custom Functions
For more complex filtering, you can apply custom functions using .apply()
:
def is_young(age):
return age < 25
'Young'] = df['Age'].apply(is_young)
df[= df[df['Young']]
young_customers print(young_customers)
This example creates a new boolean column (‘Young’) based on a custom function and then uses it for filtering. The possibilities are extensive depending upon your data requirements.