Boolean Indexing in Pandas

pandas
Published

May 28, 2023

Understanding Boolean Indexing

Boolean indexing leverages boolean masks – arrays of True and False values – to filter data. These masks are the same length as the index of your DataFrame. Where a mask value is True, the corresponding row (or column) is selected; where it’s False, it’s excluded.

The power lies in creating these masks using conditional expressions applied directly to your DataFrame columns.

Basic Boolean Indexing

Let’s start with a simple example. Imagine you have a DataFrame of customer information:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
print(df)

This will output:

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   22     Paris
3    David   28     Tokyo

Now, let’s select only the customers older than 25:

older_than_25 = df['Age'] > 25
print(older_than_25)  # This shows the boolean mask
selected_customers = df[older_than_25]
print(selected_customers)

This will first print the boolean mask (True, True, False, True) and then filter the DataFrame, showing only Bob and David’s information.

Combining Conditions with Logical Operators

Boolean indexing shines when you need to apply multiple conditions. Pandas supports the standard logical operators: & (and), | (or), and ~ (not).

To find customers older than 25 and living in London:

london_and_older = (df['Age'] > 25) & (df['City'] == 'London')
print(df[london_and_older])

This will only return Bob’s information, fulfilling both conditions.

Using .isin() for Multiple Values

The .isin() method provides a concise way to check for membership in a list or array:

cities_of_interest = ['London', 'Paris']
customers_in_cities = df[df['City'].isin(cities_of_interest)]
print(customers_in_cities)

This neatly selects customers from London and Paris.

Indexing with .loc and .iloc for enhanced control

While direct boolean indexing is powerful, combining it with .loc and .iloc offers even finer control:

older_than_25 = df['Age'] > 25
selected_data = df.loc[older_than_25, ['Name', 'City']]
print(selected_data)


selected_data_iloc = df.iloc[[0, 2], [0, 2]]
print(selected_data_iloc)

This illustrates how to selectively extract specific columns along with row selection using boolean masks and integer-based indexing, showcasing the flexibility offered by combining approaches.

Beyond Basic Comparisons: Applying Custom Functions

For more complex filtering, you can apply custom functions using .apply():

def is_young(age):
  return age < 25

df['Young'] = df['Age'].apply(is_young)
young_customers = df[df['Young']]
print(young_customers)

This example creates a new boolean column (‘Young’) based on a custom function and then uses it for filtering. The possibilities are extensive depending upon your data requirements.