GroupBy Multiple Columns

pandas
Published

May 15, 2023

Grouping by Multiple Columns: The Basics

The core concept remains the same: you specify the columns to group by as a list within the groupby() method. Pandas will then create groups based on the unique combinations of values in those columns.

Let’s start with a simple example. Imagine you have a dataset of sales transactions:

import pandas as pd

data = {'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
        'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Sales': [100, 150, 200, 250, 120, 180]}

df = pd.DataFrame(data)
print(df)

This will output:

  Region Product  Sales
0  North       A    100
1  North       B    150
2  South       A    200
3  South       B    250
4   East       A    120
5   East       B    180

To group by both Region and Product, we use:

grouped = df.groupby(['Region', 'Product'])
print(grouped)

This doesn’t display the grouped data directly. It creates a GroupBy object. To see the actual results, we need to apply an aggregation function:

grouped_sum = grouped['Sales'].sum()
print(grouped_sum)

This outputs:

Region  Product
East    A          120
        B          180
North   A          100
        B          150
South   A          200
        B          250
Name: Sales, dtype: int64

This shows the total sales for each combination of Region and Product.

Beyond sum(): Other Aggregation Functions

The groupby() method works seamlessly with other aggregation functions like mean(), count(), max(), min(), std(), etc.

grouped_mean = grouped['Sales'].mean()
print(grouped_mean)

grouped_count = grouped['Sales'].count()
print(grouped_count)

Applying Multiple Aggregations at Once

Instead of calling multiple aggregation functions separately, you can use agg() to apply several at once:

aggregated = grouped['Sales'].agg(['sum', 'mean', 'count'])
print(aggregated)

This provides a concise summary of sales data for each group.

Handling Missing Values

If your dataset contains missing values (NaN), Pandas handles them intelligently during aggregation. By default, sum() and mean() ignore NaN values. You can control this behavior using the dropna parameter in certain aggregation functions if needed.

Grouping and More Complex Operations

The possibilities extend far beyond simple aggregations. You can use groupby() as a stepping stone for more complex data manipulations, filtering, transformations, and more. The key is to understand how to use the GroupBy object to perform these operations efficiently on your grouped data. This will be explored in future posts.