GroupBy with Aggregation

pandas
Published

December 12, 2024

Understanding GroupBy

The groupby() method in Pandas allows you to group rows of a DataFrame based on the values of one or more columns. Think of it as creating subsets of your data based on shared characteristics. Once grouped, you can then apply aggregate functions to calculate summary statistics for each group.

Basic GroupBy with Aggregation

Let’s start with a simple example. Suppose we have a DataFrame containing sales data:

import pandas as pd

data = {'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
        'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Sales': [100, 150, 120, 80, 90, 110]}

df = pd.DataFrame(data)
print(df)

This will output:

  Region Product  Sales
0  North       A    100
1  North       B    150
2  South       A    120
3  South       B     80
4   East       A     90
5   East       B    110

To calculate the total sales for each region, we can use groupby() and the sum() aggregation function:

region_sales = df.groupby('Region')['Sales'].sum()
print(region_sales)

This will produce:

Region
East     200
North    250
South    200
Name: Sales, dtype: int64

This shows the total sales for each region. We grouped by ‘Region’ and then aggregated the ‘Sales’ column using the sum.

Multiple Aggregation Functions

You can apply multiple aggregation functions simultaneously using the agg() method:

region_summary = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'min', 'max'])
print(region_summary)

This will output a table with the sum, mean, minimum, and maximum sales for each region.

Grouping by Multiple Columns

You can group by multiple columns to create more granular groupings. For instance, to find the total sales for each region and product:

region_product_sales = df.groupby(['Region', 'Product'])['Sales'].sum()
print(region_product_sales)

This provides a more detailed breakdown of sales.

Custom Aggregation Functions

You can also define your own custom aggregation functions. For example, to calculate the range of sales for each region:

def range_fn(x):
  return x.max() - x.min()

region_range = df.groupby('Region')['Sales'].agg(range_fn)
print(region_range)

This demonstrates the flexibility of groupby() and aggregation in Pandas. You can adapt these techniques to analyze your data effectively, regardless of its complexity. Remember to explore the vast array of aggregation functions available in Pandas to find the best fit for your analytical needs. This allows for powerful data summarization and insightful analysis.