Custom Aggregation Functions

pandas
Published: August 23, 2024

Why Custom Aggregation Functions?

Built-in aggregation functions such as sum(), mean(), and max() cover the common tasks. But what if you need to calculate something more nuanced? For example:

  • Weighted Averages: Calculating an average where different data points have varying weights.
  • Custom Metrics: Defining a metric specific to your domain (e.g., a custom performance indicator).
  • Complex Calculations: Combining multiple aggregations into a single, meaningful result.
  • Data Cleaning/Transformation: Cleaning or transforming values (for example, dropping missing entries) inside the aggregation itself.

Creating Custom Aggregation Functions

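The core concept is to write a Python function that takes a pandas Series as input and returns a single value representing the aggregated result. This function is then passed to the agg() method. On a DataFrame, agg() calls the function once per column, so the function sees one column at a time rather than the whole table.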
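As a minimal sketch of that contract, the function below reduces one Series to one number, and agg() returns one result per column. The name coefficient_of_variation and the column names are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 10.0, 10.0, 10.0]})

def coefficient_of_variation(series):
    # Takes one column (a Series) and returns a single number:
    # the sample standard deviation divided by the mean.
    return series.std() / series.mean()

# On a DataFrame, agg() calls the function once per column and
# returns a Series indexed by column name.
result = df.agg(coefficient_of_variation)
print(result)
```

Column B is constant, so its result is 0.0; column A yields a positive ratio.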

Example 1: Weighted Average

Let’s say we have sales data with both sales figures and associated weights (representing, perhaps, customer importance):

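import pandas as pd

data = {'Sales': [100, 200, 300, 400],
        'Weight': [0.1, 0.3, 0.4, 0.2]}
df = pd.DataFrame(data)

def weighted_average(series, weights):
    # Multiply each value by its weight, then normalize by the total weight.
    return (series * weights).sum() / weights.sum()


# agg() passes one column at a time, so the weights are supplied separately
# and only the 'Sales' column is aggregated.
weighted_avg = df[['Sales']].agg(lambda s: weighted_average(s, df['Weight']))
print(weighted_avg)  # Output: Sales    270.0

This example defines weighted_average, which calculates the weighted average of ‘Sales’ using the ‘Weight’ column. Because agg() hands the function a single column at a time, the function cannot look up ‘Weight’ inside the Series it receives. The lambda passes the weights in explicitly, and agg() is applied only to the ‘Sales’ column.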
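The weighted average here is sum(Sales × Weight) / sum(Weight). As a sanity check, the sketch below compares that formula with NumPy's np.average on the same data. It is a cross-check under the same sample values, not a replacement for the agg() approach:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sales': [100, 200, 300, 400],
                   'Weight': [0.1, 0.3, 0.4, 0.2]})

# Manual formula: sum(value * weight) / sum(weight)
manual = (df['Sales'] * df['Weight']).sum() / df['Weight'].sum()

# NumPy's built-in weighted mean over the same two columns
builtin = np.average(df['Sales'], weights=df['Weight'])

print(manual, builtin)
```

Both values should agree up to floating-point rounding.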

Example 2: Custom Percentile

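pandas provides Series.quantile() for percentiles, but let’s wrap it in a function that takes the percentile as a parameter (e.g., the 85th percentile):

import pandas as pd

data = {'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(data)


def custom_percentile(series, percentile):
    # Series.quantile expects a fraction in [0, 1], so convert from percent.
    # It interpolates linearly between data points by default.
    return series.quantile(percentile / 100)

percentile_85 = df.agg({'Values': lambda x: custom_percentile(x, 85)})
print(percentile_85)  # Output: Values    86.5

Here, custom_percentile takes both the series and the desired percentile as input. Note the use of a lambda function to fix the percentile argument, since agg() supplies only the column. The function calls Series.quantile rather than np.percentile because np.percentile also accepts a single number, and some pandas versions first try a callable on each element inside Series.agg(); a function that works on scalars can then return one value per row instead of one aggregate.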
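A lambda is one way to fix the extra argument; functools.partial from the standard library is another. A minimal sketch, where percentile_of is a hypothetical helper built on Series.quantile:

```python
from functools import partial

import pandas as pd

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

def percentile_of(series, percentile):
    # Hypothetical helper: convert percent to the fraction
    # that Series.quantile expects.
    return series.quantile(percentile / 100)

# partial() fixes the percentile argument, leaving a one-argument
# callable that agg() can call with each column.
p85 = partial(percentile_of, percentile=85)

result = df.agg(p85)
print(result)
```

A partial keeps the percentile visible as a named argument, which can be easier to reuse than an inline lambda when the same setting is needed in several places.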

Example 3: Multiple Aggregations with Custom Functions

We can combine multiple custom aggregations within a single agg() call:

import pandas as pd

data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)


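def custom_sum(series):
    # Total of all values in the column.
    return series.sum()

def custom_range(series):
    # Spread between the largest and smallest value.
    return series.max() - series.min()


aggregated_data = df.agg({'Values': [custom_sum, custom_range]})
print(aggregated_data)
# Output:
#               Values
# custom_sum       150
# custom_range      40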

This demonstrates that agg() accepts a list of custom functions for a column: the result has one row per function, labeled with the function’s name, so several calculations come back from a single call.
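The same list of functions can also be applied per group. The sketch below assumes a hypothetical ‘Group’ column that is not part of the example above; each function receives the values of one group at a time:

```python
import pandas as pd

# 'Group' is a hypothetical column added for this sketch.
df = pd.DataFrame({'Group': ['a', 'a', 'b', 'b', 'b'],
                   'Values': [10, 20, 30, 40, 50]})

def custom_sum(series):
    return series.sum()

def custom_range(series):
    return series.max() - series.min()

# One row per group, one column per custom function.
per_group = df.groupby('Group')['Values'].agg([custom_sum, custom_range])
print(per_group)
```

Here the function names become column labels, and the group keys become the index.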