Why Custom Aggregation Functions?
Standard aggregation functions are excellent for common tasks. But what if you need to calculate something more nuanced? For example:
- Weighted Averages: Calculating an average where different data points have varying weights.
- Custom Metrics: Defining a metric specific to your domain (e.g., a custom performance indicator).
- Complex Calculations: Combining multiple aggregations into a single, meaningful result.
- Data Cleaning/Transformation: Performing aggregation while simultaneously cleaning or transforming data within the aggregation process.
Creating Custom Aggregation Functions
The core concept is to write a Python function that takes a Pandas Series as input and returns a single value representing the aggregated result. This function is then passed to Pandas’ agg() method, which, on a DataFrame, calls it once per column.
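As a minimal sketch of this pattern, the snippet below defines a hypothetical peak_to_mean function and passes it to agg(). The function name, column names, and data are illustrative assumptions, not part of pandas. Given a single callable, agg() on a DataFrame collects one result per column into a Series.

```python
import pandas as pd

# Illustrative data; the column names are made up for this sketch
df = pd.DataFrame({"Latency": [10, 20, 30], "Retries": [5, 5, 5]})

def peak_to_mean(series):
    """Hypothetical custom metric: ratio of the largest value to the mean."""
    return series.max() / series.mean()

# agg() passes each column to peak_to_mean as a Series
result = df.agg(peak_to_mean)
print(result)
```

The result is indexed by column name, so a single value can be read with result["Latency"].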
Example 1: Weighted Average
Let’s say we have sales data with both sales figures and associated weights (representing, perhaps, customer importance):
import pandas as pd

data = {'Sales': [100, 200, 300, 400],
        'Weight': [0.1, 0.3, 0.4, 0.2]}
df = pd.DataFrame(data)

def weighted_average(frame):
    # Needs both columns, so it takes the whole DataFrame rather than one Series
    weights = frame['Weight']
    sales = frame['Sales']
    return (sales * weights).sum() / weights.sum()

weighted_avg = weighted_average(df)
print(weighted_avg)  # Output: 270.0

This example defines weighted_average, which calculates the weighted average of ‘Sales’ using the ‘Weight’ column. Because agg() passes each column to the function separately, a function that needs two columns at once cannot be handed to df.agg() directly. Here it takes the whole DataFrame and is called on df itself.
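As a cross-check, NumPy's np.average accepts a weights argument and computes the same quantity, sum(sales * weights) / sum(weights). The sketch below rebuilds the example data and compares the two approaches; the helper name manual_weighted_average is an illustrative assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Sales": [100, 200, 300, 400],
                   "Weight": [0.1, 0.3, 0.4, 0.2]})

def manual_weighted_average(frame):
    """Weighted mean of 'Sales' using 'Weight', written out by hand."""
    return (frame["Sales"] * frame["Weight"]).sum() / frame["Weight"].sum()

manual = manual_weighted_average(df)
via_numpy = np.average(df["Sales"], weights=df["Weight"])

# Both approaches should agree up to floating-point rounding
print(np.isclose(manual, via_numpy))
```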
Example 2: Custom Percentile
Pandas already provides percentiles through quantile(), but let’s create a function to calculate a custom percentile (e.g., the 85th percentile):
import numpy as np
import pandas as pd

data = {'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(data)

def custom_percentile(series, percentile):
    return np.percentile(series, percentile)

# agg() calls the lambda once per column, passing the column as a Series
percentile_85 = df.agg(lambda x: custom_percentile(x, 85))
print(percentile_85)  # Output: Values 86.5
Here, custom_percentile takes both the series and the desired percentile as input. Because agg() supplies only the series, a lambda function is used to fix the percentile argument at 85.
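An alternative to the lambda, sketched here as one possible approach: agg() forwards extra keyword arguments to the function, so the percentile can be passed by name. The built-in quantile() method, which expects a fraction between 0 and 1 rather than a percentage, serves as a cross-check. The variable names via_kwargs and via_quantile are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Values": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

def custom_percentile(series, percentile):
    return np.percentile(series, percentile)

# Extra keyword arguments to agg() are passed through to the function
via_kwargs = df.agg(custom_percentile, percentile=85)

# Built-in equivalent: quantile() expects a fraction, not a percentage
via_quantile = df["Values"].quantile(0.85)

print(via_kwargs["Values"], via_quantile)
```

Both calls use linear interpolation between neighboring values by default, which is why they agree.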
Example 3: Multiple Aggregations with Custom Functions
We can combine multiple custom aggregations within a single agg() call:
import pandas as pd

data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

def custom_sum(series):
    return series.sum()

def custom_range(series):
    return series.max() - series.min()

aggregated_data = df.agg({'Values': [custom_sum, custom_range]})
print(aggregated_data)
# Output:
#               Values
# custom_sum       150
# custom_range      40
This demonstrates the flexibility of using multiple custom functions within agg(). The result has one row per function, labeled with the function's name, so several calculations come back from a single call.
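Custom functions can also be mixed with pandas' built-in aggregations, which agg() accepts as strings such as 'sum' or 'mean'. The sketch below shows one way to do this; value_range is an illustrative name, and each row of the result is labeled with either the string or the function's name.

```python
import pandas as pd

df = pd.DataFrame({"Values": [10, 20, 30, 40, 50]})

def value_range(series):
    """Illustrative custom aggregation: spread between max and min."""
    return series.max() - series.min()

# A list may mix built-in names (strings) with custom callables
summary = df.agg({"Values": ["sum", "mean", value_range]})
print(summary)
```

Individual results can then be looked up by label, for example summary.loc["value_range", "Values"].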