Why Custom Aggregation Functions?
Standard aggregation functions are excellent for common tasks. But what if you need to calculate something more nuanced? For example:
- Weighted Averages: Calculating an average where different data points have varying weights.
- Custom Metrics: Defining a metric specific to your domain (e.g., a custom performance indicator).
- Complex Calculations: Combining multiple aggregations into a single, meaningful result.
- Data Cleaning/Transformation: Performing aggregation while simultaneously cleaning or transforming data within the aggregation process.
Creating Custom Aggregation Functions
The core concept is to write a Python function that takes a Pandas Series as input and returns a single value representing the aggregated result. This function is then passed to Pandas’ agg() method.
Example 1: Weighted Average
Let’s say we have sales data with both sales figures and associated weights (representing, perhaps, customer importance):
import pandas as pd
data = {'Sales': [100, 200, 300, 400],
'Weight': [0.1, 0.3, 0.4, 0.2]}
df = pd.DataFrame(data)
def weighted_average(series):
weights = series['Weight']
sales = series['Sales']
return (sales * weights).sum() / weights.sum()
weighted_avg = df.agg(weighted_average)
print(weighted_avg) #Output: Sales 260.0This example defines weighted_average which calculates the weighted average of ‘Sales’ using the ‘Weight’ column. The agg() method applies this function to the entire DataFrame.
Example 2: Custom Percentile
Pandas provides percentiles, but let’s create a function to calculate a custom percentile (e.g., the 85th percentile):
import numpy as np
import pandas as pd
data = {'Values': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(data)
def custom_percentile(series, percentile):
return np.percentile(series, percentile)
percentile_85 = df.agg({'Values': lambda x: custom_percentile(x, 85)})
print(percentile_85) #Output: Values 85.0Here, custom_percentile takes both the series and the desired percentile as input. Note the use of a lambda function for brevity.
Example 3: Multiple Aggregations with Custom Functions
We can combine multiple custom aggregations within a single agg() call:
import pandas as pd
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
def custom_sum(series):
return series.sum()
def custom_range(series):
return series.max() - series.min()
aggregated_data = df.agg({'Values': [custom_sum, custom_range]})
print(aggregated_data)
#Output: ValuesThis demonstrates the flexibility of using multiple custom functions within agg(), providing a concise way to perform diverse calculations.