What are Aggregation Functions?
Aggregation functions perform calculations on a Series or DataFrame, reducing multiple values to a single summary value. This is invaluable for tasks like calculating sums, means, and medians across rows or columns. They let you gain insights quickly from large datasets without manually iterating through each row or column.
Common Aggregation Functions
Pandas offers a wide array of built-in aggregation functions. Let’s explore some of the most commonly used:
1. sum(): Calculating the Sum
The sum() function calculates the sum of values across a specified axis (rows or columns).
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Sum of a single column
column_sum_A = df['A'].sum()
print(f"Sum of column A: {column_sum_A}")
# Sum across the columns of each row
row_sum = df.sum(axis=1)
print(f"Sum of each row:\n{row_sum}")
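One detail worth knowing is how sum() treats missing values. The sketch below assumes a small illustrative Series named s (not part of the example above) and shows that NaN is skipped by default, while skipna=False lets it propagate into the result.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value (an assumption for this sketch)
s = pd.Series([1.0, 2.0, np.nan, 4.0])

# Default behaviour: NaN is ignored (skipna=True)
total_skipping_nan = s.sum()

# With skipna=False, a single NaN makes the whole sum NaN
total_with_nan = s.sum(skipna=False)

print(f"Sum ignoring NaN: {total_skipping_nan}")
print(f"Sum with skipna=False: {total_with_nan}")
```

The same skipna parameter is accepted by mean(), median(), min(), max(), std(), and var().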
2. mean(): Calculating the Average
The mean() function computes the average of values.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Mean of a single column
column_mean_B = df['B'].mean()
print(f"Mean of column B: {column_mean_B}")
# Mean across the columns of each row
row_mean = df.mean(axis=1)
print(f"Mean of each row:\n{row_mean}")
3. count(): Counting Non-Missing Values
The count() function counts the number of non-missing (non-NaN) values.
import pandas as pd
import numpy as np
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Column A contains one NaN, which count() excludes
column_count_A = df['A'].count()
print(f"Count of non-missing values in column A: {column_count_A}")
# Number of non-missing values in each row
row_count = df.count(axis=1)
print(f"Count of non-missing values in each row:\n{row_count}")
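Because count() excludes NaN, it can differ from the total number of rows. A minimal sketch, reusing the same data as above, compares count() with len() and with a direct tally of missing values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [6, 7, 8, 9, 10]})

non_missing_A = df['A'].count()   # excludes the NaN in column A
total_rows = len(df)              # counts every row, missing or not
missing_A = df['A'].isna().sum()  # number of NaN values in column A

print(f"Non-missing in A: {non_missing_A}")
print(f"Total rows: {total_rows}")
print(f"Missing in A: {missing_A}")
```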
4. median(): Finding the Median
The median() function calculates the median (middle value) of a Series or DataFrame.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Median of a single column
column_median_A = df['A'].median()
print(f"Median of column A: {column_median_A}")
# Median across the columns of each row
row_median = df.median(axis=1)
print(f"Median of each row:\n{row_median}")
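Two properties of the median are easy to check with a short sketch. The Series below are made up for illustration: with an even number of values the median is the average of the two middle values, and a single extreme value shifts the mean far more than the median.

```python
import pandas as pd

# Even number of values: the median is the average of the two middle ones
even_values = pd.Series([1, 2, 3, 4])
even_median = even_values.median()

# One large outlier pulls the mean up but leaves the median in the middle
with_outlier = pd.Series([1, 2, 3, 4, 100])
outlier_median = with_outlier.median()
outlier_mean = with_outlier.mean()

print(f"Median of an even-length Series: {even_median}")
print(f"Median with outlier: {outlier_median}, mean with outlier: {outlier_mean}")
```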
5. min() and max(): Finding Minimum and Maximum Values
The min() and max() functions find the minimum and maximum values, respectively.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Smallest value in a single column
column_min_B = df['B'].min()
print(f"Minimum of column B: {column_min_B}")
# Largest value in each row
row_max = df.max(axis=1)
print(f"Maximum of each row:\n{row_max}")
6. std() and var(): Calculating Standard Deviation and Variance
The std() and var() functions calculate the standard deviation and variance, respectively, which measure the spread or dispersion of data.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Standard deviation of a single column
column_std_A = df['A'].std()
print(f"Standard Deviation of column A: {column_std_A}")
# Variance across the columns of each row
row_var = df.var(axis=1)
print(f"Variance of each row:\n{row_var}")
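By default, pandas computes the sample standard deviation and variance (ddof=1, dividing by n - 1), whereas NumPy's np.std defaults to the population formula (ddof=0). The sketch below, using an illustrative Series named values, shows how the ddof parameter switches between the two.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# pandas default: sample statistics (ddof=1, divide by n - 1)
sample_var = values.var()
sample_std = values.std()

# Population statistics: pass ddof=0 (divide by n)
population_var = values.var(ddof=0)
population_std = values.std(ddof=0)

# NumPy's default is ddof=0, so it matches the population value
numpy_std = np.std(values.to_numpy())

print(f"Sample variance: {sample_var}, population variance: {population_var}")
print(f"Sample std: {sample_std}, population std: {population_std}")
print(f"NumPy std (default): {numpy_std}")
```

This difference is a common source of mismatched results when comparing pandas output with NumPy output on the same data.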
Aggregation with agg()
The agg() function allows for applying multiple aggregation functions simultaneously.
import pandas as pd
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Apply several aggregations to one column; the result is a Series indexed by function name
column_agg_A = df['A'].agg(['sum', 'mean', 'median'])
print(f"Multiple aggregations on column A:\n{column_agg_A}")
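agg() also works on a whole DataFrame. As a minimal sketch on the same data, passing a list applies every function to every column, while passing a dictionary applies a different function to each column; the names summary and per_column are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, 8, 9, 10]})

# A list of functions: one row per function, one column per original column
summary = df.agg(['sum', 'mean'])

# A dictionary: a different aggregation for each column
per_column = df.agg({'A': 'sum', 'B': 'mean'})

print(f"Sum and mean of every column:\n{summary}")
print(f"Sum of A and mean of B:\n{per_column}")
```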
This is just a selection of the aggregation functions available in Pandas. The Pandas documentation covers further functionality for more complex data analysis. Remember that axis=0 (the default) aggregates down the rows and returns one value per column, while axis=1 aggregates across the columns and returns one value per row. Understanding this is crucial for obtaining the desired results.
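The effect of the axis argument is easiest to see side by side. A small sketch on the same data used throughout; the names per_column and per_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, 8, 9, 10]})

# axis=0 (default): one result per column, indexed by column name
per_column = df.sum(axis=0)

# axis=1: one result per row, indexed by the row labels
per_row = df.sum(axis=1)

print(f"Sum of each column:\n{per_column}")
print(f"Sum of each row:\n{per_row}")
```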