Understanding Cross Tabulation
A cross tabulation summarizes the frequency distribution of two or more categorical variables. It shows how many observations fall into each combination of categories. For instance, you might use it to analyze the relationship between gender and purchase preference, or between age group and voting behavior.
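Before turning to pandas, it may help to see what "counting each combination of categories" means in plain Python. The sketch below uses a small hypothetical list of (gender, purchase) pairs; the variable names observations and cell_counts are illustrative only.

```python
from collections import Counter

# Hypothetical observations: one (gender, purchase) pair per record.
observations = [
    ("Male", "Yes"), ("Female", "No"), ("Male", "Yes"),
    ("Female", "Yes"), ("Male", "No"),
]

# A cross tabulation is, at its core, a count per combination of categories.
cell_counts = Counter(observations)

for (gender, purchase), count in sorted(cell_counts.items()):
    print(f"{gender:<6} {purchase:<3} {count}")
```

Each key of cell_counts corresponds to one cell of the table; pd.crosstab() arranges the same counts into rows and columns.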
Basic Cross Tabulation with pd.crosstab()
Let’s start with a simple example. We’ll create a sample DataFrame:
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
        'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']}
df = pd.DataFrame(data)
print(df)
Now, let’s generate the cross tabulation:
crosstab = pd.crosstab(df['Gender'], df['Purchase'])
print(crosstab)
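This outputs a table with one row per gender and one column per purchase outcome, where each cell holds the number of observations in that combination. For this data, 2 females and 1 male did not purchase, while 2 females and 3 males did.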
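Because the result is an ordinary DataFrame, individual cells can be read back with .loc. The following is a minimal sketch using the same sample data; the names male_yes and female_no are illustrative.

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})

crosstab = pd.crosstab(df['Gender'], df['Purchase'])

# Rows are Gender values, columns are Purchase values.
male_yes = crosstab.loc['Male', 'Yes']     # males who purchased
female_no = crosstab.loc['Female', 'No']   # females who did not purchase
print(male_yes, female_no)

# Every observation lands in exactly one cell, so the cells sum to the row count.
print(crosstab.to_numpy().sum() == len(df))
```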
Adding Margins and Normalization
pd.crosstab() offers several options to enhance the output. The margins parameter adds row and column totals:
crosstab_margins = pd.crosstab(df['Gender'], df['Purchase'], margins=True)
print(crosstab_margins)
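By default the totals row and column are labeled All. As a small sketch, the margins_name parameter of pd.crosstab() can rename them; the label 'Total' below is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# margins=True appends totals; margins_name sets the label used for them.
crosstab_total = pd.crosstab(df['Gender'], df['Purchase'],
                             margins=True, margins_name='Total')
print(crosstab_total)

# The bottom-right cell is the grand total, i.e. the number of observations.
grand_total = crosstab_total.loc['Total', 'Total']
print(grand_total == len(df))
```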
You can normalize the table to display proportions instead of counts. For example, to normalize by rows:
crosstab_normalized = pd.crosstab(df['Gender'], df['Purchase'], normalize='index')
print(crosstab_normalized)
This will show the proportion of purchases for each gender. You can also normalize by columns (normalize='columns') or the entire table (normalize='all').
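To make the difference between the three modes concrete, the sketch below computes all of them on the same sample data and checks what each one sums to. The variable names by_row, by_col, and overall are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# Each row sums to 1: share of Yes/No within each gender.
by_row = pd.crosstab(df['Gender'], df['Purchase'], normalize='index')

# Each column sums to 1: share of each gender within Yes and within No.
by_col = pd.crosstab(df['Gender'], df['Purchase'], normalize='columns')

# The whole table sums to 1: share of each cell among all observations.
overall = pd.crosstab(df['Gender'], df['Purchase'], normalize='all')

print(by_row.sum(axis=1))
print(by_col.sum(axis=0))
print(overall.to_numpy().sum())
```

The same cell therefore answers a different question in each mode: for males who purchased, it is the share among males, the share among buyers, or the share among all observations.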
Handling Multiple Variables
pd.crosstab() can handle more than two variables. Let’s add another column to our DataFrame:
data['AgeGroup'] = ['Young', 'Old', 'Young', 'Old', 'Young', 'Old', 'Young', 'Old']
df = pd.DataFrame(data)
print(df)
crosstab_multiple = pd.crosstab(df['Gender'], [df['Purchase'], df['AgeGroup']])
print(crosstab_multiple)
This creates a cross tabulation showing the relationship between gender and the combination of purchase and age group.
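Passing a list for the columns produces a two-level column index, which changes how cells are selected. The sketch below rebuilds the three-variable table and shows one way to slice it; buyers and male_yes_young are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'AgeGroup': ['Young', 'Old', 'Young', 'Old', 'Young', 'Old', 'Young', 'Old'],
})

crosstab_multiple = pd.crosstab(df['Gender'], [df['Purchase'], df['AgeGroup']])

# The columns are a MultiIndex with one level per column variable.
print(list(crosstab_multiple.columns.names))

# Selecting on the first level returns the sub-table for that Purchase value.
buyers = crosstab_multiple['Yes']
print(buyers)

# A single cell is addressed with a (Purchase, AgeGroup) tuple.
male_yes_young = crosstab_multiple.loc['Male', ('Yes', 'Young')]
print(male_yes_young)
```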
Using Aggfunc for More Complex Summaries
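Instead of just counts, you can use the aggfunc parameter together with values to calculate other statistics. A statistic such as the mean needs numeric values, and AgeGroup holds strings, so this example first adds an illustrative numeric Age column:
# Illustrative ages (assumed values, chosen to match the AgeGroup labels)
df['Age'] = [25, 60, 30, 55, 22, 65, 28, 70]
crosstab_mean = pd.crosstab(df['Gender'], df['Purchase'], values=df['Age'], aggfunc='mean')
print(crosstab_mean)
This shows the average age by gender and purchase status. Remember that values and aggfunc must be specified together for this to work. Other aggregations, such as 'sum' or 'median', or functions from NumPy or other libraries, can be passed as aggfunc as appropriate.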
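As a further sketch of aggfunc, the example below assumes a hypothetical numeric Visits column (store visits per customer, not part of the original data) and compares two aggregations over it. The aggregation is given by name as a string here, which pandas accepts alongside callables.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    # Hypothetical numeric column, added only for illustration.
    'Visits': [3, 1, 5, 2, 1, 4, 2, 6],
})

# Total visits in each Gender x Purchase cell.
total_visits = pd.crosstab(df['Gender'], df['Purchase'],
                           values=df['Visits'], aggfunc='sum')

# Average visits in each Gender x Purchase cell.
mean_visits = pd.crosstab(df['Gender'], df['Purchase'],
                          values=df['Visits'], aggfunc='mean')

print(total_visits)
print(mean_visits)
```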
Customizing the Output
You can add labels for better readability:
crosstab_labeled = pd.crosstab(df['Gender'], df['Purchase'], rownames=['Gender'], colnames=['Purchase'])
print(crosstab_labeled)
This sets the row and column axis names in the final cross-tabulation output. Experiment with different options to tailor the table to the needs of your data analysis.
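When the inputs are named Series, pandas already takes the axis names from them, so rownames and colnames matter most for unnamed inputs such as NumPy arrays. The sketch below illustrates that case; the array names gender and purchase are illustrative.

```python
import numpy as np
import pandas as pd

# Unnamed arrays carry no labels of their own.
gender = np.array(['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'])
purchase = np.array(['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# Without explicit names, pandas falls back to generic placeholder axis names.
unlabeled = pd.crosstab(gender, purchase)
print(unlabeled.index.name, unlabeled.columns.name)

# rownames/colnames take one name per array passed for that axis.
labeled = pd.crosstab(gender, purchase, rownames=['Gender'], colnames=['Purchase'])
print(labeled)
```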