Understanding Cross Tabulation
A cross tabulation summarizes the frequency distribution of two or more categorical variables. It shows how many observations fall into each combination of categories. For instance, you might use it to analyze the relationship between gender and purchase preference, or between age group and voting behavior.
Basic Cross Tabulation with pd.crosstab()
Let’s start with a simple example. We’ll create a sample DataFrame:
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']}
df = pd.DataFrame(data)
print(df)Now, let’s generate the cross tabulation:
crosstab = pd.crosstab(df['Gender'], df['Purchase'])
print(crosstab)This will output a table showing the counts of males and females who purchased and did not purchase.
Adding Margins and Normalization
pd.crosstab() offers several options to enhance the output. The margins parameter adds row and column totals:
crosstab_margins = pd.crosstab(df['Gender'], df['Purchase'], margins=True)
print(crosstab_margins)You can normalize the table to display proportions instead of counts. For example, to normalize by rows:
crosstab_normalized = pd.crosstab(df['Gender'], df['Purchase'], normalize='index')
print(crosstab_normalized)This will show the proportion of purchases for each gender. You can also normalize by columns (normalize='columns') or the entire table (normalize='all').
Handling Multiple Variables
pd.crosstab() can handle more than two variables. Let’s add another column to our DataFrame:
data['AgeGroup'] = ['Young', 'Old', 'Young', 'Old', 'Young', 'Old', 'Young', 'Old']
df = pd.DataFrame(data)
print(df)
crosstab_multiple = pd.crosstab(df['Gender'], [df['Purchase'], df['AgeGroup']])
print(crosstab_multiple)This creates a cross tabulation showing the relationship between gender and the combination of purchase and age group.
Using Aggfunc for More Complex Summaries
Instead of just counts, you can use the aggfunc parameter to calculate other statistics:
import numpy as np
crosstab_mean = pd.crosstab(df['Gender'], df['Purchase'], values=df['AgeGroup'], aggfunc=np.mean)
print(crosstab_mean)This shows average age group by gender and purchase status. Remember that values must be specified for this to work properly. You can use many other aggregate functions from NumPy or other libraries as appropriate.
Customizing the Output
You can add labels for better readability:
crosstab_labeled = pd.crosstab(df['Gender'], df['Purchase'], rownames=['Gender'], colnames=['Purchase'])
print(crosstab_labeled)This customizes the row and column names in the final cross-tabulation output. Experiment with different options to tailor your visualization to the needs of your data analysis.