Cross Tabulation in Pandas – Mastering Python

Understanding Cross Tabulation

A cross tabulation summarizes the frequency distribution of two or more categorical variables. It shows how many observations fall into each combination of categories. For instance, you might use it to analyze the relationship between gender and purchase preference, or between age group and voting behavior.

Basic Cross Tabulation with `pd.crosstab()`

Let’s start with a simple example. We’ll create a sample DataFrame:

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
        'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']}
df = pd.DataFrame(data)
print(df)

Now, let’s generate the cross tabulation:

crosstab = pd.crosstab(df['Gender'], df['Purchase'])
print(crosstab)

This prints a table with one row per gender and one column per purchase value. For this data, 2 females did not purchase and 2 did, while 1 male did not purchase and 3 did.
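
To make the counting concrete, here is a minimal sketch that reads one cell from the table and rebuilds the same counts by hand with `groupby`. The variable names `male_yes` and `by_hand` are chosen for this sketch only.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})

crosstab = pd.crosstab(df['Gender'], df['Purchase'])

# Look up a single cell by row label and column label.
male_yes = crosstab.loc['Male', 'Yes']
print(male_yes)

# The same counts built by hand: count the rows for each (Gender, Purchase)
# pair, then move the Purchase level into the columns, filling absent pairs with 0.
by_hand = df.groupby(['Gender', 'Purchase']).size().unstack(fill_value=0)
print(by_hand)
```

For a plain frequency table, `pd.crosstab()` is the shorter way to write this; the `groupby` form is shown only to illustrate what is being counted.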

Adding Margins and Normalization

`pd.crosstab()` offers several options to enhance the output. Setting `margins=True` adds row and column totals, labeled `All` by default:

crosstab_margins = pd.crosstab(df['Gender'], df['Purchase'], margins=True)
print(crosstab_margins)
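
The label used for the totals can be changed with the `margins_name` parameter. A minimal sketch, assuming the same sample data; the label `'Total'` and the variable name `crosstab_total` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# margins=True appends a totals row and a totals column;
# margins_name replaces the default label 'All'.
crosstab_total = pd.crosstab(
    df['Gender'], df['Purchase'], margins=True, margins_name='Total'
)
print(crosstab_total)

# The bottom-right cell holds the total number of observations.
grand_total = crosstab_total.loc['Total', 'Total']
print(grand_total)
```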

You can normalize the table to display proportions instead of counts. For example, to normalize by rows:

crosstab_normalized = pd.crosstab(df['Gender'], df['Purchase'], normalize='index')
print(crosstab_normalized)

Each row now sums to 1, so the table shows the share of each gender that did and did not purchase. You can also normalize by columns (`normalize='columns'`) or over the entire table (`normalize='all'`).
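
The other two modes can be sketched the same way. The example below assumes the same sample data and checks which direction sums to 1 in each case; the variable names `by_column` and `overall` are chosen for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# Each column sums to 1: of all 'Yes' purchases, what share came from each gender?
by_column = pd.crosstab(df['Gender'], df['Purchase'], normalize='columns')
print(by_column)
print(by_column.sum(axis=0))

# All cells together sum to 1: each cell is a share of the whole sample.
overall = pd.crosstab(df['Gender'], df['Purchase'], normalize='all')
print(overall)
print(overall.to_numpy().sum())
```

Which mode to use depends on the question: row normalization compares purchase behavior within each gender, while column normalization describes the gender mix within each purchase outcome.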

Handling Multiple Variables

`pd.crosstab()` can handle more than two variables: pass a list of Series for the rows or for the columns. Let’s add another column to our DataFrame and use it as a second column variable:

data['AgeGroup'] = ['Young', 'Old', 'Young', 'Old', 'Young', 'Old', 'Young', 'Old']
df = pd.DataFrame(data)
print(df)

crosstab_multiple = pd.crosstab(df['Gender'], [df['Purchase'], df['AgeGroup']])
print(crosstab_multiple)

This creates a cross tabulation of gender against every combination of purchase and age group. The columns form a two-level hierarchy (a MultiIndex), with Purchase on the outer level and AgeGroup on the inner level.
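
A short sketch of how the resulting hierarchical table might be used, assuming the same sample data. It selects one outer column level and shows the alternative of nesting the extra variable in the rows; the names `buyers` and `nested_rows` are chosen for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'AgeGroup': ['Young', 'Old', 'Young', 'Old', 'Young', 'Old', 'Young', 'Old'],
})

crosstab_multiple = pd.crosstab(df['Gender'], [df['Purchase'], df['AgeGroup']])

# The column levels are named after the Series that produced them.
print(crosstab_multiple.columns.names)

# Selecting the outer level 'Yes' leaves the AgeGroup breakdown for buyers only.
buyers = crosstab_multiple['Yes']
print(buyers)

# Passing the list as the first argument nests the row index instead.
nested_rows = pd.crosstab([df['Gender'], df['AgeGroup']], df['Purchase'])
print(nested_rows)
```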

Using `aggfunc` for More Complex Summaries

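Instead of just counts, you can pass a numeric column as `values` and use the `aggfunc` parameter to calculate other statistics. `AgeGroup` holds strings, so it cannot be averaged; the example below first adds a numeric `Age` column with made-up values:

# Illustrative ages (not part of the original data), consistent with AgeGroup.
df['Age'] = [23, 54, 31, 47, 28, 61, 19, 58]

crosstab_mean = pd.crosstab(df['Gender'], df['Purchase'], values=df['Age'], aggfunc='mean')
print(crosstab_mean)

This shows the average age for each combination of gender and purchase status. `values` and `aggfunc` must be specified together; passing one without the other raises an error. Other aggregations such as `'sum'`, `'median'`, or `'max'` work the same way, as does any callable that reduces a group of values to a single number.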
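
As a further illustration, here is a sketch that sums a numeric column per cell and compares the result with `DataFrame.pivot_table`. The `Amount` column and its values are hypothetical, invented for this example, as are the variable names.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Purchase': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    # Hypothetical spend per observation, invented for this sketch.
    'Amount': [20.0, 0.0, 35.0, 50.0, 0.0, 15.0, 0.0, 40.0],
})

# Sum the Amount values that fall into each (Gender, Purchase) cell.
total_spend = pd.crosstab(
    df['Gender'], df['Purchase'], values=df['Amount'], aggfunc='sum'
)
print(total_spend)

# pivot_table expresses the same aggregation using column names
# instead of Series arguments.
same_via_pivot = df.pivot_table(
    index='Gender', columns='Purchase', values='Amount', aggfunc='sum'
)
print(same_via_pivot)
```

When the data already lives in one DataFrame, `pivot_table` is often the more natural spelling; `pd.crosstab()` is convenient when the inputs are separate Series or arrays.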

Customizing the Output

You can set the names of the row and column axes with `rownames` and `colnames`:

crosstab_labeled = pd.crosstab(df['Gender'], df['Purchase'], rownames=['Gender'], colnames=['Purchase'])
print(crosstab_labeled)

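This sets the names of the row and column axes in the output. Because the inputs here are named Series, pandas already uses `Gender` and `Purchase` by default, so `rownames` and `colnames` matter most when you pass unnamed arrays or want different labels. Experiment with these options to tailor the table to the needs of your analysis.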
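
The effect of these parameters is easier to see with unnamed inputs. A minimal sketch, assuming the same values held in plain NumPy arrays rather than DataFrame columns; the variable names are chosen for this sketch.

```python
import numpy as np
import pandas as pd

gender = np.array(['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female'])
purchase = np.array(['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# Arrays carry no name, so pandas falls back to generic axis names.
unlabeled = pd.crosstab(gender, purchase)
print(unlabeled.index.name, unlabeled.columns.name)

# rownames and colnames supply meaningful names instead.
labeled = pd.crosstab(gender, purchase, rownames=['Gender'], colnames=['Purchase'])
print(labeled)
```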