Hierarchical Indexing

pandas
Published

June 18, 2023

Understanding Hierarchical Indexing

Imagine you have sales data across different regions and product categories. A simple index might only use a single column (e.g., product ID), making it cumbersome to filter by region and category simultaneously. Hierarchical indexing solves this by creating an index with multiple levels. For instance, you could have a hierarchical index with “Region” as the first level and “Category” as the second. This allows you to easily access data based on specific region-category combinations.

Creating a MultiIndex

There are several ways to create a MultiIndex in Pandas. Let’s start with the most common approach: using the MultiIndex.from_product() method.

import pandas as pd
import numpy as np

regions = ['North', 'South', 'East', 'West']
categories = ['Electronics', 'Clothing', 'Food']

index = pd.MultiIndex.from_product([regions, categories], names=['Region', 'Category'])

data = np.random.randint(100, 1000, size=(len(index),))

df = pd.DataFrame({'Sales': data}, index=index)
print(df)

This code generates a DataFrame where each row is uniquely identified by a region and category.

Accessing Data with Hierarchical Indexing

The power of MultiIndex becomes apparent when accessing and manipulating data. You can select data using various methods:

1. Using .loc with multiple labels:

print(df.loc[('North', 'Electronics')])

print(df.loc['South'])

2. Using .xs for cross-section selection:

The .xs() method allows you to select a cross-section of the data based on a specific level of the index.

print(df.xs('Clothing', level='Category'))

3. Partial string indexing (using slice):

print(df.loc[slice('North', 'South')] )

Creating a MultiIndex from a DataFrame

You can also create a MultiIndex directly from an existing DataFrame’s columns.

data = {'Region': ['North']*3 + ['South']*3,
        'Category': ['Electronics', 'Clothing', 'Food']*2,
        'Sales': np.random.randint(100, 1000, size=6)}
df2 = pd.DataFrame(data)
df2 = df2.set_index(['Region', 'Category'])
print(df2)

This demonstrates how you can easily transform existing columns into a MultiIndex, improving data organization and query efficiency.

Reshaping with Stack and Unstack

The stack() and unstack() methods provide powerful tools for reshaping data with MultiIndex. stack() pivots the innermost level of the columns into rows, while unstack() does the opposite. Experimenting with these methods will reveal their utility in transforming data structures.

stacked_df = df2.stack()
print(stacked_df)

unstacked_df = stacked_df.unstack()
print(unstacked_df)

These examples demonstrate how stack() and unstack() allow you to manipulate your data’s structure to better suit your analytical needs. Exploring these methods further will allow you to efficiently reshape your data.