Pandas Explode Method – Mastering Python

Understanding the Problem: Lists within DataFrames

Imagine you have a DataFrame where a column contains lists of values. For example, let’s say you’re tracking purchases, and each row represents a customer with a list of items they bought:

import pandas as pd

data = {'customer': ['A', 'B', 'C'],
        'items': [['apple', 'banana'], ['orange'], ['grape', 'apple', 'kiwi']]}
df = pd.DataFrame(data)
print(df)

This will output:

  customer            items
0        A  [apple, banana]
1        B         [orange]
2        C  [grape, apple, kiwi]

Analyzing this data directly is difficult. You can’t easily count the occurrences of each item or perform other analyses requiring individual item level data. This is where explode() comes in handy.

Exploding the Lists: The `explode()` Method

The explode() method elegantly transforms this structure. It takes a column containing lists or arrays as input and expands it, creating a new row for each element within the lists:

exploded_df = df.explode('items')
print(exploded_df)

This produces:

  customer   items
0        A   apple
0        A  banana
1        B  orange
2        C   grape
2        C   apple
2        C    kiwi

Notice how each item in the ‘items’ column now occupies its own row, preserving the corresponding ‘customer’ information.

Handling Different Data Types

explode() isn’t limited to lists. It works equally well with other iterable types like NumPy arrays:

import numpy as np

data2 = {'customer': ['D', 'E'],
         'items': [np.array(['pear', 'mango']), np.array(['strawberry'])]}
df2 = pd.DataFrame(data2)
exploded_df2 = df2.explode('items')
print(exploded_df2)

This yields a similar result, demonstrating the flexibility of explode().

Exploding Multiple Columns

While the above examples focus on a single column, you can explode() multiple columns simultaneously by passing a list of column names:

data3 = {'customer': ['F', 'G'],
         'items': [['a', 'b'], ['c', 'd']],
         'prices': [[1,2], [3,4]]}

df3 = pd.DataFrame(data3)
exploded_df3 = df3.explode(['items', 'prices'])
print(exploded_df3)

This expands both items and prices columns creating new rows for each combination of elements within the lists. Note that both columns must have the same list lengths within each row for this to work correctly. Otherwise, you’ll encounter an error.

Handling Non-list Values

If a cell contains a non-list/non-array value, it will be treated as a single element during the explosion. For example:

data4 = {'customer': ['H', 'I'],
         'items': [['x', 'y'], 'z']}
df4 = pd.DataFrame(data4)
exploded_df4 = df4.explode('items')
print(exploded_df4)

This example shows that the single value ‘z’ is treated as a list containing a single element in the explode() method.

Ignoring Errors with `ignore_index`

By default, the index is preserved during the explode operation. To reset the index, use ignore_index=True.

exploded_df5 = df.explode('items', ignore_index=True)
print(exploded_df5)

This will produce a dataframe with a sequentially re-indexed output.