Understanding the Problem: Lists within DataFrames
Imagine you have a DataFrame where a column contains lists of values. For example, let’s say you’re tracking purchases, and each row represents a customer with a list of items they bought:
import pandas as pd
data = {'customer': ['A', 'B', 'C'],
        'items': [['apple', 'banana'], ['orange'], ['grape', 'apple', 'kiwi']]}
df = pd.DataFrame(data)
print(df)
This will output:
  customer                 items
0        A       [apple, banana]
1        B              [orange]
2        C  [grape, apple, kiwi]
Analyzing this data directly is difficult: you can’t easily count how often each item occurs or run any other analysis that needs one row per item. This is where explode() comes in handy.
Exploding the Lists: The explode() Method
The explode() method elegantly transforms this structure. It takes a column containing lists or arrays and returns a new DataFrame in which that column is expanded, with one row for each element of the lists:
exploded_df = df.explode('items')
print(exploded_df)
This produces:
  customer   items
0        A   apple
0        A  banana
1        B  orange
2        C   grape
2        C   apple
2        C    kiwi
Notice how each item in the ‘items’ column now occupies its own row, preserving the corresponding ‘customer’ information.
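With one item per row, the counting task mentioned earlier becomes straightforward. The following is a minimal sketch that reuses the sample data from above; the name item_counts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['A', 'B', 'C'],
    'items': [['apple', 'banana'], ['orange'], ['grape', 'apple', 'kiwi']],
})

# One row per purchased item, then count how often each item appears
exploded_df = df.explode('items')
item_counts = exploded_df['items'].value_counts()
print(item_counts)

# The original index labels repeat, so they still identify the source row
print(exploded_df.index.tolist())  # [0, 0, 1, 2, 2, 2]
```

Because the index labels are repeated rather than renumbered, exploded_df.loc[2] still returns every item bought by customer C.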
Handling Different Data Types
explode() isn’t limited to lists. It works equally well with other list-like types such as tuples and NumPy arrays:
import numpy as np
data2 = {'customer': ['D', 'E'],
         'items': [np.array(['pear', 'mango']), np.array(['strawberry'])]}
df2 = pd.DataFrame(data2)
exploded_df2 = df2.explode('items')
print(exploded_df2)
This yields the same one-row-per-element layout as before, demonstrating the flexibility of explode().
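One nuance is worth checking for yourself: a string is iterable in Python, but explode() does not split it into characters. The sketch below uses made-up sample data (customers J and K) to show a tuple being expanded while a plain string is left whole.

```python
import pandas as pd

df_mixed = pd.DataFrame({
    'customer': ['J', 'K'],
    'items': [('fig', 'plum'), 'lime'],  # a tuple and a plain string
})

exploded_mixed = df_mixed.explode('items')
print(exploded_mixed)

# The tuple is split into two rows; the string 'lime' stays in one piece
print(exploded_mixed['items'].tolist())  # ['fig', 'plum', 'lime']
```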
Exploding Multiple Columns
While the above examples focus on a single column, you can explode() multiple columns simultaneously by passing a list of column names (this requires pandas 1.3.0 or later):
data3 = {'customer': ['F', 'G'],
         'items': [['a', 'b'], ['c', 'd']],
         'prices': [[1, 2], [3, 4]]}
df3 = pd.DataFrame(data3)
exploded_df3 = df3.explode(['items', 'prices'])
print(exploded_df3)
This expands both the items and prices columns together, pairing elements by position: the first item goes with the first price, the second with the second, and so on. It does not produce every combination of items and prices. Note that, within each row, the lists in both columns must have the same length. Otherwise, pandas raises a ValueError.
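To make the positional pairing and the length requirement concrete, here is a small sketch. It assumes pandas 1.3.0 or later; the bad DataFrame and the mismatch_error variable are hypothetical names used only for this illustration.

```python
import pandas as pd

df3 = pd.DataFrame({
    'customer': ['F', 'G'],
    'items': [['a', 'b'], ['c', 'd']],
    'prices': [[1, 2], [3, 4]],
})

# Elements are paired by position: ('a', 1), ('b', 2), ('c', 3), ('d', 4)
paired = df3.explode(['items', 'prices'])
print(list(zip(paired['items'], paired['prices'])))

# Lists of different lengths in the same row are rejected
bad = pd.DataFrame({'items': [['a', 'b']], 'prices': [[1]]})
try:
    bad.explode(['items', 'prices'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print('ValueError:', exc)
```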
Handling Non-list Values
If a cell contains a non-list/non-array value, it will be treated as a single element during the explosion. For example:
data4 = {'customer': ['H', 'I'],
         'items': [['x', 'y'], 'z']}
df4 = pd.DataFrame(data4)
exploded_df4 = df4.explode('items')
print(exploded_df4)
This example shows that explode() leaves the single value ‘z’ unchanged in a row of its own, just as if it were a list containing a single element.
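Two related edge cases are empty lists and missing values. The sketch below, using made-up customers L, M and N, suggests that neither causes a row to disappear: each ends up as a single row whose item is missing.

```python
import pandas as pd

df_edge = pd.DataFrame({
    'customer': ['L', 'M', 'N'],
    'items': [['x', 'y'], [], None],  # a list, an empty list, a missing value
})

exploded_edge = df_edge.explode('items')
print(exploded_edge)

# No rows are dropped: customers M and N each keep one row with a missing item
print(exploded_edge['customer'].tolist())      # ['L', 'L', 'M', 'N']
print(exploded_edge['items'].isna().tolist())  # [False, False, True, True]
```

If those placeholder rows are unwanted, chaining .dropna(subset=['items']) after the explode removes them.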
Resetting the Index with ignore_index
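By default, the index is preserved during the explode operation, so every row created from the same list shares the original index label. To replace these repeated labels with a fresh index, pass ignore_index=True (available since pandas 1.1.0).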
exploded_df5 = df.explode('items', ignore_index=True)
print(exploded_df5)
The resulting DataFrame is indexed sequentially from 0 to 5 instead of repeating the original labels.
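On pandas versions that predate ignore_index, chaining reset_index(drop=True) after the explode should give the same result. A minimal sketch comparing the two approaches on the sample data (fresh and manual are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['A', 'B', 'C'],
    'items': [['apple', 'banana'], ['orange'], ['grape', 'apple', 'kiwi']],
})

# Option 1: let explode() discard the original index labels
fresh = df.explode('items', ignore_index=True)

# Option 2: explode first, then renumber the rows
manual = df.explode('items').reset_index(drop=True)

print(fresh.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(fresh['items'].tolist() == manual['items'].tolist())  # True
```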