Categoricals in Pandas

pandas
Published

October 3, 2024

Pandas, the powerful Python data manipulation library, offers a fantastic tool for handling categorical data efficiently: the Categorical data type. Categorical data represents values that can belong to a finite set of categories or levels. Understanding and utilizing Pandas’ Categorical type significantly improves memory usage and speeds up various operations, especially when dealing with datasets containing many repeated categorical values.

Why Use Pandas Categoricals?

Before diving into the specifics, let’s highlight the key advantages of using Pandas Categorical data:

  • Memory Efficiency: Categoricals store data more compactly than object dtype columns. They represent each unique category with a numerical code, reducing memory consumption dramatically, especially when dealing with large datasets and many repeated categories.

  • Performance Boost: Operations on categorical data are often faster than on object dtype columns. This is because Pandas can optimize calculations based on the predefined categories.

  • Improved Data Integrity: Using categoricals helps ensure data consistency by explicitly defining the valid categories. This prevents typos and inconsistencies that might creep in when working with textual representations.

Creating Categorical Data in Pandas

There are several ways to create categorical data in Pandas:

1. Using the pd.Categorical constructor:

import pandas as pd

data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']

categories = pd.Categorical(data)
print(categories)
print(categories.categories) # access the unique categories

series = pd.Series(categories)
print(series)

#DataFrame with Categorical column
df = pd.DataFrame({'fruit': categories})
print(df)

2. Specifying categories during creation:

You can explicitly define the order of categories:

ordered_categories = pd.Categorical(data, categories=['banana', 'apple', 'orange'], ordered=True)
print(ordered_categories)
#ordered=True makes the category ordered, enabling comparison operations.

3. Converting existing columns to categorical:

df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']})
df['fruit'] = pd.Categorical(df['fruit'])
print(df)

Working with Categorical Data

Once you have a categorical column, you can perform various operations:

1. Accessing Categories and Codes:

print(df['fruit'].cat.categories) # Access the categories
print(df['fruit'].cat.codes) # Access the numerical codes representing the categories.

2. Renaming Categories:

df['fruit'] = df['fruit'].cat.rename_categories({'apple': 'Apple', 'banana': 'Banana'})
print(df)

3. Adding Categories:

df['fruit'] = df['fruit'].cat.add_categories(['grape'])
print(df)

4. Removing Categories:

df['fruit'] = df['fruit'].cat.remove_categories(['grape'])
print(df)

5. Setting Categories:

df['fruit'] = df['fruit'].cat.set_categories(['Banana', 'Apple', 'Orange'])
print(df)

6. Sorting by Category:

Because we set ordered=True earlier, we can easily sort:

ordered_categories = pd.Categorical(data, categories=['banana', 'apple', 'orange'], ordered=True)
series = pd.Series(ordered_categories)
print(series.sort_values())

These examples demonstrate the versatility and efficiency of Pandas’ Categorical type. By leveraging this feature, you can significantly enhance the performance and manageability of your data analysis workflows.