Pandas, the powerful Python data manipulation library, offers a fantastic tool for handling categorical data efficiently: the Categorical
data type. Categorical data represents values that can belong to a finite set of categories or levels. Understanding and utilizing Pandas’ Categorical
type significantly improves memory usage and speeds up various operations, especially when dealing with datasets containing many repeated categorical values.
Why Use Pandas Categoricals?
Before diving into the specifics, let’s highlight the key advantages of using Pandas Categorical
data:
Memory Efficiency: Categoricals store data more compactly than
object
dtype columns. They represent each unique category with a numerical code, reducing memory consumption dramatically, especially when dealing with large datasets and many repeated categories.Performance Boost: Operations on categorical data are often faster than on
object
dtype columns. This is because Pandas can optimize calculations based on the predefined categories.Improved Data Integrity: Using categoricals helps ensure data consistency by explicitly defining the valid categories. This prevents typos and inconsistencies that might creep in when working with textual representations.
Creating Categorical Data in Pandas
There are several ways to create categorical data in Pandas:
1. Using the pd.Categorical
constructor:
import pandas as pd
= ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
data
= pd.Categorical(data)
categories print(categories)
print(categories.categories) # access the unique categories
= pd.Series(categories)
series print(series)
#DataFrame with Categorical column
= pd.DataFrame({'fruit': categories})
df print(df)
2. Specifying categories during creation:
You can explicitly define the order of categories:
= pd.Categorical(data, categories=['banana', 'apple', 'orange'], ordered=True)
ordered_categories print(ordered_categories)
#ordered=True makes the category ordered, enabling comparison operations.
3. Converting existing columns to categorical:
= pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']})
df 'fruit'] = pd.Categorical(df['fruit'])
df[print(df)
Working with Categorical Data
Once you have a categorical column, you can perform various operations:
1. Accessing Categories and Codes:
print(df['fruit'].cat.categories) # Access the categories
print(df['fruit'].cat.codes) # Access the numerical codes representing the categories.
2. Renaming Categories:
'fruit'] = df['fruit'].cat.rename_categories({'apple': 'Apple', 'banana': 'Banana'})
df[print(df)
3. Adding Categories:
'fruit'] = df['fruit'].cat.add_categories(['grape'])
df[print(df)
4. Removing Categories:
'fruit'] = df['fruit'].cat.remove_categories(['grape'])
df[print(df)
5. Setting Categories:
'fruit'] = df['fruit'].cat.set_categories(['Banana', 'Apple', 'Orange'])
df[print(df)
6. Sorting by Category:
Because we set ordered=True
earlier, we can easily sort:
= pd.Categorical(data, categories=['banana', 'apple', 'orange'], ordered=True)
ordered_categories = pd.Series(ordered_categories)
series print(series.sort_values())
These examples demonstrate the versatility and efficiency of Pandas’ Categorical
type. By leveraging this feature, you can significantly enhance the performance and manageability of your data analysis workflows.