Pandas Cut Method

pandas
Published

February 20, 2024

Understanding pd.cut

The pd.cut function takes a one-dimensional array-like object (like a Pandas Series) and divides its values into a specified number of bins. These bins can be defined explicitly or automatically generated based on the data’s range. The result is a categorical Series, where each value is assigned to its corresponding bin.

Basic Usage: Equal-Width Bins

Let’s begin with a simple example. We’ll create a Series of numerical data and then bin it into four equal-width bins.

import pandas as pd
import numpy as np

data = pd.Series(np.random.rand(10) * 100) # Generate 10 random numbers between 0 and 100
print("Original Data:\n", data)

bins = pd.cut(data, bins=4)
print("\nData with 4 equal-width bins:\n", bins)

This code first generates a Series of random numbers. pd.cut(data, bins=4) then divides these numbers into four bins of equal width. The output shows the original data and the categorical bins assigned to each data point.

Customizing Bin Edges

Instead of equal-width bins, you can define your own bin edges. This is useful when you want to create bins with specific ranges or meaningful intervals.

custom_bins = [0, 25, 50, 75, 100]  # Define custom bin edges
bins_custom = pd.cut(data, bins=custom_bins)
print("\nData with custom bins:\n", bins_custom)

Here, we specify the exact boundaries of our bins. Note how data points now fall into bins defined by these custom edges.

Labeling Bins

For improved readability, you can assign labels to your bins.

labels = ['Low', 'Medium', 'High', 'Very High']
bins_labeled = pd.cut(data, bins=4, labels=labels)
print("\nData with labeled bins:\n", bins_labeled)

bins_labeled_custom = pd.cut(data, bins=custom_bins, labels=labels)
print("\nData with custom labeled bins:\n", bins_labeled_custom)

This makes the output much more intuitive, especially when presenting results to others.

Handling Right vs. Left Bin Boundaries (right, include_lowest)

By default, pd.cut uses right-closed bins (inclusive of the right edge, exclusive of the left). You can change this behaviour using the right parameter. The include_lowest parameter allows inclusion of the lowest bin edge if needed.

bins_right_closed = pd.cut(data, bins=4, right=True) # default
print("\nRight-closed bins:\n",bins_right_closed)

bins_left_closed = pd.cut(data, bins=4, right=False) #Left closed
print("\nLeft-closed bins:\n",bins_left_closed)


#Example with include_lowest
bins_include_lowest = pd.cut(data, bins=custom_bins, include_lowest=True, right=False)
print("\nBins including lowest:\n", bins_include_lowest)

Understanding these parameters is important for accurate binning and analysis.

Frequency Counts with value_counts()

After binning, you can easily get the frequency counts of each bin using the value_counts() method.

print("\nFrequency counts of labeled bins:\n", bins_labeled.value_counts())

This provides a summary of the distribution of data across the created bins. This is valuable for understanding data distribution and for further statistical analysis.

Handling Out-of-Bounds Values

If your data contains values outside the specified bin edges, pd.cut will assign them to NaN by default. You can control this behavior with the duplicates parameter. This is extremely useful to handle edge cases and avoid unexpected results. Further exploration of this parameter is encouraged for robust data handling.