Understanding pd.cut
The pd.cut
function takes a one-dimensional array-like object (like a Pandas Series) and divides its values into a specified number of bins. These bins can be defined explicitly or automatically generated based on the data’s range. The result is a categorical Series, where each value is assigned to its corresponding bin.
Basic Usage: Equal-Width Bins
Let’s begin with a simple example. We’ll create a Series of numerical data and then bin it into four equal-width bins.
import pandas as pd
import numpy as np
= pd.Series(np.random.rand(10) * 100) # Generate 10 random numbers between 0 and 100
data print("Original Data:\n", data)
= pd.cut(data, bins=4)
bins print("\nData with 4 equal-width bins:\n", bins)
This code first generates a Series of random numbers. pd.cut(data, bins=4)
then divides these numbers into four bins of equal width. The output shows the original data and the categorical bins assigned to each data point.
Customizing Bin Edges
Instead of equal-width bins, you can define your own bin edges. This is useful when you want to create bins with specific ranges or meaningful intervals.
= [0, 25, 50, 75, 100] # Define custom bin edges
custom_bins = pd.cut(data, bins=custom_bins)
bins_custom print("\nData with custom bins:\n", bins_custom)
Here, we specify the exact boundaries of our bins. Note how data points now fall into bins defined by these custom edges.
Labeling Bins
For improved readability, you can assign labels to your bins.
= ['Low', 'Medium', 'High', 'Very High']
labels = pd.cut(data, bins=4, labels=labels)
bins_labeled print("\nData with labeled bins:\n", bins_labeled)
= pd.cut(data, bins=custom_bins, labels=labels)
bins_labeled_custom print("\nData with custom labeled bins:\n", bins_labeled_custom)
This makes the output much more intuitive, especially when presenting results to others.
Handling Right vs. Left Bin Boundaries (right
, include_lowest
)
By default, pd.cut
uses right-closed bins (inclusive of the right edge, exclusive of the left). You can change this behaviour using the right
parameter. The include_lowest
parameter allows inclusion of the lowest bin edge if needed.
= pd.cut(data, bins=4, right=True) # default
bins_right_closed print("\nRight-closed bins:\n",bins_right_closed)
= pd.cut(data, bins=4, right=False) #Left closed
bins_left_closed print("\nLeft-closed bins:\n",bins_left_closed)
#Example with include_lowest
= pd.cut(data, bins=custom_bins, include_lowest=True, right=False)
bins_include_lowest print("\nBins including lowest:\n", bins_include_lowest)
Understanding these parameters is important for accurate binning and analysis.
Frequency Counts with value_counts()
After binning, you can easily get the frequency counts of each bin using the value_counts()
method.
print("\nFrequency counts of labeled bins:\n", bins_labeled.value_counts())
This provides a summary of the distribution of data across the created bins. This is valuable for understanding data distribution and for further statistical analysis.
Handling Out-of-Bounds Values
If your data contains values outside the specified bin edges, pd.cut
will assign them to NaN
by default. You can control this behavior with the duplicates
parameter. This is extremely useful to handle edge cases and avoid unexpected results. Further exploration of this parameter is encouraged for robust data handling.