Understanding the digitize Function
The numpy.digitize function assigns each value in an input array to a bin based on a provided sequence of bin edges. Think of it as placing data points into pre-defined intervals or categories. The function returns an array of indices, where each index corresponds to the bin that each value belongs to.
Key Features:
- Input: Takes two main arguments: the input array (x) containing the values to be binned, and the bin edges (bins), a one-dimensional monotonic sequence defining the boundaries of the bins.
- Output: Returns an array of the same shape as the input array, containing the index of the bin each value falls into. For increasing bins, index 0 means the value lies below the first edge and index len(bins) means it lies at or beyond the last edge; values between two edges get indices 1 through len(bins) - 1.
- Edge handling: By default (right=False), each bin includes its left edge and excludes its right edge, so index i means bins[i-1] <= x < bins[i]. Passing right=True makes the bins right-inclusive instead: bins[i-1] < x <= bins[i].
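One way to check the index rules above is to compare digitize with np.searchsorted, which the NumPy documentation describes as the underlying operation for increasing bins. The following is a minimal sketch; the arrays are made up for illustration and include values on the edges and outside the bin range.

```python
import numpy as np

bins = np.array([0, 10, 20])          # three increasing edges (illustrative values)
x = np.array([-5, 0, 5, 10, 20, 25])  # includes edge and out-of-range values

default_idx = np.digitize(x, bins)            # right=False: bins[i-1] <= x < bins[i]
right_idx = np.digitize(x, bins, right=True)  # right=True:  bins[i-1] < x <= bins[i]

print(default_idx)  # [0 1 1 2 3 3]
print(right_idx)    # [0 0 1 1 2 3]

# For increasing bins, digitize matches searchsorted with the opposite "side"
same_default = np.array_equal(default_idx, np.searchsorted(bins, x, side="right"))
same_right = np.array_equal(right_idx, np.searchsorted(bins, x, side="left"))
print(same_default, same_right)  # True True
```

Index 0 and index len(bins) (3 here) mark values outside the range covered by the edges, so an output can contain up to len(bins) + 1 distinct indices.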
Code Examples: Putting digitize to Work
Let’s explore digitize through several examples:
Example 1: Basic Binning
Let’s say we have some exam scores and want to categorize them into letter grades:
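import numpy as np

scores = np.array([60, 75, 82, 90, 55, 88, 70, 95, 85])
bins = np.array([60, 70, 80, 90, 100])  # Lower edges for D, C, B, A; anything below 60 is F
grade_indices = np.digitize(scores, bins)
print(grade_indices)  # Output: [1 2 3 4 0 3 2 4 3]
This output shows that the first score (60) gets index 1 (D, from 60 up to but not including 70), the second (75) gets index 2 (C), and the score of 55 gets index 0 (F) because it lies below the first edge. A score of exactly 100 would get index 5, since the default bins exclude their right edge.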
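The indices become more useful once they are mapped to labels. The sketch below indexes a label array with the result of digitize. The labels array is an illustrative addition, and the top edge of 100 is dropped here (an assumption of this sketch) so that every score of 90 or above, including 100, maps to "A".

```python
import numpy as np

scores = np.array([60, 75, 82, 90, 55, 88, 70, 95, 85])
bins = np.array([60, 70, 80, 90])             # lower cutoffs for D, C, B, A
labels = np.array(["F", "D", "C", "B", "A"])  # one label per possible index 0..4

# digitize returns an index in 0..len(bins); use it to look up the label
grades = labels[np.digitize(scores, bins)]
print(grades)  # ['D' 'C' 'B' 'A' 'F' 'B' 'C' 'A' 'B']
```

With four edges there are five possible indices, so the label array needs exactly len(bins) + 1 entries.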
Example 2: Customizing Bin Edges and Right-Inclusiveness
We can create more granular bins and control whether the bins are right-inclusive or not:
import numpy as np

data = np.array([1.2, 2.5, 3.7, 4.1, 5.9, 6.0])
bins = np.array([1, 3, 5, 7])
indices_right = np.digitize(data, bins, right=True)  # bins[i-1] < x <= bins[i]
print(f"Right-inclusive: {indices_right}")  # Output: Right-inclusive: [1 1 2 2 3 3]
indices_left = np.digitize(data, bins, right=False)  # Default: bins[i-1] <= x < bins[i]
print(f"Left-inclusive: {indices_left}")  # Output: Left-inclusive: [1 1 2 2 3 3]
Both calls agree here because no value in data lies exactly on a bin edge. The right argument only changes the result for values equal to an edge: 3.0, for instance, gets index 2 with right=False and index 1 with right=True.
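To see the effect of right in isolation, it helps to digitize values that sit exactly on the edges. This is a small sketch with made-up input, reusing the edges 1, 3, 5, 7 from the example.

```python
import numpy as np

bins = np.array([1, 3, 5, 7])
edge_values = np.array([1.0, 3.0, 5.0, 7.0])  # every value sits exactly on an edge

left_closed = np.digitize(edge_values, bins)               # default, right=False
right_closed = np.digitize(edge_values, bins, right=True)

print(left_closed)   # [1 2 3 4]
print(right_closed)  # [0 1 2 3]
```

Every edge value shifts down by one index when right=True, because each edge then belongs to the bin on its left rather than the bin on its right.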
Example 3: Handling Missing Values (NaN)
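digitize has no special handling for NaN values:
import numpy as np

data = np.array([1, 2, np.nan, 4, 5])
bins = np.array([1, 3, 5])
indices = np.digitize(data, bins)
print(indices)  # Output: [1 1 3 2 3]
Notice that NaN is assigned index 3, which is len(bins), the same index given to values at or beyond the last edge (such as 5 here). NaN sorts after every number in the underlying search, so the index alone cannot distinguish it from an out-of-range value; test for it with np.isnan if it needs separate treatment.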
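If missing values need their own category, one option is to overwrite their indices after binning. The sketch below uses np.isnan and np.where; the sentinel value -1 is an arbitrary choice for this illustration, not something digitize produces.

```python
import numpy as np

data = np.array([1, 2, np.nan, 4, 5])
bins = np.array([1, 3, 5])

indices = np.digitize(data, bins)

# Replace the index of every NaN entry with a sentinel (-1 is an arbitrary choice)
masked = np.where(np.isnan(data), -1, indices)
print(masked)  # [ 1  1 -1  2  3]
```

Because -1 can never be returned by digitize, downstream code can tell missing values apart from values beyond the last edge.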
Example 4: Creating Histograms
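digitize can supply the per-bin counts behind a histogram:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
bins = np.linspace(-3, 3, 7)  # 7 edges define 6 bins
bin_indices = np.digitize(data, bins)
# Index 0 holds values below -3 and index 7 holds values at or above 3; drop both
counts = np.bincount(bin_indices, minlength=len(bins) + 1)[1:-1]
print(counts)
plt.hist(data, bins=bins)
plt.show()
This code uses digitize and np.bincount to count the data points in each of the six bins, then draws a histogram over the same edges. Note that plt.hist computes its own counts from data and bins; the counts array is there to show what digitize contributes.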
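As a sanity check, counts derived from digitize can be compared with np.histogram over the same edges. This is a sketch under two assumptions: a fixed seed is used only to make the run reproducible, and no sample is exactly equal to the last edge (np.histogram includes that edge in its final bin, while the default digitize does not).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
data = rng.standard_normal(1000)
bins = np.linspace(-3, 3, 7)     # 7 edges define 6 bins

idx = np.digitize(data, bins)
# Drop index 0 (below the first edge) and index len(bins) (at or above the last edge)
counts = np.bincount(idx, minlength=len(bins) + 1)[1:-1]

hist_counts, _ = np.histogram(data, bins=bins)

print(counts)
print(np.array_equal(counts, hist_counts))
```

The minlength argument keeps the count array at a fixed length of len(bins) + 1 even when no value falls outside the edges, so the [1:-1] slice always selects the six interior bins.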
Beyond the Basics
The numpy.digitize function is versatile and easily adaptable to a variety of data analysis tasks. By understanding its core functionality and exploring its parameters, you can effectively organize and analyze numerical data in your Python projects.