NumPy Digitize Function

numpy
Published

September 17, 2024

Understanding the digitize Function

The numpy.digitize function assigns each value in an input array to a bin based on a provided sequence of bin edges. Think of it as placing data points into pre-defined intervals or categories. The function returns an array of indices, where each index corresponds to the bin that each value belongs to.

Key Features:

  • Input: Takes two main arguments: the input array (x) containing the values to be binned, and the bin edges (bins) defining the boundaries of the bins.
  • Output: Returns an array of the same size as the input array, containing the indices of the bins each value falls into. The indices start at 1, not 0. A value equal to the rightmost bin edge is assigned to the last bin.
  • Right-inclusive: By default, digitize considers bins to be right-inclusive. This means a value equal to a bin edge is assigned to that bin. You can change this behavior using the right argument.

Code Examples: Putting digitize to Work

Let’s explore digitize through several examples:

Example 1: Basic Binning

Let’s say we have some exam scores and want to categorize them into letter grades:

import numpy as np

scores = np.array([60, 75, 82, 90, 55, 88, 70, 95, 85])
bins = np.array([60, 70, 80, 90, 100])  # Bin edges for grades F, D, C, B, A

grade_indices = np.digitize(scores, bins)
print(grade_indices)  # Output: [1 2 3 4 1 4 2 5 3]

This output shows that the first score (60) falls into bin 1 (F), the second (75) into bin 2 (D), and so on.

Example 2: Customizing Bin Edges and Right-Inclusiveness

We can create more granular bins and control whether the bins are right-inclusive or not:

import numpy as np

data = np.array([1.2, 2.5, 3.7, 4.1, 5.9, 6.0])
bins = np.array([1, 3, 5, 7])

indices_right = np.digitize(data, bins)
print(f"Right-inclusive: {indices_right}")  # Output: Right-inclusive: [1 2 2 3 3 4]

indices_left = np.digitize(data, bins, right=False)
print(f"Left-inclusive: {indices_left}")  # Output: Left-inclusive: [1 2 3 3 4 4]

This illustrates how changing right alters bin assignment, especially for values at bin edges.

Example 3: Handling Missing Values (NaN)

digitize gracefully handles NaN values:

import numpy as np

data = np.array([1, 2, np.nan, 4, 5])
bins = np.array([1, 3, 5])

indices = np.digitize(data, bins)
print(indices) # Output: [1 1 0 2 3]

Notice that NaN values are assigned an index of 0.

Example 4: Creating Histograms

digitize is a foundational step in creating histograms:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
bins = np.linspace(-3, 3, 7) # 6 bins

bin_indices = np.digitize(data, bins)
counts = np.bincount(bin_indices)[1:] #Ignore bin 0 (NaN)

plt.hist(data, bins=bins)
plt.show()

This code generates a histogram using digitize to determine the frequency of data points in each bin.

Beyond the Basics

The numpy.digitize function is versatile and easily adaptable to a variety of data analysis tasks. By understanding its core functionality and exploring its parameters, you can effectively organize and analyze numerical data in your Python projects.