NumPy Lexsort

numpy
Published

December 2, 2024

NumPy’s lexsort function offers a powerful way to sort arrays based on multiple sorting keys. Unlike the standard sort function which only sorts along a single axis, lexsort allows for sophisticated multi-level sorting, mirroring the behavior of lexicographical sorting used in dictionaries or text processing. This post will look into the intricacies of lexsort, providing clear explanations and practical examples.

Understanding Lexicographical Sorting

Imagine you have a list of names and ages, and you want to sort them first by age (ascending) and then by name (alphabetical). A simple sort won’t accomplish this. Lexicographical sorting, which lexsort emulates, solves this problem by prioritizing keys. In our example, age is the primary key, and name is the secondary key.

lexsort in Action: A Simple Example

Let’s start with a straightforward example. We’ll sort an array of names and ages:

import numpy as np

names = np.array(['Bob', 'Alice', 'Charlie', 'Bob'])
ages = np.array([30, 25, 30, 28])

ind = np.lexsort((names, ages))

sorted_ages = ages[ind]
sorted_names = names[ind]

print("Sorted Ages:", sorted_ages)
print("Sorted Names:", sorted_names)

This code first defines two NumPy arrays, names and ages. np.lexsort((names, ages)) performs the lexicographical sort. Note that lexsort takes a tuple of arrays as input. The order in the tuple dictates the sorting priority (rightmost array has the highest priority). The result ind is an array of indices that rearrange the original arrays to achieve the sorted order. We then use this index array to obtain the sorted ages and sorted_names.

Handling Multiple Keys

lexsort scales seamlessly to more than two keys. Let’s add a city to our example:

import numpy as np

names = np.array(['Bob', 'Alice', 'Charlie', 'Bob'])
ages = np.array([30, 25, 30, 28])
cities = np.array(['New York', 'London', 'Paris', 'Tokyo'])

ind = np.lexsort((names, ages, cities))  # City is the primary key

sorted_cities = cities[ind]
sorted_ages = ages[ind]
sorted_names = names[ind]

print("Sorted Cities:", sorted_cities)
print("Sorted Ages:", sorted_ages)
print("Sorted Names:", sorted_names)

Here, the city is the primary sorting key, followed by age and then name.

Descending Order

To sort in descending order for a specific key, simply reverse the array before passing it to lexsort. Let’s sort by age in descending order, then name in ascending order:

import numpy as np

names = np.array(['Bob', 'Alice', 'Charlie', 'Bob'])
ages = np.array([30, 25, 30, 28])

ind = np.lexsort((names, ages[::-1])) #ages is reversed for descending order

sorted_ages = ages[ind]
sorted_names = names[ind]

print("Sorted Ages (Descending, then Name Ascending):", sorted_ages)
print("Sorted Names:", sorted_names)

By reversing ages using [::-1], we achieve the desired descending sort for age.

Beyond Simple Arrays: Using Structured Arrays

lexsort works elegantly with NumPy’s structured arrays, providing a more structured and readable approach for multi-key sorting:

import numpy as np

data = np.array([('Bob', 30, 'New York'), ('Alice', 25, 'London'), ('Charlie', 30, 'Paris'), ('Bob', 28, 'Tokyo')],
               dtype=[('name', 'U10'), ('age', int), ('city', 'U20')])

ind = np.lexsort((data['name'], data['age'], data['city']))
sorted_data = data[ind]
print(sorted_data)

This example utilizes a structured array, making the code clearer and more maintainable when dealing with multiple data fields.

Advanced Uses and Considerations

lexsort is a powerful tool for complex sorting scenarios, but understanding its behavior with respect to data types and potential sorting instabilities is crucial for reliable results. Further exploration into these aspects will enhance your proficiency with this invaluable NumPy function.