Efficient Data Selection

pandas
Published

December 9, 2024

NumPy Arrays: Boolean Indexing and Fancy Indexing

NumPy arrays provide several ways to select data efficiently. Boolean indexing and fancy indexing are particularly powerful.

Boolean Indexing: This technique uses boolean arrays to select elements. The boolean array’s shape must match the array you’re indexing.

import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60])
bool_arr = np.array([True, False, True, False, True, False])

selected_elements = arr[bool_arr]  # Selects elements where bool_arr is True
print(selected_elements)  # Output: [10 30 50]

selected_elements = arr[(arr > 20) & (arr < 50)] # Elements greater than 20 and less than 50
print(selected_elements) # Output: [30 40]

Fancy Indexing: This uses integer arrays to select elements at specific indices.

arr = np.array([10, 20, 30, 40, 50, 60])
indices = np.array([0, 2, 4])
selected_elements = arr[indices] #Selects elements at indices 0, 2, and 4
print(selected_elements) # Output: [10 30 50]

row_indices = np.array([0,1])
col_indices = np.array([1,2])
two_d_arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
selected_sub_array = two_d_arr[row_indices[:,None], col_indices]
print(selected_sub_array) # Output: [[2 3] [5 6]]

Pandas DataFrames: loc, iloc, and Boolean Indexing

Pandas DataFrames, built on top of NumPy, offer even more sophisticated data selection methods. loc is label-based indexing, iloc is integer-based indexing, and boolean indexing works similarly to NumPy.

loc (Label-based indexing):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7,8,9]}, index=['x','y','z'])

#Selecting a single column
column_a = df.loc[:,'A']
print(column_a)

#Selecting multiple columns
columns_a_b = df.loc[:,['A','B']]
print(columns_a_b)

#Selecting rows and columns
selected_data = df.loc[['x','z'],['B','C']]
print(selected_data)

iloc (Integer-based indexing):

#Selecting a single element
element = df.iloc[1,0]
print(element) # Output: 2

#Selecting multiple rows and columns
selected_data = df.iloc[[0,2],[1,2]]
print(selected_data)

Boolean Indexing with Pandas:

#Select rows where column 'A' is greater than 1
selected_rows = df[df['A'] > 1]
print(selected_rows)

#Combine multiple conditions
selected_rows = df[(df['A'] > 1) & (df['B'] < 6)]
print(selected_rows)

Performance Considerations

For large datasets, using optimized methods like boolean indexing and vectorized operations is significantly faster than iterating through rows or columns. Avoid explicit loops whenever possible. Pandas’ built-in functions often use vectorized operations for efficiency. Consider using optimized data structures like sparse matrices if your data has many missing values.