NumPy Arrays: Boolean Indexing and Fancy Indexing
NumPy arrays provide several ways to select data efficiently. Boolean indexing and fancy indexing are particularly powerful.
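Boolean Indexing: This technique uses a boolean array as a mask to select elements. The mask’s shape must match the array you’re indexing along the dimensions it covers, and the result is a copy of the selected elements rather than a view.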
import numpy as np
arr = np.array([10, 20, 30, 40, 50, 60])
bool_arr = np.array([True, False, True, False, True, False])
selected_elements = arr[bool_arr]  # Selects elements where bool_arr is True
print(selected_elements)  # Output: [10 30 50]
selected_elements = arr[(arr > 20) & (arr < 50)]  # Elements greater than 20 and less than 50
print(selected_elements)  # Output: [30 40]
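The same masking idea extends to multi-dimensional arrays and to assignment. The following is a minimal sketch on made-up data; the names `grid`, `row_mask`, and `clipped` are assumptions introduced for illustration only.

```python
import numpy as np

# Illustrative data; grid, row_mask and clipped are hypothetical names.
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# A 1D mask with one entry per row selects whole rows.
row_mask = np.array([True, False, True])
picked_rows = grid[row_mask]
print(picked_rows)
# [[1 2 3]
#  [7 8 9]]

# A mask with the full shape of the array returns a flat 1D result.
evens = grid[grid % 2 == 0]
print(evens)  # [2 4 6 8]

# Reading through a mask gives a copy, but assigning through a mask
# modifies the indexed array in place.
clipped = grid.copy()
clipped[clipped > 5] = 5
print(clipped)
# [[1 2 3]
#  [4 5 5]
#  [5 5 5]]
```

Masked assignment is often the simplest way to cap or replace values without writing a loop.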
Fancy Indexing: This uses integer arrays to select elements at specific indices.
arr = np.array([10, 20, 30, 40, 50, 60])
indices = np.array([0, 2, 4])
selected_elements = arr[indices]  # Selects elements at indices 0, 2, and 4
print(selected_elements)  # Output: [10 30 50]
row_indices = np.array([0, 1])
col_indices = np.array([1, 2])
two_d_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
selected_sub_array = two_d_arr[row_indices[:, None], col_indices]
print(selected_sub_array)  # Output: [[2 3] [5 6]]
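The `[:, None]` in the 2D example matters: it turns the row indices into a column so that they broadcast against the column indices. As a hedged illustration, the sketch below contrasts that with plain pairwise indexing and with `np.ix_`, which builds the same cross-product index. The names `pairwise` and `block` are made up for this example.

```python
import numpy as np

two_d_arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_indices = np.array([0, 1])
col_indices = np.array([1, 2])

# Without the extra axis, the index arrays are paired element by element:
# this picks positions (0, 1) and (1, 2), giving a 1D result.
pairwise = two_d_arr[row_indices, col_indices]
print(pairwise)  # [2 6]

# np.ix_ builds broadcastable index arrays for the row/column cross product,
# which selects the same block as row_indices[:, None] with col_indices.
block = two_d_arr[np.ix_(row_indices, col_indices)]
print(block)
# [[2 3]
#  [5 6]]

# Fancy indexing returns a copy, so editing the result leaves the source alone.
block[0, 0] = -1
print(two_d_arr[0, 1])  # 2
```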
Pandas DataFrames: loc, iloc, and Boolean Indexing
Pandas DataFrames, built on top of NumPy, offer even more sophisticated data selection methods. loc is label-based indexing, iloc is integer-based indexing, and boolean indexing works similarly to NumPy.
loc (Label-based indexing):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=['x', 'y', 'z'])
# Selecting a single column
column_a = df.loc[:, 'A']
print(column_a)
# Selecting multiple columns
columns_a_b = df.loc[:, ['A', 'B']]
print(columns_a_b)
# Selecting rows and columns
selected_data = df.loc[['x', 'z'], ['B', 'C']]
print(selected_data)
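Two further loc behaviours are easy to miss: label slices include both endpoints, and loc can be used on the left-hand side of an assignment. The sketch below rebuilds the same small DataFrame so that it runs on its own; `rows_x_to_y`, `row_y`, and `updated` are names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['x', 'y', 'z'])

# Label slices with loc include BOTH endpoints ('y' and 'B' are kept).
rows_x_to_y = df.loc['x':'y', 'A':'B']
print(rows_x_to_y)

# A single row label returns a Series indexed by column name.
row_y = df.loc['y']
print(row_y['C'])  # 8

# loc also supports assignment to a label-addressed cell.
# A copy is used here so the original df stays unchanged.
updated = df.copy()
updated.loc['z', 'A'] = 30
print(updated.loc['z', 'A'])  # 30
```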
iloc (Integer-based indexing):
# Selecting a single element
element = df.iloc[1, 0]
print(element)  # Output: 2
# Selecting multiple rows and columns
selected_data = df.iloc[[0, 2], [1, 2]]
print(selected_data)
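Position-based slicing follows Python's usual rules, which differ from label slicing in one important way: the stop position is excluded. A small sketch, again rebuilding the example DataFrame so that it is self-contained; the variable names are assumptions for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['x', 'y', 'z'])

# Position slices with iloc exclude the stop position, like Python lists.
first_two = df.iloc[0:2, 0:2]
print(first_two)

# Negative positions count from the end.
last_row = df.iloc[-1]
print(last_row.tolist())  # [3, 6, 9]

# To mix a row position with a column label, translate the label
# into a position first with Index.get_loc.
b_position = df.columns.get_loc('B')
print(df.iloc[0, b_position])  # 4
```

Translating labels to positions, as in the last step, keeps a single indexer in charge of the lookup and avoids chaining two selections.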
Boolean Indexing with Pandas:
# Select rows where column 'A' is greater than 1
selected_rows = df[df['A'] > 1]
print(selected_rows)
# Combine multiple conditions
selected_rows = df[(df['A'] > 1) & (df['B'] < 6)]
print(selected_rows)
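Conditions can also be combined with `|` (or), negated with `~`, or expressed as set membership with `isin`, and a mask can be passed to loc to filter rows and pick columns in one step. The following is an illustrative sketch on the same example data; the result names are made up.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                  index=['x', 'y', 'z'])

# | is element-wise OR; each condition needs its own parentheses
# because | and & bind more tightly than comparisons.
either = df[(df['A'] == 1) | (df['C'] == 9)]
print(either.index.tolist())  # ['x', 'z']

# ~ inverts a mask.
not_small = df[~(df['A'] < 2)]
print(not_small.index.tolist())  # ['y', 'z']

# isin tests membership in a collection of values.
in_set = df[df['B'].isin([4, 6])]
print(in_set.index.tolist())  # ['x', 'z']

# loc accepts a boolean mask for rows and labels for columns together.
subset = df.loc[df['A'] > 1, ['B', 'C']]
print(subset)
```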
Performance Considerations
For large datasets, vectorized operations such as boolean indexing are typically much faster than iterating through rows or columns, because the work runs in compiled code rather than in the Python interpreter. Avoid explicit loops whenever possible; most of Pandas’ built-in functions are already vectorized. If your data consists mostly of zeros or missing values, consider a sparse data structure to reduce memory use.
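To make the comparison concrete, here is a rough sketch that counts matching rows with an explicit loop and with a vectorized mask, then compares the memory use of a dense and a sparse Series. The data is randomly generated, the helper names `count_with_loop` and `count_vectorized` are hypothetical, and the pandas sparse dtype is only one possible reading of "sparse" storage. Timings depend on the machine, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 100, size=10_000),
                   'B': rng.integers(0, 100, size=10_000)})


def count_with_loop(frame):
    """Count rows with A > 50 and B < 50 by iterating row by row."""
    count = 0
    for _, row in frame.iterrows():
        if row['A'] > 50 and row['B'] < 50:
            count += 1
    return count


def count_vectorized(frame):
    """Count the same rows with a single boolean mask."""
    return int(((frame['A'] > 50) & (frame['B'] < 50)).sum())


loop_time = timeit.timeit(lambda: count_with_loop(df), number=1)
vec_time = timeit.timeit(lambda: count_vectorized(df), number=1)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")

# A Series that is 99% missing, stored densely and sparsely.
values = np.full(100_000, np.nan)
values[::100] = 1.0
dense = pd.Series(values)
sparse = dense.astype(pd.SparseDtype("float64", np.nan))
print(dense.memory_usage(index=False), sparse.memory_usage(index=False))
```

Both counting functions return the same number; only the time taken differs. The sparse Series stores just the non-missing values and their positions, which is why its memory footprint is smaller here.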