DataFrame Indexing

pandas
Published

July 31, 2024

Accessing Data: The Basics

Pandas offers several ways to access data within a DataFrame. The most common methods are using labels (column names and row indices) and integer-based location.

1. Using .loc for label-based indexing:

.loc allows you to access data using labels. This is generally preferred when you know the names of the columns and indices you want to access.

import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

print(df.loc['B', 'col2'])  # Output: 5

print(df.loc[:, 'col1'])  # Output: A    1\nB    2\nC    3\nName: col1, dtype: int64

print(df.loc[:, ['col1', 'col3']])

print(df.loc['A'])

print(df.loc['A':'B', 'col1':'col2'])

2. Using .iloc for integer-based indexing:

.iloc uses integer positions to access data. This is useful when you need to select data based on its position regardless of labels.

import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data)

print(df.iloc[1, 1])  # Output: 5

print(df.iloc[:, 0])  # Output: 0    1\n1    2\n2    3\nName: col1, dtype: int64

print(df.iloc[:, [0, 2]])

print(df.iloc[0])

print(df.iloc[0:2, 0:2])

3. Using [] for flexible indexing:

The square bracket notation [] offers a more flexible approach. It can sometimes use label-based indexing similar to .loc and integer-based indexing like .iloc, depending on the input. However, it is generally recommended to use .loc and .iloc explicitly for clarity and to avoid ambiguity.

import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}
df = pd.DataFrame(data)

print(df['col1'])

print(df[['col1', 'col3']])

print(df[0:2]) # This uses integer location, not labels.

Boolean Indexing

Boolean indexing allows you to select rows based on a condition. This is incredibly useful for filtering data.

import pandas as pd

data = {'col1': [1, 2, 3, 4, 5], 'col2': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

print(df[df['col1'] > 2])

print(df[(df['col1'] > 2) & (df['col2'] < 40)])

Indexing with .at and .iat

For accessing single elements, .at (label-based) and .iat (integer-based) offer optimized access:

import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

print(df.at['B', 'col2'])  # Output: 5
print(df.iat[1, 1])       # Output: 5