1. Beyond loc
and iloc
: Advanced Indexing
While loc
(label-based) and iloc
(integer-based) indexing are fundamental, Pandas offers more nuanced selection methods. Let’s explore some:
import pandas as pd
import numpy as np
= {'col1': [1, 2, 3, 4, 5],
data 'col2': [6, 7, 8, 9, 10],
'col3': ['A', 'B', 'C', 'D', 'E']}
= pd.DataFrame(data)
df
print("Rows where col1 > 2:\n", df[df['col1'] > 2])
print("\nRows where col1 > 2 and col2 < 9 using .query():\n", df.query('col1 > 2 and col2 < 9'))
print("\nSelecting first two rows and col1 and col3:\n", df.iloc[:2, [0,2]])
0, 'col1'] = 100
df.at[print("\nDataFrame after changing value at position 0, 'col1':\n", df)
2. Data Transformation with apply()
and applymap()
The apply()
and applymap()
methods provide powerful ways to transform data. apply()
operates on rows or columns, while applymap()
operates on individual elements.
def custom_function(row):
return row['col1'] * row['col2']
'col4'] = df.apply(custom_function, axis=1)
df[print("\nDataFrame after applying custom function:\n", df)
'col3'] = df['col3'].applymap(lambda x: x.lower())
df[print("\nDataFrame after applying applymap to lowercase col3:\n", df)
3. Efficient Data Cleaning with fillna()
and replace()
Missing data and inconsistent values are common challenges. Pandas provides excellent tools to address these.
= pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})
df_nan
= df_nan.fillna(0)
df_filled print("\nDataFrame after filling NaN with 0:\n", df_filled)
#Filling NaN values with Forward Fill
= df_nan.ffill()
df_ffill print("\nDataFrame after Forward Fill:\n", df_ffill)
= df_filled.replace(0, 100)
df_replaced print("\nDataFrame after replacing 0 with 100:\n", df_replaced)
4. Data Aggregation and Grouping with groupby()
The groupby()
method enables powerful data aggregation and analysis based on groups.
= df.groupby('col3')['col1'].mean()
grouped print("\nMean of col1 grouped by col3:\n", grouped)
5. Working with Time Series Data
Pandas excels in handling time series data. It offers functionalities for resampling, rolling calculations, and more.
= pd.date_range('20240101', periods=6)
dates = pd.Series(np.random.randn(6), index=dates)
ts print("\nTime Series Data:\n", ts)
= ts.resample('D').mean()
daily_ts print("\nResampled Time Series Data:\n", daily_ts)