Identifying Missing Values
Before you can fill missing values, you need to identify them. In Python, missing values are often represented as NaN (Not a Number) in pandas DataFrames. We can easily locate them using the .isnull() method:
import pandas as pd
import numpy as np
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df.isnull())
This will output a boolean DataFrame with True wherever a value is missing. We can also use .isna(), which behaves identically (in pandas, .isnull() is an alias for .isna()). To get the count of missing values per column, use .isnull().sum():
print(df.isnull().sum())
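Beyond per-column counts, it can help to see what share of each column is missing and which rows are affected. The following is a minimal sketch that rebuilds the same example DataFrame so it runs on its own; the names missing_share and rows_with_missing are illustrative, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, 12]})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = df.isna().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

A high share in one column may suggest dropping or flagging that column rather than filling it.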
Filling Missing Values: Various Techniques
Several methods exist for filling missing values, each with its own advantages and disadvantages. The best approach depends on the nature of your data and the context of your analysis.
1. Using fillna()
The fillna() method is a versatile tool offering several options:
- Replacing with a specific value:
df_filled_zero = df.fillna(0)  # Fill every missing value with 0
df_filled_mean = df['A'].fillna(df['A'].mean())  # Fill column A with its own mean
print(df_filled_zero)
print(df_filled_mean)
- Forward fill (ffill) and backward fill (bfill): Forward fill propagates the last valid observation forward, while backward fill propagates the next valid observation backward.
# fillna(method=...) is deprecated in recent pandas versions; use ffill() and bfill() directly
df_ffill = df.ffill()
df_bfill = df.bfill()
print(df_ffill)
print(df_bfill)
- Interpolation: This method estimates missing values from neighboring values (linear interpolation by default).
df_interpolated = df.interpolate()
print(df_interpolated)
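To see how these options differ, the following sketch applies each one to the same small Series. The Series s and its values are made up for illustration, and the comments describe what each call produces on this input rather than which option to prefer.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two consecutive gaps between 1.0 and 4.0
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_const = s.fillna(0)          # gaps become 0
filled_mean = s.fillna(s.mean())    # gaps become 2.5, the mean of 1.0 and 4.0
filled_ffill = s.ffill()            # gaps take the previous valid value, 1.0
filled_bfill = s.bfill()            # gaps take the next valid value, 4.0
filled_interp = s.interpolate()     # gaps are spaced linearly: 2.0 and 3.0

comparison = pd.DataFrame({
    'original': s,
    'constant': filled_const,
    'mean': filled_mean,
    'ffill': filled_ffill,
    'bfill': filled_bfill,
    'interpolate': filled_interp,
})
print(comparison)
```

Forward fill, backward fill, and interpolation depend on row order, so they tend to suit ordered data such as time series, whereas a constant or mean fill ignores order entirely.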
2. Using SimpleImputer from Scikit-learn
Scikit-learn’s SimpleImputer provides a more structured way to handle missing values, particularly useful for preparing data for machine learning models:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')  # Other strategies: 'median', 'most_frequent'
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
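When the imputed data feeds a model, a common practice is to learn the fill values from the training rows only and reuse them on new rows. The sketch below works under that assumption; the train and new_rows DataFrames are invented for illustration, and the median strategy is chosen only to keep the numbers easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training data and new, unseen rows
train = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0],
                      'B': [5.0, np.nan, 7.0, 8.0]})
new_rows = pd.DataFrame({'A': [np.nan, 10.0],
                         'B': [6.0, np.nan]})

# Learn the per-column medians from the training data only
imputer = SimpleImputer(strategy='median')
imputer.fit(train)
print(imputer.statistics_)  # learned fill value for each column

# Reuse those learned values to fill gaps in the new rows
new_filled = pd.DataFrame(imputer.transform(new_rows),
                          columns=new_rows.columns)
print(new_filled)
```

Separating fit from transform keeps information from the new rows out of the fill values, which is the main structural advantage over calling fillna() on each dataset independently.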
3. Advanced Imputation Techniques
For more complex scenarios, consider more sophisticated techniques like k-Nearest Neighbors imputation or model-based imputation (e.g., using a regression model to predict missing values). These methods are generally more computationally intensive but can provide more accurate results. Libraries like fancyimpute offer implementations of these advanced techniques. However, these are beyond the scope of this introductory post.
Handling Missing Categorical Values
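For categorical variables, you can fill missing values with the most frequent category (for example, by passing the column's mode to fillna() or by using SimpleImputer with the most_frequent strategy), or you can replace missing values with a new category like “Unknown” or “Missing”.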
df['D'] = ['X','Y',np.nan,'Z']
df['D'] = df['D'].fillna('Missing')
print(df)
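The most-frequent-value option can be sketched with plain pandas as well. The colors Series below is hypothetical example data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
colors = pd.Series(['red', 'blue', np.nan, 'red', np.nan], name='color')

# mode() ignores missing values and returns a Series (ties are possible),
# so take its first entry as the fill value
most_common = colors.mode()[0]
filled_mode = colors.fillna(most_common)
print(filled_mode)

# Alternative: keep the gaps visible as their own category
filled_flag = colors.fillna('Missing')
print(filled_flag)
```

Filling with the mode keeps the set of categories unchanged but inflates the most common one, whereas a separate “Missing” category preserves the fact that the value was absent.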