Fill Missing Values

pandas
Published

March 22, 2024

Identifying Missing Values

Before you can fill missing values, you need to identify them. In Python, missing values are often represented as NaN (Not a Number) in pandas DataFrames. We can easily locate them using the .isnull() method:

import pandas as pd
import numpy as np

data = {'A': [1, 2, np.nan, 4], 
        'B': [5, np.nan, 7, 8], 
        'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

print(df.isnull())

This will output a boolean DataFrame indicating where the missing values are. We can also use .isna() which is an alias for .isnull(). To get the count of missing values per column, use .isnull().sum():

print(df.isnull().sum())

Filling Missing Values: Various Techniques

Several methods exist for filling missing values, each with its own advantages and disadvantages. The best approach depends on the nature of your data and the context of your analysis.

1. Using fillna()

The fillna() method is a versatile tool offering several options:

  • Replacing with a specific value:
df_filled_zero = df.fillna(0)  # Fill with 0
df_filled_mean = df['A'].fillna(df['A'].mean()) #Fill with column mean
print(df_filled_zero)
print(df_filled_mean)
  • Forward fill (ffill) and backward fill (bfill): These methods propagate the last valid observation forward or backward.
df_ffill = df.fillna(method='ffill')
df_bfill = df.fillna(method='bfill')
print(df_ffill)
print(df_bfill)
  • Interpolation: This method estimates missing values based on neighboring values.
df_interpolated = df.interpolate()
print(df_interpolated)

2. Using SimpleImputer from Scikit-learn

Scikit-learn’s SimpleImputer provides a more structured way to handle missing values, particularly useful for preparing data for machine learning models:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean') #Other strategies: 'median', 'most_frequent'
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)

3. Advanced Imputation Techniques

For more complex scenarios, consider more sophisticated techniques like k-Nearest Neighbors imputation or model-based imputation (e.g., using a regression model to predict missing values). These methods are generally more computationally intensive but can provide more accurate results. Libraries like fancyimpute offer implementations of these advanced techniques. However, these are beyond the scope of this introductory post.

Handling Missing Categorical Values

For categorical variables, fillna() can be used with the most_frequent strategy or you can replace missing values with a new category like “Unknown” or “Missing”.

df['D'] = ['X','Y',np.nan,'Z']
df['D'] = df['D'].fillna('Missing')
print(df)