Pandas Drop Duplicates

pandas · Published October 1, 2023

Understanding Duplicate Rows

Duplicate rows in a DataFrame are rows with identical values across all columns. Identifying and handling these duplicates is crucial for data cleaning and ensuring data integrity. Incorrect handling of duplicates can lead to skewed statistical analyses and flawed conclusions.
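
Before removing anything, it is often useful to check how many duplicate rows you actually have. As a minimal sketch using the same toy data as the examples below, the companion duplicated() method flags rows that repeat an earlier row, and summing the resulting booleans counts them:

import pandas as pd

data = {'col1': [1, 2, 2, 3, 3, 3], 'col2': ['A', 'B', 'B', 'C', 'C', 'C']}
df = pd.DataFrame(data)

# duplicated() returns a boolean Series: True for rows that repeat an earlier row
print(df.duplicated())

# Summing the booleans gives the number of duplicate rows (3 here)
print(df.duplicated().sum())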

The drop_duplicates() Method

The drop_duplicates() method offers a flexible way to remove duplicate rows from your Pandas DataFrame. It returns a new DataFrame with duplicates removed, leaving the original DataFrame unchanged.

Basic Usage:

The simplest application drops all duplicate rows.

import pandas as pd

# Toy DataFrame containing several exact duplicate rows
data = {'col1': [1, 2, 2, 3, 3, 3], 'col2': ['A', 'B', 'B', 'C', 'C', 'C']}
df = pd.DataFrame(data)

# Drop rows that are identical across all columns (the first occurrence is kept)
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

This will output:

   col1 col2
0     1    A
1     2    B
3     3    C

Specifying Subsets

Often, you might only want to consider specific columns when identifying duplicates. The subset parameter allows you to specify a list of column names. Only rows with identical values in the specified columns will be considered duplicates.

df_subset_duplicates = df.drop_duplicates(subset=['col1'])
print(df_subset_duplicates)

This will output:

   col1 col2
0     1    A
1     2    B
3     3    C

Here, duplicates are identified based solely on col1. In this example the result happens to match the full-row case because col2 mirrors col1 exactly; the sketch below uses data where the two behave differently.
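
As a quick sketch with made-up data, the difference matters when the subset column repeats but other columns do not:

data2 = {'col1': [1, 1, 2], 'col2': ['A', 'B', 'C']}
df2 = pd.DataFrame(data2)

# Only col1 is compared, so the second row is dropped even though its col2 value is unique
print(df2.drop_duplicates(subset=['col1']))

Rows 0 and 2 remain; row 1 is treated as a duplicate of row 0 because they share the same col1 value.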

Keeping the First or Last Occurrence

By default, drop_duplicates() keeps the first occurrence of each unique row. The keep parameter controls this behavior:

  • 'first' (default): Keeps the first occurrence.
  • 'last' : Keeps the last occurrence.
  • False: Drops every row that has a duplicate, keeping no occurrence at all (a short sketch follows the keep='last' example below).

df_keep_last = df.drop_duplicates(subset=['col1'], keep='last')
print(df_keep_last)

This will output:

   col1 col2
0     1    A
2     2    B
5     3    C
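
To round out the keep options, here is a short sketch of keep=False on the same DataFrame (the variable name is just illustrative); every col1 value that occurs more than once is removed entirely:

df_keep_none = df.drop_duplicates(subset=['col1'], keep=False)
print(df_keep_none)

This will output:

   col1 col2
0     1    A

Only the row with col1 equal to 1 survives, because 1 is the only value that appears exactly once.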

In-place Modification

To modify the DataFrame directly without creating a new one, use the inplace=True parameter. Caution: This modifies the original DataFrame.

df.drop_duplicates(subset=['col1'], keep='first', inplace=True)
print(df)

This modifies df directly. Note that with inplace=True the method returns None, so avoid assigning the result back to a variable.

Handling More Complex Scenarios

For more intricate duplicate handling, you can combine drop_duplicates() (or its companion duplicated()) with other Pandas methods such as boolean indexing, or pre-process your data before removing duplicates. This allows finer-grained control over which rows are considered duplicates, as sketched below.
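
As one illustrative sketch (the column names and data here are hypothetical), slight case differences can hide duplicates in a text column; normalizing the column first and then filtering with boolean indexing handles this:

import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'alice', 'Bob'],
                       'city': ['NY', 'NY', 'LA']})

# Normalize the text so 'Alice' and 'alice' compare as equal
normalized = people['name'].str.lower()

# Boolean indexing: keep only rows whose normalized name has not appeared before
deduped = people[~normalized.duplicated(keep='first')]
print(deduped)

Rows 0 and 2 remain; the second row is dropped as a duplicate of the first despite the different capitalization.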

Beyond Basic Duplicates

The drop_duplicates() method primarily focuses on exact matches across columns. For dealing with near-duplicates (e.g., slight variations in string values), techniques like fuzzy matching or string similarity measures are needed, which are beyond the scope of this basic introduction to drop_duplicates().