Merging DataFrames

pandas
Published

July 18, 2023

Understanding DataFrame Merging

Pandas offers flexible and efficient tools for combining DataFrames based on shared columns or indices. The core function for this is pd.merge(), which offers several options to control the merging behavior. Before we dive into the specifics, let’s establish the fundamental concepts:

  • Inner Join: Returns only the rows where the join key exists in both DataFrames.
  • Outer Join: Returns all rows from both DataFrames. Missing values are filled with NaN where there’s no match.
  • Left Join: Returns all rows from the left DataFrame and matching rows from the right DataFrame. Unmatched rows from the right DataFrame are omitted.
  • Right Join: Returns all rows from the right DataFrame and matching rows from the left DataFrame. Unmatched rows from the left DataFrame are omitted.

Merging DataFrames using pd.merge()

Let’s start with a practical example. We’ll create two sample DataFrames:

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})

print("DataFrame 1:\n", df1)
print("\nDataFrame 2:\n", df2)

This will output two DataFrames:

DataFrame 1:
   key  value1
0   A       1
1   B       2
2   C       3
3   D       4

DataFrame 2:
   key  value2
0   B       5
1   D       6
2   E       7
3   F       8

Now let’s merge them using different join types:

Inner Join:

merged_inner = pd.merge(df1, df2, on='key', how='inner')
print("\nInner Join:\n", merged_inner)

This will only include rows where ‘key’ exists in both DataFrames:

Inner Join:
   key  value1  value2
0   B       2       5
1   D       4       6

Left Join:

merged_left = pd.merge(df1, df2, on='key', how='left')
print("\nLeft Join:\n", merged_left)

This includes all rows from df1, filling in value2 with NaN where there’s no match in df2:

Left Join:
   key  value1  value2
0   A       1    NaN
1   B       2    5.0
2   C       3    NaN
3   D       4    6.0

Right Join:

merged_right = pd.merge(df1, df2, on='key', how='right')
print("\nRight Join:\n", merged_right)

This mirrors the left join but with priorities reversed:

Right Join:
   key  value1  value2
0   B     2.0       5
1   D     4.0       6
2   E     NaN       7
3   F     NaN       8

Outer Join:

merged_outer = pd.merge(df1, df2, on='key', how='outer')
print("\nOuter Join:\n", merged_outer)

This includes all rows from both DataFrames:

Outer Join:
   key  value1  value2
0   A     1.0    NaN
1   B     2.0    5.0
2   C     3.0    NaN
3   D     4.0    6.0
4   E     NaN    7.0
5   F     NaN    8.0

Merging on Multiple Keys

You can merge on multiple keys by providing a list to the on parameter:

#Example with multiple keys (requires adjusting the sample DataFrames) - omitted for brevity due to length constraints.

Handling Different Key Names

If your join keys have different names in each DataFrame, use the left_on and right_on parameters:

#Example with different key names (requires adjusting the sample DataFrames) - omitted for brevity due to length constraints.