Concatenating DataFrames

pandas
Published

January 23, 2023

Understanding DataFrame Concatenation

DataFrame concatenation involves joining DataFrames along a particular axis (rows or columns). The pd.concat() function is the primary tool for this task. It offers flexibility in how you combine your data, handling different scenarios efficiently.

Methods for Concatenating DataFrames

Let’s explore the various ways to use pd.concat(), illustrating each with code examples. We’ll start by creating some sample DataFrames:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [4, 5, 6], 'B': [7, 8, 9]})
df3 = pd.DataFrame({'C': [7,8,9], 'D': [10,11,12]})

print("DataFrame 1:\n", df1)
print("\nDataFrame 2:\n", df2)
print("\nDataFrame 3:\n", df3)

Concatenating along rows (axis=0)

This is the default behavior of pd.concat(). It stacks DataFrames vertically, adding rows.

concatenated_df_rows = pd.concat([df1, df2])
print("\nConcatenated along rows:\n", concatenated_df_rows)

Concatenating along columns (axis=1)

This joins DataFrames horizontally, adding columns. Note that this requires the DataFrames to have the same number of rows.

concatenated_df_cols = pd.concat([df1, df3], axis=1)
print("\nConcatenated along columns:\n", concatenated_df_cols)

Handling Different Indices

If your DataFrames have different indices, pd.concat() will preserve them. You can reset the index using ignore_index=True.

df4 = pd.DataFrame({'A': [7,8,9], 'B': [10,11,12]}, index=[3,4,5])
concatenated_df_ignore_index = pd.concat([df1, df4], ignore_index=True)
print("\nConcatenated with ignore_index=True:\n", concatenated_df_ignore_index)

Concatenating with Keys

You can add a hierarchical index using the keys parameter, making it easier to identify the source of each DataFrame.

concatenated_df_keys = pd.concat([df1, df2], keys=['df1', 'df2'])
print("\nConcatenated with keys:\n", concatenated_df_keys)

Appending DataFrames

While pd.concat() is the general purpose function, append() provides a more concise way to add a single DataFrame to another. Note that .append() is now deprecated and it’s recommended to use concat instead. The following shows an example for context but its use is discouraged.

#Deprecated Method - use concat instead

Choosing the Right Method

The best method depends on your specific needs. If you’re adding rows, use axis=0 (the default). For adding columns, use axis=1. Consider ignore_index=True if you want a continuous index, and keys for hierarchical indexing when working with multiple DataFrames. Remember to always inspect your resulting DataFrame to ensure the concatenation happened as expected. Dealing with mismatched indices and column names carefully can significantly improve the success of your concatenation operations.