Appending DataFrames

pandas
Published

August 5, 2023

Understanding the Problem: Why append is Slow

The append method, while seemingly straightforward, suffers from performance issues. Each call to append creates a new DataFrame, copying the existing data. This becomes computationally expensive with repeated appends, especially when dealing with substantial datasets.

import pandas as pd
import time

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df3 = pd.DataFrame({'A': [9, 10], 'B': [11, 12]})

start_time = time.time()
result_append = df1.append(df2, ignore_index=True).append(df3, ignore_index=True)
end_time = time.time()
print(f"Time taken using append: {end_time - start_time:.4f} seconds")
print(result_append)

The above code demonstrates the append method. Notice how we use ignore_index=True to reset the index after each append. While functional, this method is slow for numerous appends.

Superior Alternatives: concat for Efficiency

The pd.concat function offers a significantly more efficient solution. It concatenates DataFrames along a particular axis (0 for rows, 1 for columns). Crucially, concat performs the operation in a more optimized manner, reducing the overhead associated with repeated copying.

import pandas as pd
import time


start_time = time.time()
result_concat = pd.concat([df1, df2, df3], ignore_index=True)
end_time = time.time()
print(f"Time taken using concat: {end_time - start_time:.4f} seconds")
print(result_concat)

Observe the marked improvement in speed when using concat. This is the recommended approach for appending multiple DataFrames.

Handling Lists of DataFrames

If you have a list of DataFrames, concat effortlessly handles this scenario.

import pandas as pd
import time

dfs = [df1, df2, df3] #list of dataframes

start_time = time.time()
result_concat_list = pd.concat(dfs, ignore_index=True)
end_time = time.time()
print(f"Time taken using concat with list: {end_time - start_time:.4f} seconds")
print(result_concat_list)

Iterative Append: A Less Efficient but Flexible Approach

While less efficient than concat, an iterative approach using a loop and concat offers flexibility when appending DataFrames dynamically or conditionally.

import pandas as pd
import time

dfs = [df1, df2, df3]
result_iterative = pd.DataFrame()

start_time = time.time()
for df in dfs:
    result_iterative = pd.concat([result_iterative,df], ignore_index=True)
end_time = time.time()
print(f"Time taken using iterative concat: {end_time - start_time:.4f} seconds")
print(result_iterative)

This method shows how to build a DataFrame iteratively, although it’s less efficient than directly using concat on a list.

Choosing the Right Method

For optimal performance, always prioritize pd.concat when appending multiple DataFrames. The iterative approach is useful in scenarios requiring more dynamic control over the appending process. Avoid using append in loops for large datasets due to its significant performance drawbacks. Remember to use ignore_index=True to maintain a continuous index after concatenation.