Exponentially Slow Concatenation of DataFrames
When working with large datasets, it's common to partition the data into smaller chunks for efficient processing. However, concatenating these chunks back together can become dramatically slower as the number of chunks increases.
Cause of Slowdown
The slowdown comes from how pd.concat() works. When called inside a loop, each call builds a brand-new DataFrame and copies all of the data accumulated so far into it. The total copying cost therefore grows quadratically with the number of iterations, which is why the runtime appears to blow up "exponentially" as more chunks are added.
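For illustration, here is a minimal sketch of the slow pattern (the chunk list df_list is a hypothetical stand-in for your own data). On every iteration, pd.concat() re-copies every row appended so far into a fresh DataFrame:

import pandas as pd

# Anti-pattern: pd.concat() inside the loop.
# Each call copies all rows accumulated so far, so total work
# grows quadratically with the number of chunks.
result = pd.DataFrame()
for df_chunk in df_list:   # df_list: hypothetical list of chunk DataFrames
    result = pd.concat([result, df_chunk], axis=0)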
Avoiding the Slowdown
To circumvent this performance bottleneck, it's crucial to avoid calling pd.concat() inside a for-loop. Instead, store the chunks in a list and concatenate them all at once after processing:
import pandas as pd

# Collect each processed chunk in a list...
super_x = []
for df_chunk in df_list:                 # df_list: the list of raw chunks
    x, y = preprocess_data(df_chunk)     # preprocess_data comes from your own pipeline
    super_x.append(x)                    # (collect y the same way if you need it)

# ...then concatenate everything in a single call.
super_x = pd.concat(super_x, axis=0)
With this approach, each chunk's data is copied only once, during the single pd.concat() call, which significantly reduces the overall processing time.
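As a rough, self-contained illustration of the difference, the timing sketch below compares the two approaches on synthetic data (the chunk count and sizes are arbitrary choices for demonstration):

import time
import numpy as np
import pandas as pd

# Build some synthetic chunks.
chunks = [pd.DataFrame(np.random.rand(1000, 4)) for _ in range(500)]

# Quadratic: concatenate inside the loop.
start = time.perf_counter()
slow = pd.DataFrame()
for chunk in chunks:
    slow = pd.concat([slow, chunk], axis=0)
print("concat in loop:", time.perf_counter() - start)

# Linear: collect first, then concatenate once.
start = time.perf_counter()
fast = pd.concat(chunks, axis=0)
print("single concat: ", time.perf_counter() - start)

On typical hardware, the single-call version finishes orders of magnitude faster, and the gap widens as the number of chunks grows.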