Comparing DataFrames for Differences in Rows
When two dataframes have identical row and column labels, the elementwise comparison df1 != df2 is sufficient. However, if the dataframes contain different sets of rows, a different approach is needed to identify the rows that differ.
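As a minimal sketch of the identically labeled case, the snippet below compares two small frames cell by cell. The names left, right, mask, and changed_rows and the sample values are illustrative assumptions, not data from the original example.

```python
import pandas as pd

# Hypothetical frames that share the same index and columns.
left = pd.DataFrame({"Fruit": ["Apple", "Orange"], "Num": [22.1, 8.6]})
right = pd.DataFrame({"Fruit": ["Apple", "Orange"], "Num": [22.1, 9.0]})

# Elementwise comparison: True marks a cell whose value differs.
mask = left != right

# Keep the rows of `right` that have at least one differing cell.
changed_rows = right[mask.any(axis=1)]
print(changed_rows)
```

If the two frames do not share the same index and columns, pandas raises a ValueError for this comparison, which is why the row-set case needs another technique.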
Concat, Group, and Filter
One method to compare dataframes for row differences is to concatenate them, group by columns, and filter the unique rows. The following code illustrates this:
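<code class="python">import pandas as pd

# Stack both dataframes and renumber the rows 0..n-1
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
# Group identical rows together by grouping on every column
df_gpby = df.groupby(list(df.columns))
# Keep the index label of each row that occurs exactly once
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
result = df.reindex(idx)</code>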
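The concatenated dataframe (df) is grouped by all of its columns (df_gpby), so rows with identical values in every column fall into the same group. The groups attribute is a dictionary that maps each distinct row to the index labels of the rows holding those values, and groups.values() iterates over those collections of labels. Keeping only the collections of length one (len(x) == 1) selects the rows that occur exactly once in the concatenation, which, provided neither dataframe contains duplicate rows of its own, are the rows that exist in only one dataframe. Finally, reindexing the dataframe with the filtered labels (idx) produces a dataframe containing the row differences.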
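The following is a self-contained sketch of the same concat, group, and filter steps that can be run as-is. The two dataframes here are hypothetical sample data made up for illustration; they are not the dataframes behind the example output in this article.

```python
import pandas as pd

# Hypothetical sample data (assumption, not the article's original frames).
df1 = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-24"],
    "Fruit": ["Banana", "Orange"],
    "Num": [22.1, 8.6],
    "Color": ["Yellow", "Orange"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-24", "2013-11-25"],
    "Fruit": ["Banana", "Orange", "Apple"],
    "Num": [22.1, 8.6, 22.1],
    "Color": ["Yellow", "Orange", "Red"],
})

# Stack the frames and renumber the rows 0..4.
df = pd.concat([df1, df2]).reset_index(drop=True)

# Map each distinct row to the index labels where it occurs.
groups = df.groupby(list(df.columns)).groups

# A row seen exactly once exists in only one of the two frames.
idx = [labels[0] for labels in groups.values() if len(labels) == 1]
result = df.reindex(idx)
print(result)
```

With this sample data only the Apple row, which sits at position 4 after the index reset, survives the filter.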
Example Output
Using the example dataframes provided:
>>> result
         Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red
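This output shows the rows that are in df2 but not in df1. In general, the method returns every row that appears in exactly one of the two dataframes, whichever one that is.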
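The grouping approach does not record which dataframe a differing row came from. One alternative, which is a different technique from the one described in this article, is an outer merge with indicator=True. The sketch below uses hypothetical sample data, and the names merged, only_df1, and only_df2 are illustrative.

```python
import pandas as pd

# Hypothetical sample data (assumption, chosen for illustration).
df1 = pd.DataFrame({"Fruit": ["Banana", "Orange"], "Num": [22.1, 8.6]})
df2 = pd.DataFrame({"Fruit": ["Banana", "Apple"], "Num": [22.1, 22.1]})

# Outer merge on all shared columns; the added "_merge" column
# is "left_only", "right_only", or "both" for each row.
merged = df1.merge(df2, how="outer", indicator=True)

# Split the differences by their source dataframe.
only_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
only_df2 = merged[merged["_merge"] == "right_only"].drop(columns="_merge")
print(only_df1)
print(only_df2)
```

Because the merge matches on every shared column, duplicate rows within either dataframe can multiply in the result, so this sketch assumes each dataframe has no repeated rows.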