What are the methods for pandas deduplication? - Python Tutorial - php.cn

What are the methods to remove duplicates in pandas?

百草

Released: 2023-11-22 11:55:17


Pandas offers four common ways to remove duplicates: 1. the drop_duplicates() method; 2. the duplicated() method; 3. the unique() method; 4. the value_counts() method. In brief: drop_duplicates() deletes duplicate rows from a data frame and returns a new data frame. Its parameters control how deduplication is performed, such as which occurrence of a duplicate to keep and which columns to compare.


Environment used for this tutorial: Windows 10, DELL G3 computer.

Pandas is a Python data analysis library that provides several ways to remove duplicates. The most common ones are:

1. Use the drop_duplicates() method

The drop_duplicates() method deletes duplicate rows from a data frame and returns a new data frame. Its parameters control how deduplication is performed: subset names the columns to compare (all columns by default), and keep selects which occurrence survives ('first', the default; 'last'; or False to drop every duplicated row).

Sample code:

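import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})
# By default, rows are compared across all columns and a new data frame is returned.
# No row in this sample is repeated in full, so all five rows are kept.
df_unique = df.drop_duplicates()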
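
The effect of choosing the comparison columns and the retained occurrence can be sketched as follows. This is a minimal illustration on the same sample data; the variable names first_kept, last_kept and none_kept are only placeholders for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Compare only column A and keep the first occurrence of each value (the default).
first_kept = df.drop_duplicates(subset=['A'], keep='first')

# Keep the last occurrence of each value in column A instead.
last_kept = df.drop_duplicates(subset=['A'], keep='last')

# keep=False drops every row whose A value occurs more than once.
none_kept = df.drop_duplicates(subset=['A'], keep=False)

print(first_kept.index.tolist())  # [0, 1, 4]
print(last_kept.index.tolist())   # [2, 3, 4]
print(none_kept.index.tolist())   # [4]
```

The original row labels are preserved in each result, which makes it easy to see which rows survived.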


2. Use the duplicated() method

The duplicated() method finds duplicate rows in a data frame and returns a Boolean Series in which True marks a row that repeats an earlier one. It does not remove anything itself; the Series is used as a mask to filter the data frame. It accepts the same subset and keep parameters as drop_duplicates().

Sample code:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})
# duplicated() marks the repeated rows; the ~ operator inverts the mask
# so that only the rows that are not duplicates are selected.
df_unique = df[~df.duplicated()]
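
Because duplicated() only flags rows, it can also be used to inspect duplicates before deciding what to drop. The sketch below assumes the same sample data and compares on column A only; later_repeats and all_repeats are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Default keep='first': only the later repeats of a value are marked True.
later_repeats = df.duplicated(subset=['A'])

# keep=False marks every row that belongs to a group of duplicates.
all_repeats = df.duplicated(subset=['A'], keep=False)

print(later_repeats.tolist())  # [False, False, True, True, False]
print(all_repeats.tolist())    # [True, True, True, True, False]

# Select the flagged rows to review them before dropping anything.
print(df[all_repeats])
```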


3. Use the unique() method

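The unique() method returns the distinct values of a single column in order of first appearance, as a NumPy array (or an extension array for some dtypes). It is defined on Series rather than on DataFrame, so to deduplicate several columns it has to be applied to each column separately.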

Sample code:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})
# Apply unique() to each column. The result is a data frame in which
# columns with fewer distinct values are padded with NaN.
df_unique = df.apply(lambda x: pd.Series(x.unique()))
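
For a single column, calling unique() directly is usually simpler. The following is a minimal sketch on the same sample data; unique_a and count_a are illustrative names, and nunique() is shown only as a related Series method for counting distinct values.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Series.unique() returns the distinct values in order of first appearance.
unique_a = df['A'].unique()

# Series.nunique() returns only the number of distinct values.
count_a = df['A'].nunique()

print(unique_a.tolist())  # [1, 2, 3]
print(count_a)            # 3
```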


4. Use the value_counts() method

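The value_counts() method counts how many times each distinct value occurs and returns a Series whose index holds the distinct values, sorted by count in descending order by default. Because each value appears only once in the index, the result gives the deduplicated values together with their frequencies. It is available on a Series for a single column and on a DataFrame for combinations of columns.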

Sample code:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})
# Group by all columns and count each distinct row. size() returns a Series,
# which reset_index() turns into a data frame with a 'counts' column.
# This groupby form counts the same rows as df.value_counts().
df_unique = df.groupby(df.columns.tolist()).size().reset_index(name='counts')
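
For a single column, value_counts() can be called on the Series directly. This is a minimal sketch on the same sample data; counts_a and distinct_a are illustrative names, and the results are sorted here only to make the printed output deterministic.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Count how often each value occurs in column A.
counts_a = df['A'].value_counts()

# The index of the result holds the distinct values.
distinct_a = sorted(counts_a.index.tolist())

print(counts_a.sort_index().to_dict())  # {1: 2, 2: 2, 3: 1}
print(distinct_a)                       # [1, 2, 3]
```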

