Pandas offers several methods for handling duplicates: 1. the drop_duplicates() method; 2. the duplicated() method; 3. the unique() method; 4. the value_counts() method. In detail: 1. The drop_duplicates() method deletes duplicate rows from a DataFrame and returns a new DataFrame. Its parameters control how deduplication is performed, such as which occurrence of a duplicate to keep and which columns to compare.
The operating system for this tutorial: Windows 10 system, DELL G3 computer.
Pandas is a powerful Python data analysis library that provides a variety of duplicate removal methods. The following are common methods for using Pandas to perform deduplication operations:
1. Use the drop_duplicates() method
The drop_duplicates() method deletes duplicate rows from a DataFrame and returns a new DataFrame. Its parameters control how deduplication is performed: subset selects the columns to compare, and keep ('first', 'last', or False) selects which occurrence of each duplicate is retained.
Sample code:
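import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# By default, rows are compared across all columns and a new DataFrame is returned.
# No row in this sample repeats in full, so all five rows are kept.
df_unique = df.drop_duplicates()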
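Because the sample frame has no fully repeated rows, the effect of the parameters is easier to see when the comparison is restricted to one column. The following is a minimal sketch using the same sample data; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Compare column A only and keep the first row for each value of A.
first_per_a = df.drop_duplicates(subset=['A'], keep='first')

# Same comparison, but keep the last row for each value of A.
last_per_a = df.drop_duplicates(subset=['A'], keep='last')

# keep=False drops every row whose A value occurs more than once.
only_singletons = df.drop_duplicates(subset=['A'], keep=False)

print(first_per_a)
print(last_per_a)
print(only_singletons)
```

The original row labels are preserved in each result; chaining reset_index(drop=True) renumbers them if a clean 0..n-1 index is needed.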
2. Use the duplicated() method
The duplicated() method finds duplicate rows in a DataFrame and returns a Boolean Series in which True marks a row that repeats another row. Its subset parameter selects the columns to compare, and keep controls which occurrence is left unmarked. The method does not remove anything itself; the result is used as a mask to filter the DataFrame.
Sample code:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# duplicated() marks repeated rows as True; the ~ operator inverts the mask
# so that only the rows that are not repeats are selected.
df_unique = df[~df.duplicated()]
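The mask is also useful for inspecting duplicates before deleting them. The sketch below, again on the sample data and with illustrative variable names, compares the default keep='first' with keep=False on column A.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Default keep='first': the first occurrence is False, later repeats are True.
mask_first = df.duplicated(subset=['A'])

# keep=False flags every member of a repeated group, including the first.
mask_all = df.duplicated(subset=['A'], keep=False)

# Rows whose A value appears more than once, and the number of surplus rows.
repeated_rows = df[mask_all]
n_repeats = int(mask_first.sum())

print(repeated_rows)
print(n_repeats)
```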
3. Use the unique() method
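The unique() method returns the distinct values of a Series in order of first appearance, as a NumPy array (or an extension array for extension dtypes). It is defined on Series, not on DataFrame, so it deduplicates one column at a time; to cover several columns, apply it to each column.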
Sample code:
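import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Deduplicate each column independently with apply(). The result is a DataFrame;
# columns with fewer unique values are padded with NaN.
df_unique = df.apply(lambda x: pd.Series(x.unique()))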
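For a single column, calling unique() directly avoids the NaN padding. A minimal sketch on the sample data follows; nunique() is a related pandas method that returns only the number of distinct values.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Distinct values of column A as a NumPy array, in order of first appearance.
unique_a = df['A'].unique()

# Number of distinct values in column A, and per column for the whole frame.
n_unique_a = df['A'].nunique()
n_unique_per_column = df.nunique()

print(unique_a)
print(n_unique_a)
print(n_unique_per_column)
```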
4. Use the value_counts() method
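The value_counts() method counts how many times each distinct value occurs and returns a Series indexed by those values. It does not remove duplicates itself, but the index of the result lists each distinct value once, and the counts show which values repeat. The sample below obtains equivalent per-row counts with groupby() and size().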
Sample code:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Group by all columns and count each distinct row. size() returns a Series;
# reset_index(name='counts') converts it to a DataFrame for easier viewing.
df_unique = df.groupby(df.columns.tolist()).size().reset_index(name='counts')
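The groupby() sample can also be written with value_counts() itself. The sketch below assumes pandas 1.1 or later, where DataFrame.value_counts() is available; Series.value_counts() works in earlier versions as well. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3], 'B': [4, 5, 6, 7, 8]})

# Occurrences of each value in column A, sorted by count in descending order.
counts_a = df['A'].value_counts()

# Values of A that occur more than once.
repeated_values = sorted(counts_a[counts_a > 1].index.tolist())

# Occurrences of each distinct row (assumes pandas >= 1.1).
row_counts = df.value_counts()

print(counts_a)
print(repeated_values)
print(row_counts)
```

Here every full row occurs once, so all row counts are 1, while column A on its own shows the values 1 and 2 occurring twice.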