如何在Pandas中找到并筛选重复行?

如何在Pandas中找到并筛选重复行?

有时,在数据分析期间,我们需要看重复行以更好地理解数据,而不是直接删除它们。

幸运的是,在Pandas中,我们有几种方法可以处理重复行。

更多Pandas相关文章,请阅读:Pandas 教程

.duplciated()

此方法允许我们在DataFrame中提取重复行。我们将使用一个包含重复项的新数据集。

import pandas as pd
import numpy as np

# 使用特定列导入HR数据集
df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/HRDataset.csv",
usecols = ("Employee_Name""PerformanceScore","Position","CitizenDesc"))

# 按雇员姓名排序并使其永久化
df.sort_values("Employee_Name"inplace = True)
df.head(3)
Python
Employee_Name Position CitizenDesc PerformanceScore
0 Adinolfi Production Technician I US Citizen Exceeds
1 Adinolfi Sr. DBA US Citizen Fully Meets
2 Adinolfi Production Technician II US Citizen Fully Meets

duplicated() 默认工作方式是通过 keep 参数,将每个值的第一次出现标记为非重复项。

如果一行不止一次重复出现,该方法不会将该行标记为重复项,而是将第一行后续的每行标记为重复项。混淆了吧?让我再用一个例子来解释一下,假设篮子里有 3 个苹果,这个方法将第一个苹果标记为非重复项,将其他两个苹果标记为重复项。

例子

df["Employee_Name"].head(3)
Python

输出

0 Adinolfi
1 Adinolfi
2 Adinolfi
Name: Employee_Name, dtype: object
Python

例子

df["Employee_Name"].duplicated().head(3)
Python

输出

0 False
1 True
2 True
Name: Employee_Name, dtype: bool
Python

现在,要提取重复项(记住第一次出现不是重复项,而是后续出现的才是重复项),我们需要将此方法传递到数据框中。

df.shape
Python
(310, 4)
Python
df[df["Employee_Name"].duplicated()]
Python
Employee_Name Position CitizenDesc PerformanceScore
1 Adinolfi Sr. DBA US Citizen Fully Meets
2 Adinolfi Production Technician II US Citizen Fully Meets
3 Adinolfi Production Technician I US Citizen Fully Meets
4 Adinolfi Production Manager US Citizen Fully Meets
6 Anderson Production Technician I US Citizen Exceeds
303 Wang Production Technician II US Citizen Fully Meets
304 Wang Production Technician II US Citizen Fully Meets
305 Wang Production Technician I US Citizen PIP
306 Wang CIO US Citizen Exceeds
307 Wang Data Analyst US Citizen Fully Meets

从上面的输出中,使用.duplicated()方法提取出的有79个重复值,共310行。

参数 -“last”

默认情况下,该方法将标记值的第一个出现为非重复项,我们可以通过传递keep = last参数来更改这种行为。

这个参数会将前两个苹果标记为重复项,将最后一个标记为非重复项。

df[df["Employee_Name"].duplicated(keep="last")]
Python
Employee_Name Position CitizenDesc PerformanceScore
0 Adinolfi Production Technician I US Citizen Exceeds
1 Adinolfi Sr. DBA US Citizen Fully Meets
2 Adinolfi Production Technician II US Citizen Fully Meets
3 Adinolfi Production Technician I US Citizen Fully Meets
5 Anderson Production Technician I US Citizen Fully Meets
302 Wang Production Technician II US Citizen Exceeds
303 Wang Production Technician II US Citizen Fully Meets
304 Wang Production Technician II US Citizen Fully Meets
305 Wang Production Technician I US Citizen PIP
306 Wang CIO US Citizen Exceeds

ARGUMENT – FALSE

keep参数还可以接受一个额外的”false”参数,将所有出现多次的值标记为重复项,例如在上面的示例中,所有的3个苹果均将被标记为重复项,而不仅是第一个或最后一个。

注意 – 在指定false参数时不要使用引号。

df[df"Employee_Name"].duplicated(keep=False)]
Python
Employee_Name Position CitizenDesc PerformanceScore
0 Adinolfi Production Technician I US Citizen Exceeds
1 Adinolfi Sr. DBA US Citizen Fully Meets
2 Adinolfi Production Technician II US Citizen Fully Meets
3 Adinolfi Production Technician I US Citizen Fully Meets
4 Adinolfi Production Manager US Citizen Fully Meets
303 Wang Production Technician II US Citizen Fully Meets
304 Wang Production Technician II US Citizen Fully Meets
305 Wang Production Technician I US Citizen PIP
306 Wang CIO US Citizen Exceeds
307 Wang Data Analyst US Citizen Fully Meets

现在最终,要从数据集中提取独特的值,我们可以使用“~”(波浪线)符号对值进行否定:

df_unique~df["Employee_Name"].duplicated(keep=False)df[df_unique]
Python
Employee_Name Position CitizenDesc PerformanceScore
7 Andreola Software Engineer US Citizen Fully Meets
25 Bozzi Production Manager US Citizen Fully Meets
26 Bramante Director of Operations US Citizen Exceeds
27 Brill Production Technician I US Citizen Fully Meets
34 Burkett Production Technician II US Citizen Fully Meets
276 Sweetwater Software Engineer US Citizen Exceeds
277 Szabo Production Technician I Non-Citizen Fully Meets
278 Tavares Production Technician II US Citizen Fully Meets
308 Zhou Production Technician I US Citizen Fully Meets
309 Zima NaN NaN NaN

drop_duplicates()

此方法与上一方法非常相似,但此方法可应用于DataFrame而不仅是单个系列。

注意:此方法查找DataFrame的所有列中的重复行并将其删除。

len(df)
Python

输出结果

310
Python
len(df.drop_duplicates())
Python

输出结果

290
Python

子集参数

子集参数接受作为字符串值的列名列表,我们可以在其中检查重复项。

df1=df.drop_duplicates(subset=["Employee_Name"],keep="first")df1
Python
Employee_Name Position CitizenDesc PerformanceScore
0 Adinolfi Production Technician I US Citizen Exceeds
5 Anderson Production Technician I US Citizen Fully Meets
7 Andreola Software Engineer US Citizen Fully Meets
14 Athwal Production Technician I US Citizen Fully Meets
20 Beak Production Technician I US Citizen Fully Meets
293 Von Massenbach Production Technician II US Citizen Fully Meets
295 Wallace Production Technician I US Citizen Needs Improvement
300 Wang Production Technician I Eligible NonCitizen Fully Meets
308 Zhou Production Technician I US Citizen Fully Meets
309 Zima NaN NaN NaN

我们可以指定多个列并使用上一节中讨论的所有保留参数。

df1=df.drop_duplicates(subset="Employee_Name""CitizenDesc"],keep=False)df1
Python
Employee_Name Position CitizenDesc PerformanceScore
7 Andreola Software Engineer US Citizen Fully Meets
16 Beak Production Technician I Eligible NonCitizen Fully Meets
25 Bozzi Production Manager US Citizen Fully Meets
26 Bramante Director of Operations US Citizen Exceeds
27 Brill Production Technician I US Citizen Fully Meets
287 Tejeda Network Engineer Eligible NonCitizen Fully Meets
286 Tejeda Software Engineer Non-Citizen Fully Meets
300 Wang Production Technician I Eligible NonCitizen Fully Meets
308 Zhou Production Technician I US Citizen Fully Meets
309 Zima NaN NaN NaN

unique() 方法

unique 方法查找系列中的唯一值,并将唯一值作为数组返回。此方法不排除缺失值。

len(df["Employee_Name"])
Python

输出

310
Python
df["Employee_Name"].unique()
Python
array(['Adinolfi', 'Anderson', 'Andreola', 'Athwal', 'Beak', 'Bondwell',
'Bozzi', 'Bramante', 'Brill', 'Brown', 'Burkett', 'Butler',
'Carabbio', 'Carey', 'Carr', 'Carter', 'Chace', 'Champaigne',
'Chan', 'Chang', 'Chivukula', 'Cierpiszewski', 'Cisco', 'Clayton',
'Cloninger', 'Close', 'Clukey', 'Cockel', 'Cole', 'Cornett',
'Costa', 'Crimmings', 'Daneault', 'Daniele', 'Darson', 'Davis',
'DeGweck', 'Del Bosque', 'Demita', 'Desimone', 'DiNocco',
'Dickinson', 'Dietrich', 'Digitale', 'Dobrin', 'Dolan', 'Dougall',
'Dunn', 'Eaton', 'Employee_Name', 'Engdahl', 'England', 'Erilus',
'Estremera', 'Evensen', 'Exantus', 'Faller', 'Fancett', 'Favis',
'Ferguson', 'Fernandes', 'Ferreira', 'Fidelia', 'Fitzpatrick',
'Foreman', 'Foss', 'Foster-Baker', 'Fraval', 'Friedman', 'Galia',
'Garcia', 'Garneau', 'Gaul', 'Gentry', 'Gerke', 'Gill', 'Gonzales',
'Gonzalez', 'Good', 'Handschiegl', 'Hankard', 'Harrison',
'Heitzman', 'Horton', 'Houlihan', 'Howard', 'Hubert', 'Hunts',
'Hutter', 'Huynh', 'Immediato', 'Ivey', 'Jackson', 'Jacobi',
'Jeannite', 'Jeremy Prater', 'Jhaveri', 'Johnson', 'Johnston',
'Jung', 'Kampew', 'Keatts', 'Khemmich', 'King', 'Kinsella',
'Kirill', 'Knapp', 'Kretschmer', 'LaRotonda', 'Lajiri', 'Langford',
'Langton', 'Latif', 'Le', 'LeBel', 'LeBlanc', 'Leach', 'Leruth',
'Liebig', 'Linares', 'Linden', 'Lindsay', 'Lundy', 'Lunquist',
'Lydon', 'Lynch', 'MacLennan', 'Mahoney', 'Manchester', 'Mancuso',
'Mangal', 'Martin', 'Martins', 'Maurice', 'McCarthy', 'McKinzie',
'Mckenna', 'Meads', 'Medeiros', 'Merlos', 'Miller', 'Monkfish',
'Monroe', 'Monterro', 'Moran', 'Morway', 'Motlagh', 'Moumanil',
'Mullaney', 'Murray', 'Navathe', 'Ndzi', 'Newman', 'Ngodup',
'Nguyen', 'Nowlan', 'O'hare', 'Oliver', 'Onque', 'Osturnka',
'Owad', 'Ozark', 'Panjwani', 'Patronick', 'Pearson', 'Pelech',
'Pelletier', 'Perry', 'Peters', 'Peterson', 'Petingill',
'Petrowsky', 'Pham', 'Pitt', 'Potts', 'Power', 'Punjabhi',
'Purinton', 'Quinn', 'Rachael', 'Rarrick', 'Rhoads', 'Riordan',
'Rivera', 'Roberson', 'Robertson', 'Robinson', 'Roby', 'Roehrich',
'Rogers', 'Roper', 'Rose', 'Rossetti', 'Roup', 'Ruiz', 'Saada',
'Saar-Beckles', 'Sadki', 'Sahoo', 'Salter', 'Sander', 'Semizoglou',
'Sewkumar', 'Shepard', 'Shields', 'Simard', 'Singh', 'Sloan',
'Smith', 'Soto', 'South', 'Sparks', 'Spirea', 'Squatrito',
'Stanford', 'Stanley', 'Steans', 'Stoica', 'Strong', 'Sullivan',
'Sutwell', 'Sweetwater', 'Szabo', 'Tavares', 'Tejeda', 'Veera',
'VonMassenbach', 'Wallace', 'Wang', 'Zhou', 'Zima'], dtype=object)
Python
len(df["Employee_Name"].unique())
Python

输出

231
Python

.nunique() 方法

此方法返回系列中唯一值的数量。默认情况下,此方法使用参数dropna = True来排除缺失值。

您可以将参数dropna的值设为False,以保留缺失值。

df["Employee_Name"].nunique()
Python

输出

231
Python
df["Employee_Name"].nunique(dropna=False)
Python

输出

231
Python

Python教程

Java教程

Web教程

数据库教程

图形图像教程

大数据教程

开发工具教程

计算机教程

登录

注册