如何在Pandas中找到并筛选重复行?
有时,在数据分析期间,我们需要看重复行以更好地理解数据,而不是直接删除它们。
幸运的是,在Pandas中,我们有几种方法可以处理重复行。
更多Pandas相关文章,请阅读:Pandas 教程
.duplciated()
此方法允许我们在DataFrame中提取重复行。我们将使用一个包含重复项的新数据集。
import pandas as pd
import numpy as np
# 使用特定列导入HR数据集
df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/HRDataset.csv",
usecols = ("Employee_Name""PerformanceScore","Position","CitizenDesc"))
# 按雇员姓名排序并使其永久化
df.sort_values("Employee_Name"inplace = True)
df.head(3)
| Employee_Name | Position | CitizenDesc | PerformanceScore | |
|---|---|---|---|---|
| 0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
| 1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
| 2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
duplicated() 默认工作方式是通过 keep 参数,将每个值的第一次出现标记为非重复项。
如果一行不止一次重复出现,该方法不会将该行标记为重复项,而是将第一行后续的每行标记为重复项。混淆了吧?让我再用一个例子来解释一下,假设篮子里有 3 个苹果,这个方法将第一个苹果标记为非重复项,将其他两个苹果标记为重复项。
例子
df["Employee_Name"].head(3)
输出
0 Adinolfi
1 Adinolfi
2 Adinolfi
Name: Employee_Name, dtype: object
例子
df["Employee_Name"].duplicated().head(3)
输出
0 False
1 True
2 True
Name: Employee_Name, dtype: bool
现在,要提取重复项(记住第一次出现不是重复项,而是后续出现的才是重复项),我们需要将此方法传递到数据框中。
df.shape
(310, 4)
df[df["Employee_Name"].duplicated()]
| Employee_Name | Position | CitizenDesc | PerformanceScore | |
|---|---|---|---|---|
| 1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
| 2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
| 3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
| 4 | Adinolfi | Production Manager | US Citizen | Fully Meets |
| 6 | Anderson | Production Technician I | US Citizen | Exceeds |
| … | … | … | … | … |
| 303 | Wang | Production Technician II | US Citizen | Fully Meets |
| 304 | Wang | Production Technician II | US Citizen | Fully Meets |
| 305 | Wang | Production Technician I | US Citizen | PIP |
| 306 | Wang | CIO | US Citizen | Exceeds |
| 307 | Wang | Data Analyst | US Citizen | Fully Meets |
从上面的输出中,使用.duplicated()方法提取出的有79个重复值,共310行。
参数 -“last”
默认情况下,该方法将标记值的第一个出现为非重复项,我们可以通过传递keep = last参数来更改这种行为。
这个参数会将前两个苹果标记为重复项,将最后一个标记为非重复项。
df[df["Employee_Name"].duplicated(keep="last")]
| Employee_Name | Position | CitizenDesc | PerformanceScore | |
|---|---|---|---|---|
| 0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
| 1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
| 2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
| 3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
| 5 | Anderson | Production Technician I | US Citizen | Fully Meets |
| … | … | … | … | … |
| 302 | Wang | Production Technician II | US Citizen | Exceeds |
| 303 | Wang | Production Technician II | US Citizen | Fully Meets |
| 304 | Wang | Production Technician II | US Citizen | Fully Meets |
| 305 | Wang | Production Technician I | US Citizen | PIP |
| 306 | Wang | CIO | US Citizen | Exceeds |
ARGUMENT – FALSE
keep参数还可以接受一个额外的”false”参数,将所有出现多次的值标记为重复项,例如在上面的示例中,所有的3个苹果均将被标记为重复项,而不仅是第一个或最后一个。
注意 – 在指定false参数时不要使用引号。
df[df"Employee_Name"].duplicated(keep=False)]
| Employee_Name | Position | CitizenDesc | PerformanceScore | |
|---|---|---|---|---|
| 0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
| 1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
| 2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
| 3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
| 4 | Adinolfi | Production Manager | US Citizen | Fully Meets |
| … | … | … | … | … |
| 303 | Wang | Production Technician II | US Citizen | Fully Meets |
| 304 | Wang | Production Technician II | US Citizen | Fully Meets |
| 305 | Wang | Production Technician I | US Citizen | PIP |
| 306 | Wang | CIO | US Citizen | Exceeds |
| 307 | Wang | Data Analyst | US Citizen | Fully Meets |
现在最终,要从数据集中提取独特的值,我们可以使用“~”(波浪线)符号对值进行否定:
df_unique~df["Employee_Name"].duplicated(keep=False)df[df_unique]
| Employee_Name | Position | CitizenDesc | PerformanceScore | |
|---|---|---|---|---|
| 7 | Andreola | Software Engineer | US Citizen | Fully Meets |
| 25 | Bozzi | Production Manager | US Citizen | Fully Meets |
| 26 | Bramante | Director of Operations | US Citizen | Exceeds |
| 27 | Brill | Production Technician I | US Citizen | Fully Meets |
| 34 | Burkett | Production Technician II | US Citizen | Fully Meets |
| … | … | … | … | … |
| 276 | Sweetwater | Software Engineer | US Citizen | Exceeds |
| 277 | Szabo | Production Technician I | Non-Citizen | Fully Meets |
| 278 | Tavares | Production Technician II | US Citizen | Fully Meets |
| 308 | Zhou | Production Technician I | US Citizen | Fully Meets |
| 309 | Zima | NaN | NaN | NaN |
drop_duplicates()
此方法与上一方法非常相似,但此方法可应用于DataFrame而不仅是单个系列。
注意:此方法查找DataFrame的所有列中的重复行并将其删除。
len(df)
输出结果
310
len(df.drop_duplicates())
输出结果
290
子集参数
子集参数接受作为字符串值的列名列表,我们可以在其中检查重复项。
df1=df.drop_duplicates(subset=["Employee_Name"],keep="first")df1
| Employee_Name | Position | CitizenDesc | PerformanceScore | |
|---|---|---|---|---|
| 0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
| 5 | Anderson | Production Technician I | US Citizen | Fully Meets |
| 7 | Andreola | Software Engineer | US Citizen | Fully Meets |
| 14 | Athwal | Production Technician I | US Citizen | Fully Meets |
| 20 | Beak | Production Technician I | US Citizen | Fully Meets |
| … | … | … | … | … |
| 293 | Von Massenbach | Production Technician II | US Citizen | Fully Meets |
| 295 | Wallace | Production Technician I | US Citizen | Needs Improvement |
| 300 | Wang | Production Technician I | Eligible NonCitizen | Fully Meets |
| 308 | Zhou | Production Technician I | US Citizen | Fully Meets |
| 309 | Zima | NaN | NaN | NaN |
我们可以指定多个列并使用上一节中讨论的所有保留参数。
df1=df.drop_duplicates(subset="Employee_Name""CitizenDesc"],keep=False)df1
| Employee_Name | Position | CitizenDesc | PerformanceScore | |
|---|---|---|---|---|
| 7 | Andreola | Software Engineer | US Citizen | Fully Meets |
| 16 | Beak | Production Technician I | Eligible NonCitizen | Fully Meets |
| 25 | Bozzi | Production Manager | US Citizen | Fully Meets |
| 26 | Bramante | Director of Operations | US Citizen | Exceeds |
| 27 | Brill | Production Technician I | US Citizen | Fully Meets |
| … | … | … | … | … |
| 287 | Tejeda | Network Engineer | Eligible NonCitizen | Fully Meets |
| 286 | Tejeda | Software Engineer | Non-Citizen | Fully Meets |
| 300 | Wang | Production Technician I | Eligible NonCitizen | Fully Meets |
| 308 | Zhou | Production Technician I | US Citizen | Fully Meets |
| 309 | Zima | NaN | NaN | NaN |
unique() 方法
unique 方法查找系列中的唯一值,并将唯一值作为数组返回。此方法不排除缺失值。
len(df["Employee_Name"])
输出
310
df["Employee_Name"].unique()
array(['Adinolfi', 'Anderson', 'Andreola', 'Athwal', 'Beak', 'Bondwell',
'Bozzi', 'Bramante', 'Brill', 'Brown', 'Burkett', 'Butler',
'Carabbio', 'Carey', 'Carr', 'Carter', 'Chace', 'Champaigne',
'Chan', 'Chang', 'Chivukula', 'Cierpiszewski', 'Cisco', 'Clayton',
'Cloninger', 'Close', 'Clukey', 'Cockel', 'Cole', 'Cornett',
'Costa', 'Crimmings', 'Daneault', 'Daniele', 'Darson', 'Davis',
'DeGweck', 'Del Bosque', 'Demita', 'Desimone', 'DiNocco',
'Dickinson', 'Dietrich', 'Digitale', 'Dobrin', 'Dolan', 'Dougall',
'Dunn', 'Eaton', 'Employee_Name', 'Engdahl', 'England', 'Erilus',
'Estremera', 'Evensen', 'Exantus', 'Faller', 'Fancett', 'Favis',
'Ferguson', 'Fernandes', 'Ferreira', 'Fidelia', 'Fitzpatrick',
'Foreman', 'Foss', 'Foster-Baker', 'Fraval', 'Friedman', 'Galia',
'Garcia', 'Garneau', 'Gaul', 'Gentry', 'Gerke', 'Gill', 'Gonzales',
'Gonzalez', 'Good', 'Handschiegl', 'Hankard', 'Harrison',
'Heitzman', 'Horton', 'Houlihan', 'Howard', 'Hubert', 'Hunts',
'Hutter', 'Huynh', 'Immediato', 'Ivey', 'Jackson', 'Jacobi',
'Jeannite', 'Jeremy Prater', 'Jhaveri', 'Johnson', 'Johnston',
'Jung', 'Kampew', 'Keatts', 'Khemmich', 'King', 'Kinsella',
'Kirill', 'Knapp', 'Kretschmer', 'LaRotonda', 'Lajiri', 'Langford',
'Langton', 'Latif', 'Le', 'LeBel', 'LeBlanc', 'Leach', 'Leruth',
'Liebig', 'Linares', 'Linden', 'Lindsay', 'Lundy', 'Lunquist',
'Lydon', 'Lynch', 'MacLennan', 'Mahoney', 'Manchester', 'Mancuso',
'Mangal', 'Martin', 'Martins', 'Maurice', 'McCarthy', 'McKinzie',
'Mckenna', 'Meads', 'Medeiros', 'Merlos', 'Miller', 'Monkfish',
'Monroe', 'Monterro', 'Moran', 'Morway', 'Motlagh', 'Moumanil',
'Mullaney', 'Murray', 'Navathe', 'Ndzi', 'Newman', 'Ngodup',
'Nguyen', 'Nowlan', 'O'hare', 'Oliver', 'Onque', 'Osturnka',
'Owad', 'Ozark', 'Panjwani', 'Patronick', 'Pearson', 'Pelech',
'Pelletier', 'Perry', 'Peters', 'Peterson', 'Petingill',
'Petrowsky', 'Pham', 'Pitt', 'Potts', 'Power', 'Punjabhi',
'Purinton', 'Quinn', 'Rachael', 'Rarrick', 'Rhoads', 'Riordan',
'Rivera', 'Roberson', 'Robertson', 'Robinson', 'Roby', 'Roehrich',
'Rogers', 'Roper', 'Rose', 'Rossetti', 'Roup', 'Ruiz', 'Saada',
'Saar-Beckles', 'Sadki', 'Sahoo', 'Salter', 'Sander', 'Semizoglou',
'Sewkumar', 'Shepard', 'Shields', 'Simard', 'Singh', 'Sloan',
'Smith', 'Soto', 'South', 'Sparks', 'Spirea', 'Squatrito',
'Stanford', 'Stanley', 'Steans', 'Stoica', 'Strong', 'Sullivan',
'Sutwell', 'Sweetwater', 'Szabo', 'Tavares', 'Tejeda', 'Veera',
'VonMassenbach', 'Wallace', 'Wang', 'Zhou', 'Zima'], dtype=object)
len(df["Employee_Name"].unique())
输出
231
.nunique() 方法
此方法返回系列中唯一值的数量。默认情况下,此方法使用参数dropna = True来排除缺失值。
您可以将参数dropna的值设为False,以保留缺失值。
df["Employee_Name"].nunique()
输出
231
df["Employee_Name"].nunique(dropna=False)
输出
231
极客教程