如何在Pandas中找到并筛选重复行?
有时,在数据分析期间,我们需要看重复行以更好地理解数据,而不是直接删除它们。
幸运的是,在Pandas中,我们有几种方法可以处理重复行。
更多Pandas相关文章,请阅读:Pandas 教程
.duplciated()
此方法允许我们在DataFrame中提取重复行。我们将使用一个包含重复项的新数据集。
Employee_Name | Position | CitizenDesc | PerformanceScore | |
---|---|---|---|---|
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
duplicated() 默认工作方式是通过 keep 参数,将每个值的第一次出现标记为非重复项。
如果一行不止一次重复出现,该方法不会将该行标记为重复项,而是将第一行后续的每行标记为重复项。混淆了吧?让我再用一个例子来解释一下,假设篮子里有 3 个苹果,这个方法将第一个苹果标记为非重复项,将其他两个苹果标记为重复项。
例子
输出
例子
输出
现在,要提取重复项(记住第一次出现不是重复项,而是后续出现的才是重复项),我们需要将此方法传递到数据框中。
Employee_Name | Position | CitizenDesc | PerformanceScore | |
---|---|---|---|---|
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
4 | Adinolfi | Production Manager | US Citizen | Fully Meets |
6 | Anderson | Production Technician I | US Citizen | Exceeds |
… | … | … | … | … |
303 | Wang | Production Technician II | US Citizen | Fully Meets |
304 | Wang | Production Technician II | US Citizen | Fully Meets |
305 | Wang | Production Technician I | US Citizen | PIP |
306 | Wang | CIO | US Citizen | Exceeds |
307 | Wang | Data Analyst | US Citizen | Fully Meets |
从上面的输出中,使用.duplicated()方法提取出的有79个重复值,共310行。
参数 -“last”
默认情况下,该方法将标记值的第一个出现为非重复项,我们可以通过传递keep = last参数来更改这种行为。
这个参数会将前两个苹果标记为重复项,将最后一个标记为非重复项。
Employee_Name | Position | CitizenDesc | PerformanceScore | |
---|---|---|---|---|
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
5 | Anderson | Production Technician I | US Citizen | Fully Meets |
… | … | … | … | … |
302 | Wang | Production Technician II | US Citizen | Exceeds |
303 | Wang | Production Technician II | US Citizen | Fully Meets |
304 | Wang | Production Technician II | US Citizen | Fully Meets |
305 | Wang | Production Technician I | US Citizen | PIP |
306 | Wang | CIO | US Citizen | Exceeds |
ARGUMENT – FALSE
keep参数还可以接受一个额外的”false”参数,将所有出现多次的值标记为重复项,例如在上面的示例中,所有的3个苹果均将被标记为重复项,而不仅是第一个或最后一个。
注意 – 在指定false参数时不要使用引号。
Employee_Name | Position | CitizenDesc | PerformanceScore | |
---|---|---|---|---|
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
1 | Adinolfi | Sr. DBA | US Citizen | Fully Meets |
2 | Adinolfi | Production Technician II | US Citizen | Fully Meets |
3 | Adinolfi | Production Technician I | US Citizen | Fully Meets |
4 | Adinolfi | Production Manager | US Citizen | Fully Meets |
… | … | … | … | … |
303 | Wang | Production Technician II | US Citizen | Fully Meets |
304 | Wang | Production Technician II | US Citizen | Fully Meets |
305 | Wang | Production Technician I | US Citizen | PIP |
306 | Wang | CIO | US Citizen | Exceeds |
307 | Wang | Data Analyst | US Citizen | Fully Meets |
现在最终,要从数据集中提取独特的值,我们可以使用“~”(波浪线)符号对值进行否定:
Employee_Name | Position | CitizenDesc | PerformanceScore | |
---|---|---|---|---|
7 | Andreola | Software Engineer | US Citizen | Fully Meets |
25 | Bozzi | Production Manager | US Citizen | Fully Meets |
26 | Bramante | Director of Operations | US Citizen | Exceeds |
27 | Brill | Production Technician I | US Citizen | Fully Meets |
34 | Burkett | Production Technician II | US Citizen | Fully Meets |
… | … | … | … | … |
276 | Sweetwater | Software Engineer | US Citizen | Exceeds |
277 | Szabo | Production Technician I | Non-Citizen | Fully Meets |
278 | Tavares | Production Technician II | US Citizen | Fully Meets |
308 | Zhou | Production Technician I | US Citizen | Fully Meets |
309 | Zima | NaN | NaN | NaN |
drop_duplicates()
此方法与上一方法非常相似,但此方法可应用于DataFrame而不仅是单个系列。
注意:此方法查找DataFrame的所有列中的重复行并将其删除。
输出结果
输出结果
子集参数
子集参数接受作为字符串值的列名列表,我们可以在其中检查重复项。
Employee_Name | Position | CitizenDesc | PerformanceScore | |
---|---|---|---|---|
0 | Adinolfi | Production Technician I | US Citizen | Exceeds |
5 | Anderson | Production Technician I | US Citizen | Fully Meets |
7 | Andreola | Software Engineer | US Citizen | Fully Meets |
14 | Athwal | Production Technician I | US Citizen | Fully Meets |
20 | Beak | Production Technician I | US Citizen | Fully Meets |
… | … | … | … | … |
293 | Von Massenbach | Production Technician II | US Citizen | Fully Meets |
295 | Wallace | Production Technician I | US Citizen | Needs Improvement |
300 | Wang | Production Technician I | Eligible NonCitizen | Fully Meets |
308 | Zhou | Production Technician I | US Citizen | Fully Meets |
309 | Zima | NaN | NaN | NaN |
我们可以指定多个列并使用上一节中讨论的所有保留参数。
Employee_Name | Position | CitizenDesc | PerformanceScore | |
---|---|---|---|---|
7 | Andreola | Software Engineer | US Citizen | Fully Meets |
16 | Beak | Production Technician I | Eligible NonCitizen | Fully Meets |
25 | Bozzi | Production Manager | US Citizen | Fully Meets |
26 | Bramante | Director of Operations | US Citizen | Exceeds |
27 | Brill | Production Technician I | US Citizen | Fully Meets |
… | … | … | … | … |
287 | Tejeda | Network Engineer | Eligible NonCitizen | Fully Meets |
286 | Tejeda | Software Engineer | Non-Citizen | Fully Meets |
300 | Wang | Production Technician I | Eligible NonCitizen | Fully Meets |
308 | Zhou | Production Technician I | US Citizen | Fully Meets |
309 | Zima | NaN | NaN | NaN |
unique() 方法
unique 方法查找系列中的唯一值,并将唯一值作为数组返回。此方法不排除缺失值。
输出
输出
.nunique() 方法
此方法返回系列中唯一值的数量。默认情况下,此方法使用参数dropna = True来排除缺失值。
您可以将参数dropna的值设为False,以保留缺失值。