Identifying and Removing Duplicate Data in R

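A dataset may contain duplicate values. To keep it accurate and free of redundancy, duplicate rows need to be identified and removed. In this article, we look at how to identify and remove duplicate data in R. We first check whether duplicates exist in the data and, if they do, remove them.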

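Data in use

The examples use a small data frame, student_result, holding students' names and their marks in maths, science and history. The rows for Geeta and Paul each appear twice.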

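Identifying duplicate data

To identify duplicates, we use the duplicated() function. It returns a logical vector with one element per row, where TRUE marks a row that repeats an earlier row.

Syntax

duplicated(dataframe)

Approach:

Create the data frame
Pass it to the duplicated() function
The function returns a logical vector that flags the duplicate rows
Apply sum() to that vector to count the duplicates

Example: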

# Creating a sample data frame of students 
# and their marks in respective subjects.
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result
duplicated(student_result)
sum(duplicated(student_result))

Output

duplicated(student_result)

[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

sum(duplicated(student_result))

[1] 2
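
Beyond counting, the logical vector can also be used to index the data frame and inspect the duplicated rows themselves. The following is a minimal sketch on the same student_result data; the variable names dup_flags, dup_rows and all_copies are chosen here for illustration, and fromLast is a standard argument of duplicated() that scans from the last row backwards.

```r
# Same sample data as in the example above
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Logical vector: TRUE marks a row that repeats an earlier row
dup_flags <- duplicated(student_result)

# Rows flagged as duplicates (the later occurrences only)
dup_rows <- student_result[dup_flags, ]
print(dup_rows)

# Every copy of a repeated row, first occurrences included:
# combine the forward scan with a scan from the last row
all_copies <- student_result[
  dup_flags | duplicated(student_result, fromLast = TRUE),
]
print(all_copies)
```

Combining both scans is useful when the copies need to be compared side by side before deciding which one to keep.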

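Removing duplicate data

Approach

Create the data frame
Select the unique rows
Retrieve those rows
Display the result

Method 1: Using unique()

We use unique() to get the rows of the data with duplicates removed; the first occurrence of each row is kept.

Syntax

unique(dataframe)

Example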

# Creating a sample data frame of students 
# and their marks in respective subjects.
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result
unique(student_result)

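Output

unique(student_result)

    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7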
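
For a data frame, unique() and duplicated() are closely related: dropping the rows that duplicated() flags should give the same result as unique(). The sketch below illustrates this on the same data; deduped_unique and deduped_index are illustrative names, not part of any API.

```r
# Same sample data as in the example above
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Remove duplicates with unique()
deduped_unique <- unique(student_result)

# Remove duplicates by negating the duplicated() flags
deduped_index <- student_result[!duplicated(student_result), ]

# Compare the two results
print(identical(deduped_unique, deduped_index))
```

The indexing form is handy when the same flags are also needed elsewhere, for example to log which rows were dropped.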

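Method 2: Using distinct()

distinct() comes from the dplyr package, which is part of the tidyverse. Install the package and load it with library(dplyr) before calling distinct(). We use distinct() to get the distinct rows of the data.

Syntax

distinct(dataframe, ..., .keep_all = FALSE)

Parameters

dataframe: the data in use
...: optional columns used to decide whether rows are distinct; if omitted, all columns are used
.keep_all: if TRUE, all columns are kept in the result, not only the ones used to decide distinctness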
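
To make the effect of the keep-all option concrete, here is a small sketch that calls distinct() on the maths column with and without it. It assumes dplyr is installed; maths_only and maths_all_cols are illustrative names.

```r
library(dplyr)

# Same sample data as in the other examples
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Default (.keep_all = FALSE): only the maths column is returned
maths_only <- distinct(student_result, maths)
print(maths_only)

# .keep_all = TRUE: the first row for each maths value is kept,
# together with all of its other columns
maths_all_cols <- distinct(student_result, maths, .keep_all = TRUE)
print(maths_all_cols)
```

Both calls return one row per distinct maths value; they differ only in which columns come back.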

Example

# Load dplyr, which provides distinct()
library(dplyr)

# Creating a sample data frame of students and
# their marks in respective subjects.
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result
distinct(student_result)

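Output

distinct(student_result)

    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7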

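Example 2: Print the rows that are unique with respect to the maths column

# Load dplyr, which provides distinct()
library(dplyr)

# Creating a sample data frame of students and
# their marks in respective subjects.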
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result
distinct(student_result,maths,.keep_all = TRUE)
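
If dplyr is not available, a similar column-based result can be approximated in base R by applying duplicated() to a single column. This is a sketch of an alternative, not the method used above; unique_by_maths is an illustrative name.

```r
# Same sample data as in the other examples
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Keep the first row seen for each maths value, with all columns
unique_by_maths <- student_result[!duplicated(student_result$maths), ]
print(unique_by_maths)
```

Unlike distinct(), this form keeps the original row numbers, so the dropped positions remain visible in the result.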