Drop rows from a Pandas DataFrame that have missing values or NaN in columns
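Pandas provides various data structures and operations for working with numerical data and time series. In practice, however, some of the data may be missing. Pandas represents missing data with two values:
- None: None is a Python singleton object that is often used to mark missing data in Python code.
- NaN: NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To remove null values from a DataFrame, we use the dropna() function, which drops the rows or columns that contain null values in several different ways.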
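As a small illustration of this interchangeability, the sketch below builds a Series that contains both None and np.nan and checks how Pandas reports them. The variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

# A numeric Series built with both None and np.nan.
# Pandas converts the column to float64, so None is stored as NaN.
s = pd.Series([1.0, None, np.nan, 4.0])

missing_mask = s.isna()   # True wherever the value is missing
cleaned = s.dropna()      # keeps only the non-missing entries

print(missing_mask.tolist())  # [False, True, True, False]
print(cleaned.tolist())       # [1.0, 4.0]
```

Both placeholders are flagged by isna() and removed by dropna(), which is why the rest of this article uses np.nan only.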
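Syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters:
axis: selects whether rows or columns are dropped. It takes an int or a string: 0 or 'index' drops rows, 1 or 'columns' drops columns.
how: accepts only two string values, 'any' or 'all'. 'any' drops the row/column if any of its values is null; 'all' drops it only if all of its values are null.
thresh: an integer giving the minimum number of non-NA values a row/column must contain in order to be kept. In recent pandas versions, how and thresh cannot be passed together.
subset: a label or list of labels along the other axis that restricts which columns (when dropping rows) or rows (when dropping columns) are checked for null values.
inplace: a boolean. If True, the change is made in the DataFrame itself and the method returns None.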
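The numbered examples in this article do not exercise thresh, subset, or inplace, so here is a minimal sketch of those three parameters. It reuses the score data from Code #1; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, 40, 80, 98],
    "Fourth Score": [np.nan, np.nan, np.nan, 65],
})

# thresh=3: keep only rows that have at least 3 non-missing values.
at_least_three = df.dropna(thresh=3)

# subset: only "First Score" is checked when deciding which rows to drop.
first_present = df.dropna(subset=["First Score"])

# inplace=True: the copy itself is modified and the call returns None.
in_place = df.copy()
returned = in_place.dropna(inplace=True)

print(at_least_three.index.tolist())  # [0, 3]
print(first_present.index.tolist())   # [0, 1, 3]
print(returned, in_place.index.tolist())  # None [3]
```

Rows 1 and 2 each have only two non-missing values, so thresh=3 removes them, while subset keeps row 1 because its First Score is present.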
Code #1: Dropping rows that have at least 1 null value.
import numpy as np
import pandas as pd

# Dictionary of lists; np.nan marks a missing value
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# Create a DataFrame from the dictionary and display it
df = pd.DataFrame(scores)
print(df)
Now we drop the rows that have at least one NaN (null) value.
import numpy as np
import pandas as pd

# Dictionary of lists; np.nan marks a missing value
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(scores)

# dropna() with default arguments drops every row containing a NaN
print(df.dropna())
Output: only the row with index 3 remains, because it is the only row with no missing values.
Code #2: Dropping rows only if all of their values are missing.
import numpy as np
import pandas as pd

# Dictionary of lists; the second entry of every list is missing
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# Create a DataFrame from the dictionary and display it
df = pd.DataFrame(scores)
print(df)
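Now we drop the rows in which every value is missing (NaN).
import numpy as np
import pandas as pd

# Dictionary of lists; the second entry of every list is missing
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(scores)

# how='all' drops a row only when all of its values are NaN
print(df.dropna(how='all'))
Output: the row with index 1, in which every value is NaN, is removed. Rows 0, 2, and 3 are kept even though they still contain some NaN values.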
Code #3: Dropping columns that have at least 1 null value.
import numpy as np
import pandas as pd

# Dictionary of lists; only 'Fourth Score' has no missing values
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}

# Create a DataFrame from the dictionary and display it
df = pd.DataFrame(scores)
print(df)
Now we drop the columns that have at least one missing value.
import numpy as np
import pandas as pd

# Dictionary of lists; only 'Fourth Score' has no missing values
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}

df = pd.DataFrame(scores)

# axis=1 makes dropna() operate on columns instead of rows
print(df.dropna(axis=1))
Output: only the Fourth Score column remains, because each of the other three columns contains at least one NaN.
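Before dropping columns, it can help to see how many values are missing in each one. The following sketch, which assumes the same data as Code #3, counts the missing values per column with isna().sum() and compares the result with what dropna(axis=1) keeps. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "First Score": [100, np.nan, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, np.nan, 80, 98],
    "Fourth Score": [60, 67, 68, 65],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Columns that contain at least one missing value
columns_to_drop = missing_per_column[missing_per_column > 0].index.tolist()

# Columns that survive dropna(axis=1)
remaining = df.dropna(axis=1).columns.tolist()

print(missing_per_column.to_dict())
print(columns_to_drop)  # ['First Score', 'Second Score', 'Third Score']
print(remaining)        # ['Fourth Score']
```

Every column with a nonzero count is removed, so a column with a single missing value is dropped just like one that is mostly empty.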
Code #4: Dropping rows with at least 1 null value in a CSV file.
import pandas as pd

# Load the CSV file into a DataFrame
data = pd.read_csv("employees.csv")

# Build a new DataFrame without the rows that contain NA values
new_data = data.dropna(axis=0, how='any')
print(new_data)
Now we compare the sizes of the two data frames to find out how many rows had at least one null value.
print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value:",
      len(data) - len(new_data))
Output:
Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value: 236
Since the difference is 236, there were 236 rows with at least one null value in some column.
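The same count can also be obtained directly, without comparing lengths. The sketch below is an assumption-laden stand-in: because employees.csv is not available here, it uses a small hypothetical DataFrame with made-up column names and values, and it checks that isna().any(axis=1).sum() agrees with the length difference.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the contents of employees.csv
data = pd.DataFrame({
    "Name": ["Ann", None, "Cid", "Dee"],
    "Team": ["Sales", "Legal", None, "Sales"],
    "Salary": [50000.0, 62000.0, np.nan, 58000.0],
})

new_data = data.dropna(axis=0, how="any")

# Count the rows that contain at least one missing value directly
rows_with_na = int(data.isna().any(axis=1).sum())

print("Rows with at least 1 NA value:", rows_with_na)
print("Length difference:", len(data) - len(new_data))
```

Both numbers should match for any DataFrame, since dropna(how="any") removes exactly the rows that isna().any(axis=1) flags.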