Working with Missing Data in Pandas
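Missing data occurs when no information is provided for one or more items, or for an entire unit. It is a major problem in real-world data. In pandas, missing values are also referred to as NA (Not Available) values. Many datasets arrive in a DataFrame with missing data, either because the data exists but was not collected or because it never existed. For example, some survey respondents may choose not to share their income and others may choose not to share their address, so those fields end up missing from the dataset.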
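Pandas represents missing data with two values:
- None: a Python singleton object that is often used for missing data in Python code.
- NaN: NaN (short for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To support this convention, a pandas DataFrame provides several useful functions for detecting, removing, and replacing null values: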
- isnull()
- notnull()
- dropna()
- fillna()
- replace()
- interpolate()
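As a minimal sketch of the None/NaN convention described above, the snippet below builds two small Series (the values and the names `s` and `names` are made up for illustration) and checks that isnull() flags both kinds of missing value.

```python
import numpy as np
import pandas as pd

# None and np.nan in a numeric column: pandas stores both as NaN
s = pd.Series([1.0, None, np.nan, 4.0])
print(s.dtype)
print(s.isnull().tolist())

# In an object column None is kept as-is, yet isnull() still flags it
names = pd.Series(["Ann", None, np.nan], dtype=object)
missing_mask = names.isnull().tolist()
print(missing_mask)
```

In the numeric Series the None is converted to NaN, which is why the column ends up with a floating-point dtype.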
Some of the examples in this article read their data from a CSV file named employees.csv.
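Checking for missing values using isnull() and notnull()
To check for missing values in a pandas DataFrame, we use the functions isnull() and notnull(). Both check whether a value is missing (NaN or None). They can also be called on a pandas Series to find the null values in that Series.
Checking for missing values using isnull()
To check for null values in a pandas DataFrame, we use the isnull() function, which returns a DataFrame of Boolean values that are True wherever the value is NaN.
Code #1: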
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
# using isnull() function
df.isnull()
Output:
Code #2:
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# creating bool series True for NaN values
bool_series = pd.isnull(data["Gender"])
# filtering data
# displaying data only with Gender = NaN
data[bool_series]
Output:
As the output shows, only the rows where Gender is null are displayed.
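Since employees.csv is not bundled with the text, the same filtering step can be sketched on a small stand-in frame. The column names and values below are hypothetical and only mimic the shape of the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv (made-up rows)
data = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria", "Jerry"],
    "Gender": ["Male", np.nan, "Female", None],
})

# Boolean mask: True where Gender is missing
bool_series = data["Gender"].isnull()

# Keep only the rows with a missing Gender
missing_gender = data[bool_series]
print(missing_gender)
print("rows with missing Gender:", int(bool_series.sum()))
```

Summing the mask works because True counts as 1, which gives a quick count of the affected rows.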
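Checking for missing values using notnull()
To check for null values in a pandas DataFrame, we can also use the notnull() function, which returns a DataFrame of Boolean values that are False wherever the value is NaN.
Code #3: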
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
# using notnull() function
df.notnull()
Output:
Code #4:
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# creating bool series False for NaN values
bool_series = pd.notnull(data["Gender"])
# filtering data
# displaying data only with Gender = Not NaN
data[bool_series]
Output:
As the output shows, only the rows where Gender is not null are displayed.
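The relationship between the two checks can be illustrated with a short sketch on the scores frame used earlier: notnull() is the element-wise negation of isnull(), and summing either result gives per-column counts. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                       "Second Score": [30, 45, 56, np.nan],
                       "Third Score": [np.nan, 40, 80, 98]})

# notnull() is the element-wise negation of isnull()
same = scores.notnull().equals(~scores.isnull())

# Summing a Boolean frame counts the True entries in each column
present_per_column = scores.notnull().sum()
missing_per_column = scores.isnull().sum()
print(same)
print(present_per_column)
print(missing_per_column)
```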
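Filling missing values using fillna(), replace() and interpolate()
To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions, which replace NaN values with other values. interpolate() also fills null values in a DataFrame, but it estimates them with one of several interpolation techniques instead of using a hard-coded value.
Code #1: Filling null values with a single value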
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
# filling missing value using fillna()
df.fillna(0)
Output:
Code #2: Filling null values with the previous value
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
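# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
# filling each missing value with the previous one (forward fill);
# ffill() replaces fillna(method='pad'), which newer pandas releases deprecate
df.ffill()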
Output:
Code #3: Filling null values with the next value
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
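# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
# filling each missing value with the next one (backward fill);
# bfill() replaces fillna(method='bfill'), which newer pandas releases deprecate
df.bfill()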
Output:
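The three fills covered so far (a constant, the previous value, the next value) can be compared side by side on the same frame. This is a sketch that uses the ffill() and bfill() methods; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, 45, 56, np.nan],
                   "Third Score": [np.nan, 40, 80, 98]})

filled_zero = df.fillna(0)      # every NaN becomes 0
filled_forward = df.ffill()     # every NaN takes the value above it
filled_backward = df.bfill()    # every NaN takes the value below it

print(filled_zero)
print(filled_forward)
print(filled_backward)
```

A forward fill cannot fill a NaN in the first row and a backward fill cannot fill one in the last row, because there is no neighbour to copy from. That is why 'Third Score' keeps its leading NaN after ffill() and 'Second Score' keeps its trailing NaN after bfill().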
Code #4: Filling null values in a CSV file
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# Displaying rows 10 through 24 of
# the data frame for visualization
data[10:25]
# filling null values in the Gender column using fillna();
# assigning the result back avoids calling inplace=True on a column selection
data["Gender"] = data["Gender"].fillna("No Gender")
data
Output:
Code #5: Filling null values using the replace() method
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# Displaying rows 10 through 24 of
# the data frame for visualization
data[10:25]
Output:
Now we replace all the NaN values in the data frame with the value -99.
# importing pandas and numpy (np.nan is used below)
import pandas as pd
import numpy as np
# making data frame from csv file
data = pd.read_csv("employees.csv")
# replace every NaN value in the dataframe with -99
data.replace(to_replace = np.nan, value = -99)
Output:
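The replace() call can also be tried without the CSV file. The sketch below uses a small stand-in frame whose column names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few numeric columns of employees.csv
data = pd.DataFrame({"Salary": [97308.0, np.nan, 130590.0],
                     "Bonus %": [6.945, 4.17, np.nan]})

# Replace every NaN with the sentinel value -99
replaced = data.replace(to_replace=np.nan, value=-99)
print(replaced)
```

A sentinel such as -99 stays in the column as an ordinary number, so it takes part in later sums and means unless it is filtered out first.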
Code #6: Filling missing values with the linear method using the interpolate() function
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 5, None, 1],
"B":[None, 2, 54, 3, None],
"C":[20, 16, None, 3, 8],
"D":[14, 3, None, None, 6]})
# Print the dataframe
df
# to interpolate the missing values
df.interpolate(method ='linear', limit_direction ='forward')
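Output:
As the output shows, the missing value in the first row could not be filled: the fill direction is forward, and there is no earlier value to interpolate from.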
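The interpolated values can be checked by hand. With the linear method each interior gap is filled with evenly spaced values between its neighbours, so the sketch below rebuilds the same frame and prints the result; the name `filled` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

filled = df.interpolate(method="linear", limit_direction="forward")
print(filled)

# Column D: the two gaps between 3 and 6 become 4 and 5
print(filled["D"].tolist())

# Column B: the leading NaN has no earlier neighbour and stays NaN
print(filled["B"].isnull().tolist())
```

In column A the gap between 5 and 1 becomes 3, and in column C the gap between 16 and 3 becomes 9.5. The trailing NaN in column B is filled with the last valid value, 3, while the leading one is left untouched.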
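Dropping missing values using dropna()
To drop null values from a DataFrame, we use the dropna() function, which can drop the rows or columns that contain null values in several different ways.
Code #1: Dropping rows with at least one null value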
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
df
# using dropna() function
df.dropna()
Output:
Code #2: Dropping rows in which all values are missing
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
df
# using dropna() function
df.dropna(how = 'all')
Output:
Code #3: Dropping columns with at least one null value
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores so the built-in dict is not shadowed)
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}
# creating a dataframe from the dictionary
df = pd.DataFrame(scores)
df
# using dropna() function
df.dropna(axis = 1)
Output:
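To see how the three dropna() variants differ, they can all be applied to one frame. The sketch below reuses the frame from Code #2, whose second row is entirely missing; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"First Score": [100, np.nan, np.nan, 95],
                   "Second Score": [30, np.nan, 45, 56],
                   "Third Score": [52, np.nan, 80, 98],
                   "Fourth Score": [np.nan, np.nan, np.nan, 65]})

any_rows = df.dropna()             # drop rows with at least one NaN
all_rows = df.dropna(how="all")    # drop rows where every value is NaN
any_cols = df.dropna(axis=1)       # drop columns with at least one NaN

print(any_rows.index.tolist())
print(all_rows.index.tolist())
print(any_cols.shape)
```

Only the last row survives the default call, how="all" removes just the fully empty second row, and axis=1 removes every column here because that empty row puts a NaN in each of them.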
Code #4: Dropping rows with at least one null value in a CSV file
# importing pandas module
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
new_data
Output:
Now we compare the lengths of the two data frames to find out how many rows had at least one null value.
print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value: ", (len(data)-len(new_data)))
Output:
Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value: 236
Since the difference is 236, there were 236 rows with at least one null value in some column.
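The same count can be obtained without building a second data frame, by asking isnull() which rows contain any missing value. The sketch below uses a small hypothetical stand-in for employees.csv, so its numbers do not match the 236 reported above.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv (made-up rows)
data = pd.DataFrame({"Gender": ["Male", None, "Female", None, "Male"],
                     "Team": ["Marketing", "Finance", None, None, "Sales"]})

# Approach from the article: compare lengths before and after dropna()
new_data = data.dropna(axis=0, how="any")
rows_dropped = len(data) - len(new_data)

# Direct count: rows where at least one column is null
rows_with_na = int(data.isnull().any(axis=1).sum())

print(rows_dropped, rows_with_na)
```

Both approaches count a row once, however many of its columns are missing.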