Handling Missing Values in Pandas | 极客教程

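Pandas represents a missing value in its data structures with NaN. Missing values are common in data analysis, because an element is often simply undefined in a given data structure. This chapter covers the main ways to handle them, which helps avoid many problems later on. Pandas itself already works around NaN in some places: by default, its descriptive statistics (for example mean() and sum()) skip NaN values.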
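To make the point about descriptive statistics concrete, here is a minimal sketch. The Series values are made up for illustration; skipna is the pandas parameter that controls whether NaN is ignored, and it defaults to True.

```python
import numpy as np
import pandas as pd

# Illustrative data: three known values and one missing value
ser = pd.Series([1.0, 2.0, np.nan, 3.0])

mean_skipping_nan = ser.mean()            # NaN ignored: (1 + 2 + 3) / 3
mean_with_nan = ser.mean(skipna=False)    # NaN propagates into the result
non_missing_count = ser.count()           # counts only non-NaN elements

print(mean_skipping_nan)   # 2.0
print(mean_with_nan)       # nan
print(non_missing_count)   # 3
```

Skipping NaN silently is convenient, but it also means a statistic may rest on fewer observations than the length of the data suggests, which is one reason to handle missing values explicitly.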

Assigning NaN to an element

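Sometimes you need to set an element of a data structure to NaN explicitly. Use NumPy's np.nan for this. (The alias np.NaN was removed in NumPy 2.0, so prefer the lowercase spelling.) Assigning None to an element of a float Series also stores NaN, as the following example shows:

import pandas as pd
import numpy as np

# np.nan marks the missing value; the Series dtype becomes float64
ser = pd.Series([0, 1, 2, np.nan, 9], index=['red', 'blue', 'yellow', 'white', 'green'])
print(ser)
print('-------------')
# Assigning None is stored as NaN as well, so the output does not change
ser['white'] = None
print(ser)

The output is: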

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64
-------------
red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64
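Once NaN values are in a Series, it helps to be able to locate them. The sketch below is an addition to the example above: it assigns None to a different label ('red', chosen only for illustration) and uses isna() to find the missing entries. Note that NaN never compares equal to itself, so an equality test cannot be used to detect it.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2, np.nan, 9],
                index=['red', 'blue', 'yellow', 'white', 'green'])
ser['red'] = None                      # stored as NaN in a float64 Series

missing_mask = ser.isna()              # True where the element is NaN
missing_labels = list(ser[missing_mask].index)
missing_count = int(missing_mask.sum())
nan_equals_itself = bool(np.nan == np.nan)

print(missing_labels)      # ['red', 'white']
print(missing_count)       # 2
print(nan_equals_itself)   # False
```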

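Filtering out NaN

There are several ways to remove NaN values during data analysis. The most direct is the dropna() function, as in the following example:

import pandas as pd
import numpy as np

ser = pd.Series([0, 1, 2, np.nan, 9], index=['red', 'blue', 'yellow', 'white', 'green'])
# dropna() returns a new Series without the NaN entries; ser itself is unchanged
print(ser.dropna())

The output is: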

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

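Another approach is to use the notnull() function (an alias of notna()) as the selection condition, which filters the elements directly.

import pandas as pd
import numpy as np

ser = pd.Series([0, 1, 2, np.nan, 9], index=['red', 'blue', 'yellow', 'white', 'green'])
# notnull() returns a boolean mask that is False where the element is NaN
print(ser[ser.notnull()])

The output is: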

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64
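The two filtering approaches above give the same result. A small check, written as a sketch that uses notna() (the name notnull() is an alias of it):

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2, np.nan, 9],
                index=['red', 'blue', 'yellow', 'white', 'green'])

by_dropna = ser.dropna()            # method-based filtering
by_mask = ser[ser.notna()]          # boolean-mask filtering

# equals() compares values, index, and dtype
same_result = by_dropna.equals(by_mask)
print(same_result)   # True
```

The mask form is handy when the same condition has to be combined with other conditions, for example ser[ser.notna() & (ser > 0)].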

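A DataFrame is slightly more complicated to handle. By default, dropna() on a DataFrame removes every row that contains at least one NaN (pass axis=1 to apply the same rule to columns instead). In the example below every row contains a NaN, so the result is an empty DataFrame.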

import pandas as pd
import numpy as np

df = pd.DataFrame([[6,np.nan,6], [np.nan,np.nan,np.nan], [2,np.nan,5]],
                   index=['blue', 'green', 'red'],
                   columns=['ball', 'mug', 'pen'])
print(df)
print("--------------")
print(df.dropna())

The output is:

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0
--------------
Empty DataFrame
Columns: [ball, mug, pen]
Index: []

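To avoid dropping rows or columns that are only partly missing, set the how option to 'all'. This tells dropna() to drop only the rows or columns whose elements are all NaN.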

import pandas as pd
import numpy as np

df = pd.DataFrame([[6,np.nan,6], [np.nan,np.nan,np.nan], [2,np.nan,5]],
                   index=['blue', 'green', 'red'],
                   columns=['ball', 'mug', 'pen'])
print(df)
print("--------------")
print(df.dropna(how='all'))

The output is:

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0
--------------
      ball  mug  pen
blue   6.0  NaN  6.0
red    2.0  NaN  5.0
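The same options also work column-wise, and dropna() offers a threshold as a middle ground between 'any' and 'all'. The following sketch reuses the DataFrame from the example above; axis and thresh are parameters of DataFrame.dropna(), while the threshold value 2 is chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                  index=['blue', 'green', 'red'],
                  columns=['ball', 'mug', 'pen'])

# Drop columns (axis=1) in which every element is NaN: only 'mug' qualifies
without_empty_columns = df.dropna(axis=1, how='all')

# Keep only rows that have at least 2 non-NaN values: 'green' has none
rows_with_enough_data = df.dropna(thresh=2)

print(list(without_empty_columns.columns))   # ['ball', 'pen']
print(list(rows_with_enough_data.index))     # ['blue', 'red']
```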

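Filling NaN with other values

Dropping NaN elements can also discard other data that matters for the analysis, so rather than risk filtering them out, it is often better to replace NaN with another value. The fillna() function takes the replacement value as its argument; passing a single value replaces every NaN with that value. For example: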

import pandas as pd
import numpy as np

df = pd.DataFrame([[6,np.nan,6], [np.nan,np.nan,np.nan], [2,np.nan,5]],
                   index=['blue', 'green', 'red'],
                   columns=['ball', 'mug', 'pen'])
print(df)
print("--------------")
print(df.fillna(0))

The output is:

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0
--------------
       ball  mug  pen
blue    6.0  0.0  6.0
green   0.0  0.0  0.0
red     2.0  0.0  5.0

To replace NaN with a different value in each column, pass a dict that maps each column name to its replacement value.

import pandas as pd
import numpy as np

df = pd.DataFrame([[6,np.nan,6], [np.nan,np.nan,np.nan], [2,np.nan,5]],
                   index=['blue', 'green', 'red'],
                   columns=['ball', 'mug', 'pen'])
print(df)
print("--------------")
print(df.fillna({'ball':1,'mug':0,'pen':99}))

The output is:

       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0
--------------
       ball  mug   pen
blue    6.0  0.0   6.0
green   1.0  0.0  99.0
red     2.0  0.0   5.0
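The per-column replacement values do not have to be constants. One common choice, shown here as a sketch and not something the examples above prescribe, is to fill each column with its own mean. fillna() accepts a Series indexed by column name in the same way as a dict. The 'mug' column has no observed values, so it has no mean and is left as NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                  index=['blue', 'green', 'red'],
                  columns=['ball', 'mug', 'pen'])

# Column means computed from the non-NaN values; drop columns with no mean
column_means = df.mean().dropna()        # ball 4.0, pen 5.5

# Each NaN is replaced by the mean of its own column
filled = df.fillna(column_means)

print(filled.loc['green', 'ball'])   # 4.0
print(filled.loc['green', 'pen'])    # 5.5
```

Whether the mean, a constant, or some other value is appropriate depends on the data and on what the later analysis assumes.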