How to Handle Missing Values of Categorical Variables in Python

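Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. We often come across datasets in which some columns have missing values. This causes problems when we apply a machine learning model to the dataset, and it increases the chance of errors while training the model.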

We load the dataset as follows.

# import modules
import pandas as pd
import numpy as np

# load the dataset; the file has no header row,
# so columns are labelled 0, 1, 2, ...
df = pd.read_csv("train.csv", header=None)

# show the first five rows
df.head()

Count the missing data. In this dataset a value of 0 is treated as the marker for a missing entry.

# count the entries equal to 0 (treated as missing)
# in each of the columns 1 to 8
cnt_missing = (df[[1, 2, 3, 4,
                   5, 6, 7, 8]] == 0).sum()
print(cnt_missing)

We see that columns 1, 2, 3, 4 and 5 contain missing data. Now we replace all 0 values in those columns with NaN.

from numpy import nan

# mark the 0 placeholders in columns 1 to 5 as NaN
df[[1, 2, 3, 4, 5]] = df[[1, 2, 3, 4, 5]].replace(0, nan)
df.head(10)
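
Once the placeholders are NaN, pandas can count them directly. The following is a minimal sketch on a made-up frame (the name toy and its values are hypothetical, not taken from train.csv) showing the replace-then-count pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 0 stands in for "missing", as in the article.
toy = pd.DataFrame({
    1: [0, 85, 0, 89],
    2: [72, 0, 64, 66],
    3: [35, 29, 0, 23],
})

# Replace the 0 placeholders with NaN, then count NaN per column.
toy_nan = toy.replace(0, np.nan)
missing_per_column = toy_nan.isna().sum()
print(missing_per_column)
```

Counting with isna() after the replacement avoids having to remember which sentinel value marked a gap in the raw file.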

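Handling missing data is important, so we address the problem with the following methods:

Step #1

The first method is simply to drop the rows that contain missing data.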

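# print the initial shape
print(df.shape)

# drop every row that contains at least one NaN;
# the result goes into a new DataFrame so that df
# keeps its missing values for the next steps
df_dropped = df.dropna()

# final shape of the data with
# missing rows removed
print(df_dropped.shape)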

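The drawback of this approach is that with a small dataset, dropping the rows with missing data can shrink it drastically, and machine learning models rarely perform well on very small datasets.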
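
To make the cost concrete, here is a small sketch that measures how many rows a dropna() call would discard. The frame toy and its column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each of the first three rows has one NaN.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "c": [1.0, 2.0, np.nan, 4.0, 5.0],
})

rows_before = len(toy)
rows_after = len(toy.dropna())      # rows with no NaN at all
fraction_lost = 1 - rows_after / rows_before
print(rows_before, rows_after, fraction_lost)
```

Only three cells are missing here, yet three of the five rows are discarded, because a single NaN is enough to remove a whole row.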

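To avoid this problem we turn to a second approach: imputing the missing values. We do this by replacing each missing value with a random value, or with the median or mean of the rest of the data.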
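
The random-value option is not shown elsewhere in this article. One possible reading of it, offered only as an assumption, is to draw each replacement at random from the values actually observed in the same column. The names s, observed and filled below are hypothetical:

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is reproducible.
rng = np.random.default_rng(0)

# Hypothetical column with two missing entries.
s = pd.Series([4.0, np.nan, 7.0, np.nan, 5.0])

# Assumption: "random value" means a draw from the observed values.
observed = s.dropna().to_numpy()
mask = s.isna()

filled = s.copy()
filled[mask] = rng.choice(observed, size=int(mask.sum()))
print(filled)
```

Sampling from observed values keeps every imputed entry inside the column's real range, which arbitrary random numbers would not guarantee.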

Step #2

We start by imputing the missing values with the mean of each column.

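# fill missing values with the mean of each numeric column;
# the result goes into a new DataFrame so that df
# keeps its missing values for the next steps
df_mean = df.fillna(df.mean(numeric_only=True))
df_mean.sample(10)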

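We can also do this with the SimpleImputer class. SimpleImputer is a scikit-learn class that helps handle missing data in datasets used for predictive modeling. An imputer is created by calling the SimpleImputer() constructor, which takes the following parameters.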

SimpleImputer(missing_values, strategy, fill_value)

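missing_values : the placeholder that marks a missing value to be imputed. Defaults to NaN.
strategy : the rule that determines what replaces the missing values. It can be 'mean' (default), 'median', 'most_frequent' or 'constant'.
fill_value : the constant used to replace missing values when strategy='constant'.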
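
The 'constant' strategy is not demonstrated in the steps of this article, so here is a minimal sketch of how the three parameters fit together on a categorical column. The column name color and the label "unknown" are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries.
colors = pd.DataFrame({"color": ["red", np.nan, "blue", "red", np.nan]})

# Replace every NaN with the fixed label "unknown".
imputer = SimpleImputer(missing_values=np.nan,
                        strategy="constant",
                        fill_value="unknown")

# fit_transform returns a NumPy array of shape (n_rows, n_columns).
filled = imputer.fit_transform(colors)
print(filled.ravel())
```

A dedicated label such as "unknown" keeps the fact that a value was missing visible to the model instead of hiding it behind an existing category.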

# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer

# underlying NumPy array of the DataFrame
value = df.values

# define the imputer: replace NaN with the column mean
imputer = SimpleImputer(missing_values=nan,
                        strategy='mean')

# fit on the data and transform it in one call
transformed_values = imputer.fit_transform(value)

# count the total number of NaN values left in the array
print("Missing:", isnan(transformed_values).sum())

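Step #3

Next we impute the missing values with the median of each column. The median is the middle value of a set of data. To find the median of a list of numbers, first sort the numbers in ascending order; the median is then the middle element, or the average of the two middle elements when the count is even.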

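# fill missing values with the median of each numeric column;
# the result goes into a new DataFrame so that df
# keeps its missing values for the next steps
df_median = df.fillna(df.median(numeric_only=True))
df_median.head(10)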
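
To see why one might prefer the median over the mean, the sketch below fills the same gap both ways in a column that contains an outlier. The series s and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical column: one missing value and one outlier (100).
s = pd.Series([2.0, 3.0, np.nan, 4.0, 100.0])

# Observed values sorted: 2, 3, 4, 100
# mean   = (2 + 3 + 4 + 100) / 4 = 27.25
# median = (3 + 4) / 2           = 3.5
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

print(mean_filled[2], median_filled[2])
```

The outlier pulls the mean far away from the typical values, while the median stays close to them.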

We can also do this with the SimpleImputer class.

# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer

# underlying NumPy array of the DataFrame
value = df.values

# define the imputer: replace NaN with the column median
imputer = SimpleImputer(missing_values=nan,
                        strategy='median')

# fit on the data and transform it in one call
transformed_values = imputer.fit_transform(value)

# count the total number of NaN values left in the array
print("Missing:", isnan(transformed_values).sum())

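Step #4

Finally we impute the missing values with the mode of each column. The mode is the value that occurs most often in a set of observations. For example, the mode of {6, 3, 9, 6, 6, 5, 9, 3} is 6, because it appears most frequently. Unlike the mean and the median, the mode is also defined for non-numeric data, which makes it the natural choice for categorical variables.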

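# fill missing values with the mode of each column;
# df.mode() returns a DataFrame (one row per tied mode),
# so take its first row to get one value per column
df_mode = df.fillna(df.mode().iloc[0])
df_mode.sample(10)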
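
The worked example from the text, followed by mode imputation on a categorical column, can be sketched as below. The series city and its labels are hypothetical:

```python
import numpy as np
import pandas as pd

# The set from the text: 6 appears three times, 3 and 9 twice, 5 once.
values = pd.Series([6, 3, 9, 6, 6, 5, 9, 3])
most_common = values.mode().iloc[0]

# Hypothetical categorical column with two missing entries.
city = pd.Series(["Paris", np.nan, "Rome", "Paris", np.nan])

# mode() ignores NaN by default; iloc[0] picks the first mode.
city_filled = city.fillna(city.mode().iloc[0])
print(most_common, city_filled.tolist())
```

Taking iloc[0] matters because mode() returns every tied value; passing the whole result to fillna would align it by index instead of filling each gap.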

We can also do this with the SimpleImputer class.

# import modules
from numpy import isnan, nan
from sklearn.impute import SimpleImputer

# underlying NumPy array of the DataFrame
value = df.values

# define the imputer: replace NaN with the most frequent value
imputer = SimpleImputer(missing_values=nan,
                        strategy='most_frequent')

# fit on the data and transform it in one call
transformed_values = imputer.fit_transform(value)

# count the total number of NaN values left in the array
print("Missing:", isnan(transformed_values).sum())
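
Because fit_transform returns a plain NumPy array, the column labels are lost. A minimal sketch of putting the imputed values back into a DataFrame, here with purely categorical columns, might look like this. The frame toy and its column names are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical data with one gap per column.
toy = pd.DataFrame({
    "grade": ["A", "B", np.nan, "A"],
    "size": ["S", np.nan, "M", "M"],
})

# most_frequent works on string columns as well as numeric ones.
imputer = SimpleImputer(strategy="most_frequent")

# Rebuild a DataFrame so the column labels and index survive.
restored = pd.DataFrame(imputer.fit_transform(toy),
                        columns=toy.columns,
                        index=toy.index)
print(restored)
```

Reusing toy.columns and toy.index keeps the result aligned with the original frame, which makes it easy to compare the two side by side.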
