Filling Missing Values with groupby in Pandas

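This article covers group-wise filling with the Pandas groupby function: filling each missing value with data taken from the group its row belongs to. Suppose we have a dataset that holds important data but has missing values; those gaps need to be filled before the data can be used.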

The following code generates the sample data: 4 columns and 8 rows, where A and B hold group labels and C and D hold random numbers.

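import pandas as pd
import numpy as np

# Generate the sample data
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
# Blank out C and D wherever C is positive. Masking whole rows with
# df[df.C > 0] = np.nan would also erase the group labels in A and B,
# leaving nothing to group by.
df.loc[df['C'] > 0, ['C', 'D']] = np.nan
df

The values are random, so the exact output differs from run to run. The frame has 8 rows (index 0 to 7). In every row where C was positive, C and D are now NaN, while A and B keep their labels so the rows can still be grouped.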

The data now contains missing values. Next we fill them using groupby.


Filling missing values with groupby

In Pandas, groupby splits the data into groups of related rows. Here we group the dataset by the labels in column A:

grouped = df.groupby('A')
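To make the split concrete, here is a minimal sketch on a small hand-written frame. The name demo and its values are made up for illustration and are not part of the article's dataset. It also shows that rows whose key is NaN are left out of every group by default, which is why the group labels must stay intact.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: one row has a missing group label.
demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', np.nan],
                     'C': [1.0, 2.0, np.nan, 4.0]})

# size() counts the rows in each group; the NaN key is dropped by default.
sizes = demo.groupby('A').size()
print(sizes)

# dropna=False (available in pandas 1.1 and later) keeps NaN as its own group.
sizes_with_nan = demo.groupby('A', dropna=False).size()
print(sizes_with_nan)
```

With the default settings only the foo and bar groups exist, so a row without a label cannot receive a group statistic later on.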

With this grouped object we can compute a per-group statistic and pass it to fillna to fill the missing values:

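# Assign the result back; calling fillna(..., inplace=True) on df['C'] is a
# chained operation that may not update df in recent pandas versions.
df['C'] = df['C'].fillna(grouped['C'].transform('mean'))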

The code above fills each missing value in C with the mean of its group.

The concrete numbers depend on the random draw, so here is what to expect rather than a fixed printout. Each NaN in C whose row has a group label in A is replaced by the mean of the observed C values in that group. A group with no observed C value has nothing to average, so its rows stay NaN, and the same holds for a row whose A label is missing. Column D is not touched, because only C was filled.
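Because the article's data is random, the mean fill is easier to verify on fixed numbers. The sketch below uses a hypothetical frame called demo with hand-picked values (not the article's data) so that each group mean can be checked by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical hand-picked values, chosen so the group means are easy to check.
demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'baz'],
    'C': [1.0, 10.0, np.nan, np.nan, 3.0, np.nan],
})

# transform('mean') returns one value per row: the mean of that row's group,
# computed from the non-missing entries only.
group_mean = demo.groupby('A')['C'].transform('mean')

# Fill only the gaps; observed values are left untouched.
demo['C'] = demo['C'].fillna(group_mean)
print(demo)
```

Row 2 (foo) receives 2.0, the mean of 1.0 and 3.0, and row 3 (bar) receives 10.0. The single baz row stays NaN because its group has no observed value to average.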

Now for a more complex example. This time we generate sample data with more groups and more missing values:

# Generate another sample dataset
np.random.seed(0)
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s

0         NaN
1    0.400157
2         NaN
3    2.240893
4         NaN
5   -0.977278
dtype: float64

Next, we build a two-dimensional dataset and set values in two of its columns to missing:

# Build a two-dimensional dataset
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
# Blank out only C and D so the group labels in A and B survive.
df.loc[df['C'] > 0, ['C', 'D']] = np.nan
df

The frame again has 8 rows (index 0 to 7). Rows where C was positive now have NaN in C and D, and the labels in A and B are intact.

Now we fill the gaps with groupby and fillna. Suppose we want to replace each missing value with the maximum of its group; we can combine transform with max:

df = df.fillna(df.groupby(['A', 'B'])[['C', 'D']].transform('max'))

Each missing value in C and D is now replaced by the maximum observed in its (A, B) group, wherever the group has one. Several (A, B) combinations in this dataset occur only once, for example ('bar', 'one') and ('bar', 'three'). If such a row is missing a value, there is nothing else in the group to take a maximum from, so it stays NaN.
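The two-key maximum fill can be checked on fixed numbers as well. This is a minimal sketch on a hypothetical frame (the names demo, group_max and filled, and all values, are assumptions for illustration), including one single-row group to show the case that cannot be filled.

```python
import numpy as np
import pandas as pd

# Hypothetical hand-picked values; ('bar', 'three') is a single-row group.
demo = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'one', 'two', 'two', 'three'],
    'C': [1.0, 3.0, np.nan, np.nan, 5.0, np.nan],
    'D': [np.nan, 4.0, 2.0, 6.0, np.nan, np.nan],
})

# Per-row maximum of each (A, B) group, computed for C and D separately.
group_max = demo.groupby(['A', 'B'])[['C', 'D']].transform('max')

# fillna with a DataFrame aligns on index and column labels,
# so each gap is filled from the matching cell of group_max.
filled = demo.fillna(group_max)
print(filled)
```

In the ('foo', 'one') group the missing C becomes 3.0 and the missing D becomes 4.0; in ('bar', 'two') they become 5.0 and 6.0. The lone ('bar', 'three') row has no observed values, so both of its cells remain NaN.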

Summary

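This article showed how to fill missing values in Pandas with groupby. Grouping splits the dataset by its group labels, and transform produces a per-row group statistic such as the mean or maximum that fillna can use for the gaps inside each group. On large and small datasets alike, this gives fill values that reflect each group instead of the dataset as a whole, provided the group labels are present and each group has at least one observed value.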