Pandas groupby Aggregation Does Not Always Reduce Dimensions

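In data analysis we often need to split data into groups and compute statistics for each group. Pandas provides a convenient groupby feature for aggregating data after grouping. Commonly used aggregation functions include sum, mean, count, max, and min. Sometimes, however, the result of a groupby aggregation still does not look reduced to the flat shape we expected, even though an aggregation function was applied. This article explains why that happens.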

The groupby function in Pandas

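The groupby function in Pandas is a powerful and widely used tool for data aggregation. It splits a DataFrame into groups and lets us apply operations to each group. A typical use of groupby looks like this: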

import pandas as pd

# Read the data
data = pd.read_csv('data.csv')

# Group by one column and sum another column within each group
result = data.groupby('column_name')['other_column'].sum()

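In the code above, we first read the data with Pandas, then group it by the specified column with groupby, and finally aggregate the selected column of each group with a function such as sum or mean. This gives us one summary value per group.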
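Since `data.csv` is not included with this article, here is a minimal sketch of the same pattern on a small made-up DataFrame. The column names `city` and `sales` and their values are assumptions chosen only for illustration. It shows that selecting one column and aggregating it returns a Series with one entry per group.

```python
import pandas as pd

# Hypothetical sample data; 'city' and 'sales' are illustrative names
data = pd.DataFrame({
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shanghai', 'Beijing'],
    'sales': [10, 20, 30, 40, 50],
})

# Group by one column, select another column, and sum it per group
result = data.groupby('city')['sales'].sum()

print(type(result).__name__)  # Series
print(result.index.name)      # city
print(result['Beijing'])      # 90
print(result['Shanghai'])     # 60
```

The group key `city` ends up as the index of the result rather than as a regular column.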

Sometimes, however, the result still does not look reduced even after aggregation. Why is that?

The agg function of groupby

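Calling agg on a groupby object applies a chosen computation to each group and returns the results. agg accepts a function, a function name, a list of these, or a dict that maps column names to functions. The results are returned as a new Series or DataFrame; the original DataFrame is not modified. Commonly used functions include sum, mean, count, max, and min.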

import pandas as pd

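# Read the data
data = pd.read_csv('data.csv')

# Group by one column and sum another column within each group.
# The string name 'sum' is used here because recent pandas versions
# warn when the Python builtin sum is passed to agg.
result = data.groupby('column_name')['other_column'].agg('sum')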

In the code above, we group the data with groupby and then use agg to sum the selected column within each group.
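As a rough illustration of how the argument passed to agg affects the result, the sketch below uses a small made-up DataFrame (the names `team` and `score` are assumptions, not from the original data). It compares a single function name with a list of function names and also checks that the original DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical sample data for illustration only
data = pd.DataFrame({
    'team': ['a', 'b', 'a', 'b'],
    'score': [1, 2, 3, 4],
})

grouped = data.groupby('team')['score']

# A single function name gives a Series: one value per group
single = grouped.agg('sum')

# A list of function names gives a DataFrame: one column per function
multi = grouped.agg(['sum', 'mean'])

print(type(single).__name__)  # Series
print(type(multi).__name__)   # DataFrame
print(list(multi.columns))    # ['sum', 'mean']
print(list(data.columns))     # ['team', 'score']
```

In both cases the group key `team` becomes the index of the result.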

When agg does not reduce dimensions

In practice, the result of agg sometimes does not look reduced. What is the reason?

Let's start with a simple example:

import pandas as pd

# Build a sample DataFrame
data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': [1, 2, 3, 4, 5, 6, 7, 8],
                     'D': [5, 4, 3, 2, 1, 0, 9, 8]})

# Group by A and B, then sum column C and average column D
result = data.groupby(['A', 'B']).agg({'C': 'sum', 'D': 'mean'})

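In this example, we group data by two columns and apply a different operation to C and to D within each group. The result may not match our expectation, though: it is a DataFrame with a multi-level index (a MultiIndex) rather than a flat table, so it does not look reduced.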
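To see what came back, it helps to inspect the type, index, and shape of the result. The sketch below rebuilds the same sample data and prints those properties.

```python
import pandas as pd

data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': [1, 2, 3, 4, 5, 6, 7, 8],
                     'D': [5, 4, 3, 2, 1, 0, 9, 8]})

result = data.groupby(['A', 'B']).agg({'C': 'sum', 'D': 'mean'})

print(type(result).__name__)                    # DataFrame
print(isinstance(result.index, pd.MultiIndex))  # True
print(list(result.index.names))                 # ['A', 'B']
print(list(result.columns))                     # ['C', 'D']
print(result.shape)                             # (6, 2)
```

The 8 input rows collapse into 6 rows, one per (A, B) combination, and the two group keys live in the index rather than in the columns.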

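So why does this happen?

Let's look at the docstring of agg. Here it is looked up on result, which is a DataFrame, so this is the docstring of DataFrame.agg: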

Signature: result.agg(func=None, axis=0, *args, **kwargs)
Docstring:
Aggregate using one or more operations over the specified axis.

Parameters
----------
func : function, str, list or dict
    Function to use for aggregating the data. If a function, must either
    work when passed a DataFrame or when passed to DataFrame.apply.

    Accepted combinations are:

    - function
    - string function name
    - list of functions and/or function names, e.g. ``[np.sum, 'mean']``
    - dict of axis labels -> functions, function names or list of such.
axis : {0 or 'index', 1 or 'columns'}, default 0
    If 0 or 'index': apply function to each column.
    If 1 or 'columns': apply function to each row.
    Note: axis=1 is equivalent to (but faster than) transposing and
    using axis=0.
*args
    Positional arguments to pass to `func`.
**kwargs
    Keyword arguments to pass to `func`.

Returns
-------
scalar, Series or DataFrame

    The return can be:

    * scalar : when Series.agg is called with single function
    * Series : when DataFrame.agg is called with a single function
    * DataFrame : when DataFrame.agg is called with several functions

See Also
--------
DataFrame.apply : Another built-in DataFrame method; apply is typically
                used with user-defined functions.

Examples
--------
>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [4, 5, 6],
...                    'C': [7, 8, 9]})
>>> df.agg(['sum', 'min'])
     A   B   C
sum  6  15  24
min  1   4   7
>>> df.agg({'A': ['sum', 'min'], 'B': 'min'})
     A    B
sum  6  NaN
min  1  4.0

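Note that this docstring describes DataFrame.agg, not the agg method of a groupby object, so its return-type rules do not directly explain our example. For a grouped aggregation, the return type depends on what was selected before aggregating and on what was passed to agg. A single selected column aggregated with one function returns a Series; a grouped DataFrame aggregated with a dict returns a DataFrame with one column per dict key.

In the example above we passed a dict, so the result is a DataFrame with the columns C and D. The aggregation did reduce the data: the 8 original rows became 6 rows, one per group. What looks like a missing reduction is the index. The grouping keys A and B were moved out of the columns and into a two-level MultiIndex, so the result is displayed hierarchically instead of as a flat table.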
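The following sketch, on a small made-up DataFrame, compares the type and the number of index levels for three grouped aggregations. It is meant only to illustrate that the number of index levels follows the number of grouping keys, whether the result is a Series or a DataFrame.

```python
import pandas as pd

# Small illustrative data set
data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                     'B': ['one', 'one', 'two', 'two'],
                     'C': [1, 2, 3, 4]})

# One key, one selected column: Series with a flat index
s1 = data.groupby('A')['C'].sum()

# Two keys, one selected column: Series with a two-level MultiIndex
s2 = data.groupby(['A', 'B'])['C'].sum()

# Two keys, dict passed to agg: DataFrame with a two-level MultiIndex
d2 = data.groupby(['A', 'B']).agg({'C': 'sum'})

print(type(s1).__name__, s1.index.nlevels)  # Series 1
print(type(s2).__name__, s2.index.nlevels)  # Series 2
print(type(d2).__name__, d2.index.nlevels)  # DataFrame 2
```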

In practice, we can reset the index to flatten the result into the shape we want.

import pandas as pd

# Build a sample DataFrame
data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': [1, 2, 3, 4, 5, 6, 7, 8],
                     'D': [5, 4, 3, 2, 1, 0, 9, 8]})

# Group by A and B, then sum column C and average column D
result = data.groupby(['A', 'B']).agg({'C': 'sum', 'D': 'mean'})

# Reset the index so that A and B become ordinary columns again
result = result.reset_index()

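In the code above, we call reset_index() on the result of agg. This moves the grouping keys A and B from the index back into regular columns and gives the flat DataFrame we wanted.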
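An alternative worth knowing is the `as_index=False` parameter of groupby, which keeps the grouping keys as columns from the start. The sketch below uses the same sample data and builds the flat table both ways; it is an illustration, not the only way to flatten the result.

```python
import pandas as pd

data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C': [1, 2, 3, 4, 5, 6, 7, 8],
                     'D': [5, 4, 3, 2, 1, 0, 9, 8]})

# Option 1: aggregate, then move the group keys out of the index
via_reset = data.groupby(['A', 'B']).agg({'C': 'sum', 'D': 'mean'}).reset_index()

# Option 2: ask groupby not to put the group keys into the index
flat = data.groupby(['A', 'B'], as_index=False).agg({'C': 'sum', 'D': 'mean'})

print(list(via_reset.columns))                # ['A', 'B', 'C', 'D']
print(list(flat.columns))                     # ['A', 'B', 'C', 'D']
print(isinstance(flat.index, pd.MultiIndex))  # False
```

Either way, the result has one row per group and a plain integer index.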

Summary

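Pandas provides a powerful groupby function that quickly splits a DataFrame into groups and aggregates each group. In practice, agg is a flexible way to perform the aggregation. If the result of agg still does not look reduced, check its type and its index: the grouping keys have most likely become a multi-level index, and calling reset_index() turns them back into columns and flattens the result.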