Grouping Categorical Variables in a Pandas DataFrame

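First, we need to understand what a categorical variable is in pandas. Categorical is a data type in Python's pandas library. A categorical variable takes only a limited, and usually fixed, set of possible values. Examples include gender, blood type, and language. The main difference from numeric variables is that mathematical operations cannot be performed on categorical values.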
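To make the idea concrete, here is a minimal sketch of a categorical Series. The blood-type values and the variable names are made up for illustration; the point is only to show the fixed set of categories, the integer codes pandas keeps for them, and that a numeric reduction such as sum() is rejected.

```python
import pandas as pd

# Hypothetical example data: blood types stored as a categorical Series.
blood = pd.Series(["A", "B", "O", "A", "AB", "O"], dtype="category")

# The fixed set of values the Series may contain (sorted by default).
categories = list(blood.cat.categories)
# The integer code stored for each row, pointing into the categories.
codes = blood.cat.codes.tolist()

print(blood.dtype)
print(categories)
print(codes)

# Numeric reductions are not defined for categorical data.
sum_failed = False
try:
    blood.sum()
except TypeError as exc:
    sum_failed = True
    print("TypeError:", exc)
```

Each code is simply the position of the row's value in the category list, which is why categorical columns can be cheaper to store than plain strings.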

In pandas, a DataFrame made up of categorical values can be created with the DataFrame constructor by specifying dtype="category".

# importing pandas as pd 
import pandas as pd 
  
# Create the dataframe 
# with categorical variable 
df = pd.DataFrame({'A': ['a', 'b', 'c',
                         'c', 'a', 'b'],
                   'B': [0, 1, 1, 0, 1, 0]},
                  dtype = "category")
# show the data types
print(df.dtypes)

Output:

A    category
B    category
dtype: object

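An important point here is that the categories are not shared between columns. The conversion is done column by column, so column A and column B each get their own set of categories.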
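The per-column conversion can be checked with the .cat accessor. The sketch below rebuilds the same DataFrame as above and reads the categories of each column; the names cats_a and cats_b are only illustrative.

```python
import pandas as pd

# Same data as in the example above, converted to category column by column.
df = pd.DataFrame({'A': ['a', 'b', 'c', 'c', 'a', 'b'],
                   'B': [0, 1, 1, 0, 1, 0]},
                  dtype="category")

# Each column carries its own, independent set of categories.
cats_a = df['A'].cat.categories.tolist()
cats_b = df['B'].cat.categories.tolist()

print(cats_a)
print(cats_b)
```

Column A ends up with the string categories and column B with the integer ones; neither column knows about the other's values.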


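In many tasks we need to group categorical data. This can be done with the groupby() method in pandas. When grouping on categorical columns, the result can contain every combination of the categories in the groupby columns. groupby() has to be followed by an aggregation function that specifies how the values in each group are summarized. Common aggregation functions include mean(), sum(), and count().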
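As a rough illustration of the aggregation functions just mentioned, the sketch below groups a small, made-up DataFrame by a categorical column and applies mean(), sum(), and count(). The column names kind and value are assumptions for this example, and observed=True is passed explicitly so the result does not depend on the pandas version's default.

```python
import pandas as pd

# Hypothetical data: a categorical key and a numeric value column.
df = pd.DataFrame({'kind': pd.Categorical(['x', 'y', 'x', 'y', 'x']),
                   'value': [10, 20, 30, 40, 50]})

# Group the numeric column by the categorical key.
grouped = df.groupby('kind', observed=True)['value']

means = grouped.mean()    # average value per category
sums = grouped.sum()      # total value per category
counts = grouped.count()  # number of non-null rows per category

# Several aggregations can also be requested in one call.
summary = grouped.agg(['mean', 'sum', 'count'])
print(summary)
```

Calling agg() with a list is a convenient way to compare several summaries side by side without repeating the groupby.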

Now let's apply groupby() together with count().

# initial state
print(df)
  
# counting number of each category
print(df.groupby(['A']).count().reset_index())

Output:

DataFrame:

   A  B
0  a  0
1  b  1
2  c  1
3  c  0
4  a  1
5  b  0

Grouped by column 'A':

   A  B
0  a  2
1  b  2
2  c  2

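Now let's look at another example, this time using mean(). Here column A is converted to the category type while the other columns stay numeric, and the mean is computed for each combination of the values in columns A and B.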

# importing pandas as pd 
import pandas as pd 
  
# Create the dataframe 
df = pd.DataFrame({'A': ['a', 'b', 'c', 
                         'c', 'a', 'b'], 
                   'B': [0, 1, 1, 
                         0, 1, 0], 
                   'C':[7, 8, 9,
                        5, 3, 6]})
  
# change the data type of
# column 'A' to category
df['A'] = df['A'].astype('category')
  
# initial state
print(df)
  
# calculating mean with 
# all combinations of A and B
print(df.groupby(['A','B']).mean().reset_index())

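Output:

   A  B  C
0  a  0  7
1  b  1  8
2  c  1  9
3  c  0  5
4  a  1  3
5  b  0  6

   A  B    C
0  a  0  7.0
1  a  1  3.0
2  b  0  6.0
3  b  1  8.0
4  c  0  5.0
5  c  1  9.0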
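In the example above every combination of A and B happens to occur in the data. The sketch below uses different, made-up data in which some combinations are missing, to show how the observed parameter of groupby() controls whether unobserved category combinations appear in the result. The default value of observed has changed between pandas versions, so passing it explicitly is probably the safer choice.

```python
import pandas as pd

# Hypothetical data: category 'c' is declared for A but never occurs,
# and the combination ('b', 'y') is also absent.
df = pd.DataFrame({
    'A': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
    'B': pd.Categorical(['x', 'y', 'x']),
    'C': [1, 2, 3],
})

# observed=False: one row per combination of declared categories (3 x 2).
all_combos = df.groupby(['A', 'B'], observed=False)['C'].sum()

# observed=True: only combinations that actually occur in the data.
seen_only = df.groupby(['A', 'B'], observed=True)['C'].sum()

print(all_combos)
print(seen_only)
```

With observed=False the empty groups are still listed, and their sum is 0, which can be useful for reports that need a row for every category.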