Applying Operations to Groups in Pandas

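Pandas is a Python library for data analysis and data manipulation. Data analysis often requires splitting the data into groups and performing operations on each group. GroupBy in Pandas follows the split-apply-combine strategy: it splits an object into groups, applies a function to each group, and combines the results. In this article we use the groupby() function to perform various operations on grouped data.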
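
As a minimal sketch of what split-apply-combine means, the three steps can be written out by hand and compared with a single groupby() call. The toy frame and its column names (team, score) are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, small enough to check by hand
df = pd.DataFrame({"team": ["a", "b", "a", "b"],
                   "score": [10, 20, 30, 40]})

# Split: iterating over a GroupBy object yields (key, sub-frame) pairs
pieces = {key: part for key, part in df.groupby("team")}

# Apply: compute one statistic for each piece
means = {key: part["score"].mean() for key, part in pieces.items()}

# Combine: assemble the per-group results into a single Series
manual = pd.Series(means, name="score").rename_axis("team")

# groupby() performs the same three steps in one expression
built_in = df.groupby("team")["score"].mean()

print(manual)
print(built_in)
```

Both results hold one mean per team, which is why the manual version is rarely needed in practice.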

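Aggregation

Aggregation builds a statistical summary of the data using statistics such as the mean, median, mode, min (minimum), max (maximum), std (standard deviation), var (variance), sum, and count. To aggregate over groups:

Import the module
Create or load the data
Create a GroupBy object that groups the data along one key or several keys
Apply a statistical operation.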
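
The four steps above can be sketched on a small hand-written frame, grouping once by a single key and once by two keys. The values are hypothetical and chosen only so the means are easy to verify.

```python
# Step 1: import the module
import pandas as pd

# Step 2: create the data (hypothetical values)
df = pd.DataFrame({"dept": ["IT", "IT", "HR", "HR"],
                   "gender": ["F", "M", "F", "F"],
                   "salary": [3000, 4000, 5000, 1000]})

# Step 3: create GroupBy objects, keyed on one column or on several
one_key = df.groupby("gender")
two_keys = df.groupby(["dept", "gender"])

# Step 4: apply a statistical operation to every group
by_gender = one_key["salary"].mean()
by_both = two_keys["salary"].mean()

print(by_gender)
print(by_both)
```

Grouping by two keys returns a result indexed by (dept, gender) pairs, and only the combinations that occur in the data appear.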

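Example 1: Compute the mean salary and age for the male and female groups. The result holds the mean of each numeric column, with a prefix added to the column names.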

# Import required libraries
import pandas as pd
import numpy as np
 
# Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50),
                   "gender": np.random.choice(["F", "M"], size=50),
                   "age": np.random.randint(22, 60, size=50),
                   "salary": np.random.randint(20000, 90000, size=50)})
df.index.name = "emp_id"
 
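# Calculate mean data of gender groups; numeric_only=True skips the
# non-numeric "dept" column, which mean() would otherwise reject
df.groupby('gender').mean(numeric_only=True).add_prefix('mean_')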


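Example 2: Perform several aggregations at once with the aggregate function (DataFrameGroupBy.agg), which accepts a function, a string naming a function, or a list of these.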

# Import required libraries
import pandas as pd
import numpy as np
 
# Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50),
                   "gender": np.random.choice(["F", "M"], size=50),
                   "age": np.random.randint(22, 60, size=50),
                   "salary": np.random.randint(20000, 90000, size=50)})
df.index.name = "emp_id"
 
# Calculate min, max, mean and count of salaries
# in different departments for males and females
df.groupby(['dept', 'gender'])['salary'].agg(["min", "max", "mean", "count"])
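
Because agg also accepts functions, a list can mix string names with a custom callable. The following is a small sketch under that assumption; salary_range is a hypothetical helper written for this illustration, not part of Pandas.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"dept": ["IT", "IT", "HR", "HR"],
                   "salary": [3000, 5000, 4000, 4500]})


def salary_range(s):
    """Hypothetical helper: spread between the highest and lowest value."""
    return s.max() - s.min()


# Strings name built-in aggregations; a callable receives each group's Series
stats = df.groupby("dept")["salary"].agg(["min", "max", salary_range])
print(stats)
```

The column produced by the callable takes its name from the function, so a descriptive function name doubles as a column label.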


Example 3: Specify several columns and the aggregation to apply to each, as shown below.

# Import required libraries
import pandas as pd
import numpy as np
 
# Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50),
                   "gender": np.random.choice(["F", "M"], size=50),
                   "age": np.random.randint(22, 60, size=50),
                   "salary": np.random.randint(20000, 90000, size=50)})
df.index.name = "emp_id"
 
# Calculate mean salaries and min-max age of employees
# in different departments for gender groups
df.groupby(['dept', 'gender']).agg({'salary': 'mean', 'age': ['min', 'max']})
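
Passing a dictionary that maps a column to a list of functions yields a two-level column index. Where flat column names are preferred, Pandas' named aggregation syntax is one alternative. The sketch below uses hypothetical data and output names (mean_salary, min_age, max_age) chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"dept": ["IT", "IT", "HR", "HR"],
                   "age": [25, 35, 40, 50],
                   "salary": [3000, 5000, 4000, 4500]})

# Each keyword is an output column: name=(input column, aggregation)
flat = df.groupby("dept").agg(
    mean_salary=("salary", "mean"),
    min_age=("age", "min"),
    max_age=("age", "max"),
)
print(flat)
```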


Example 4: Display common summary statistics for each group with describe().

# Import required libraries
import pandas as pd
import numpy as np
 
# Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50),
                   "gender": np.random.choice(["F", "M"], size=50),
                   "age": np.random.randint(22, 60, size=50),
                   "salary": np.random.randint(20000, 90000, size=50)})
df.index.name = "emp_id"
 
# Statistics of employee age grouped by departments
df["age"].groupby(df['dept']).describe()


Creating bins or groups and applying operations

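The Pandas cut() function sorts numeric values into bin intervals, which form groups or categories. Aggregations or other functions can then be applied to these groups, as shown below.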
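
As a quick sketch of how cut() assigns values to bins, the snippet below uses a handful of hand-picked ages (hypothetical values). With the default right=True, each interval includes its right edge, so 30 falls into the first bin and 45 into the second.

```python
import pandas as pd

# Hypothetical ages, including values that sit exactly on a bin edge
ages = pd.Series([22, 30, 31, 45, 59])

# Intervals are (20, 30], (30, 45], (45, 60] because right=True by default
groups = pd.cut(ages, bins=[20, 30, 45, 60],
                labels=["Young", "Middle", "Old"])
print(groups.tolist())

# Count the observations per category, keeping the category order
counts = groups.value_counts(sort=False)
print(counts)
```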

Example: Ages are divided into age ranges, and the number of observations in each range of the sample data is counted.

# Import required libraries
import pandas as pd
import numpy as np
 
# Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50),
                   "gender": np.random.choice(["F", "M"], size=50),
                   "age": np.random.randint(22, 60, size=50),
                   "salary": np.random.randint(20000, 90000, size=50)})
df.index.name = "emp_id"
 
# Create bin intervals
bins = [20, 30, 45, 60]
 
# Segregate ages into bins of age groups
df['categories'] = pd.cut(df['age'], bins,
                          labels=['Young', 'Middle', 'Old'])
 
# Calculate number of observations in each age category;
# observed=False keeps categories that have no rows (count 0)
df['age'].groupby(df['categories'], observed=False).count()


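Transformation

A transformation performs a group-specific operation that changes the individual values while the shape of the data stays the same. The transform() function is used for this purpose.

Example: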

# Import required libraries
import pandas as pd
import numpy as np
 
# Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50),
                   "gender": np.random.choice(["F", "M"], size=50),
                   "age": np.random.randint(22, 60, size=50),
                   "salary": np.random.randint(20000, 90000, size=50)})
df.index.name = "emp_id"
 
# Calculate mean difference by transforming each salary value
df['mean_sal_diff'] = df['salary'].groupby(
    df['dept']).transform(lambda x: x - x.mean())
df.head()
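
To make the "shape stays the same" property concrete, the following sketch compares an aggregation with a transformation on the same groups. The four-row frame is hypothetical and exists only so the numbers can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"dept": ["IT", "IT", "HR", "HR"],
                   "salary": [3000, 5000, 4000, 4500]})

by_dept = df.groupby("dept")["salary"]

# Aggregation collapses each group to a single value: one row per group
agg_result = by_dept.mean()

# Transformation returns one value per original row, on the original index
centered = by_dept.transform(lambda x: x - x.mean())

print(len(agg_result), len(centered))
print(centered.tolist())
```

Because the transformed result is aligned with the original index, it can be assigned straight back to the frame as a new column.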