Pandas: Scaling DataFrame Columns with sklearn

In this article, we show how to use sklearn to scale columns of a Pandas DataFrame so the data is better suited to further processing.

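What is data scaling?

In machine learning, we often need to scale data before training, because many algorithms, such as those based on distances or gradient descent, are sensitive to the scale of the input features. Scaling transforms the features so that they share a comparable range. Normalization and standardization are the two most common scaling methods.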

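Normalization

Normalization is a linear rescaling that maps every value into the range 0 to 1. Each value x becomes (x - min) / (max - min), where min and max are taken over the column. It is commonly used when features have different ranges.

For example, consider the following data: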

Name	Age
John	20
Jane	30
Mark	40

To normalize the Age values, we can use MinMaxScaler:

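import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Build the example DataFrame shown above
df = pd.DataFrame({'Name': ['John', 'Jane', 'Mark'], 'Age': [20, 30, 40]})

# Fit on the Age column and replace it with the scaled values
scaler = MinMaxScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])

print(df)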

The output is as follows:

Name	Age
John	0.0
Jane	0.5
Mark	1.0
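To make the formula concrete, here is a minimal sketch that applies (x - min) / (max - min) by hand and compares the result with MinMaxScaler. The variable names manual and scaled are illustrative only and do not come from the original example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Name": ["John", "Jane", "Mark"], "Age": [20, 30, 40]})

# Min-max formula applied by hand: (x - min) / (max - min)
age = df["Age"]
manual = (age - age.min()) / (age.max() - age.min())

# The same column scaled by scikit-learn, flattened to 1-D for comparison
scaled = MinMaxScaler().fit_transform(df[["Age"]]).ravel()

print(manual.tolist())  # [0.0, 0.5, 1.0]
print(scaled.tolist())
```

Both approaches should agree up to floating-point rounding. The scaler object is still worth using in practice because it remembers the fitted minimum and maximum and can apply them to new data with transform.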

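Standardization

Standardization rescales each column to zero mean and unit variance by subtracting the column mean and dividing by the column standard deviation. Like normalization, it is a linear transformation: it changes the location and scale of the data but not the shape of its distribution. It is commonly used when features are measured on different scales.

For example, suppose we have the following data: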

Name	Age	Weight
John	20	80
Jane	30	75
Mark	40	85

To standardize the Age and Weight values, we can use StandardScaler:

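import pandas as pd
from sklearn.preprocessing import StandardScaler

# Build the example DataFrame shown above
df = pd.DataFrame({'Name': ['John', 'Jane', 'Mark'],
                   'Age': [20, 30, 40], 'Weight': [80, 75, 85]})

# Standardize both numeric columns and write the result back
scaler = StandardScaler()
df[['Age', 'Weight']] = scaler.fit_transform(df[['Age', 'Weight']])

print(df)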

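The output is as follows. Both columns are evenly spaced around their means (30 for Age, 80 for Weight), so each contains the same three values in a different order:

Name	Age	Weight
John	-1.224745	0.000000
Jane	0.000000	-1.224745
Mark	1.224745	1.224745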
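One detail that often causes confusion when checking these numbers by hand: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' std() defaults to the sample standard deviation (ddof=1). The sketch below, with illustrative variable names, shows both variants on the same data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Age": [20, 30, 40], "Weight": [80, 75, 85]})

# z-score by hand with the population standard deviation (ddof=0),
# which matches what StandardScaler computes
manual = (df - df.mean()) / df.std(ddof=0)

# pandas' default std() uses ddof=1, which gives smaller magnitudes here
sample = (df - df.mean()) / df.std()

# scikit-learn result for comparison (a NumPy array, same column order)
scaled = StandardScaler().fit_transform(df)

print(manual.round(6))
print(sample.round(6))
```

With only three rows the difference is large (about 1.22 versus 1.0 for the extreme values); it shrinks as the number of rows grows.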

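Scaling data in a Pandas DataFrame

If the dataset is large, it is convenient to load and analyze it as a Pandas DataFrame. The sklearn scalers accept DataFrame columns directly, so we can scale selected columns and write the result back to the same DataFrame.

In the example below, we use Pandas to read the Iris dataset downloaded from Kaggle and scale the selected columns with MinMaxScaler.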

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Read dataset
df = pd.read_csv('Iris.csv')

# Select columns to scale
cols_to_scale = ['PetalLengthCm', 'PetalWidthCm']

# Create scaler
scaler = MinMaxScaler()

# Apply scaler to selected columns
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Print scaled dataframe
print(df)

The output is as follows (values rounded to three decimals). Only PetalLengthCm and PetalWidthCm are rescaled; the sepal columns keep their original values because they are not listed in cols_to_scale:

Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
1	5.1	3.5	0.068	0.042	Iris-setosa
2	4.9	3.0	0.068	0.042	Iris-setosa
3	4.7	3.2	0.051	0.042	Iris-setosa
4	4.6	3.1	0.085	0.042	Iris-setosa
5	5.0	3.6	0.068	0.042	Iris-setosa
6	5.4	3.9	0.119	0.125	Iris-setosa
7	4.6	3.4	0.068	0.083	Iris-setosa
8	5.0	3.4	0.085	0.042	Iris-setosa
9	4.4	2.9	0.068	0.042	Iris-setosa
10	4.9	3.1	0.085	0.000	Iris-setosa
…	…	…	…	…	…
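If Iris.csv is not at hand, the same steps can be tried with the copy of the Iris data bundled with sklearn. This is a sketch under that assumption: the column names below are the ones used by load_iris, not the Kaggle file's names, and the before variable exists only to check that unscaled columns are left alone.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

# Load the bundled Iris data as a DataFrame (features plus a 'target' column)
iris = load_iris(as_frame=True).frame

# Columns to scale, using load_iris naming rather than the Kaggle CSV naming
cols_to_scale = ["petal length (cm)", "petal width (cm)"]

# Keep a copy of one unscaled column to confirm it is not modified
before = iris["sepal length (cm)"].copy()

scaler = MinMaxScaler()
iris[cols_to_scale] = scaler.fit_transform(iris[cols_to_scale])

# After scaling, each selected column should span 0 to 1
print(iris[cols_to_scale].agg(["min", "max"]))
print(iris.head())
```

Because the fitted scaler keeps the per-column minimum and maximum, scaler.inverse_transform(iris[cols_to_scale]) can map the scaled values back to centimeters if the original units are needed later.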