Pandas cut: From Continuous to Categorical
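Numeric data in data analysis is often continuous and highly skewed. Some analyses become much easier once the continuous data is converted to discrete categories. There are many ways to do this conversion, and one of them is pandas' built-in cut function, which converts continuous numeric data into categorical data. It has 3 main parts.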
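1. The first is the input, which must be a one-dimensional array or a single DataFrame column (a Series).
2. The second is the bins. The bins are the boundaries of the separate intervals that the continuous data is split into. Each pair of consecutive numbers defines one bin: the first number is where the bin starts and the next is where it ends. The cut function lets you define these edges explicitly.
3. The last is the labels. When labels are given, their number must be exactly one less than the number of bin edges, that is, one label per bin.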
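The relationship between bin edges and labels can be sketched as follows. The values and the names `scores`, `edges`, and `names` are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
scores = pd.Series([5, 20, 35, 50])

# Three edges define two bins: (0, 25] and (25, 50].
edges = [0, 25, 50]
names = ["low", "high"]  # one label per bin: len(edges) - 1

binned = pd.cut(scores, bins=edges, labels=names)
print(binned.tolist())  # ['low', 'low', 'high', 'high']

# Passing as many labels as edges is rejected with a ValueError.
mismatch_raised = False
try:
    pd.cut(scores, bins=edges, labels=["low", "mid", "high"])
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

Counting edges rather than bins is the usual source of off-by-one mistakes here: n edges always produce n - 1 bins.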
Note: any NA value in the input stays NA in the result. Values that fall outside the bin range also appear as NA in the resulting categories.
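A small sketch of this behavior, using made-up values. Note that with the default settings the lowest edge itself is excluded, so a value equal to it also becomes NA unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one in range, one missing, one too large,
# and one sitting exactly on the lowest edge.
values = pd.Series([10, np.nan, 150, 0])

result = pd.cut(values, bins=[0, 50, 100], labels=["low", "high"])
print(result.isna().tolist())  # [False, True, True, True]

# include_lowest=True closes the first bin on the left: [0, 50].
with_lowest = pd.cut(values, bins=[0, 50, 100],
                     labels=["low", "high"], include_lowest=True)
print(with_lowest.isna().tolist())  # [False, True, True, False]
```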
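The pandas cut function makes no guarantee about how the values are distributed across the bins. In fact, we may end up defining the bins in such a way that some of them contain no values at all.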
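As a minimal illustration with made-up ages, all of which fall into the middle bin, the outer bins still exist as categories but hold zero values:

```python
import pandas as pd

# Hypothetical ages that all fall between 18 and 63.
ages = pd.Series([25, 31, 47, 52])

groups = pd.cut(ages, bins=[0, 17, 63, 99],
                labels=["Child", "Adult", "Elderly"])

# value_counts on categorical data also reports the empty categories.
counts = groups.value_counts()
print(counts)
```

If roughly equal-sized groups are needed, quantile-based binning with pandas.qcut is the usual alternative.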
Syntax:
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
Parameters:
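- x: The input array. It must be one-dimensional.
- bins: The bin edges used for splitting. It can be an int (the number of equal-width bins), a sequence of edges, or an IntervalIndex.
- right: Whether each bin includes its rightmost edge. Boolean; the default is True.
- labels: The labels for the returned bins. An array, or False to return integer bin codes instead of labels.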
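Return value: a Categorical, a Series, or a NumPy array, depending on the type of x and on labels. With retbins=True, the bins are returned as well, as a NumPy array or an IntervalIndex.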
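The effect of `right`, `labels=False`, and `retbins` can be sketched on a small made-up Series. The variable names are illustrative only.

```python
import pandas as pd

data = pd.Series([1, 5, 10])  # hypothetical values

# Default right=True: the bins are (0, 5] and (5, 10].
closed_right = pd.cut(data, bins=[0, 5, 10])
print(closed_right.tolist())

# right=False: the bins are [0, 5) and [5, 10), so 10 is out of range.
closed_left = pd.cut(data, bins=[0, 5, 10], right=False)
print(closed_left.tolist())

# labels=False returns the integer position of each bin.
codes = pd.cut(data, bins=[0, 5, 10], labels=False)
print(codes.tolist())  # [0, 0, 1]

# An integer for bins asks for equal-width bins; retbins=True
# also returns the edges that were computed.
binned, edges = pd.cut(data, bins=3, retbins=True)
print(len(edges))  # 4
```

Which side is closed matters most for values that sit exactly on an edge, such as 5 and 10 here.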
Example 1: Suppose we have an array 'Age' of 14 numbers between 1 and 100, and we want to split the data into 4 category bins:
'Baby/Toddler' :- 0 to 3 years
'Child' :- 4 to 17 years
'Adult' :- 18 to 63 years
'Elderly' :- 64 to 99 years
# Importing the pandas library
import pandas as pd
# Creating a dummy DataFrame of 14 ages
# ranging from 1-100
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4,
10, 15, 38, 22, 77]})
# Printing the DataFrame before binning the continuous
# values into categories
print("Before: ")
print(df)
# A column of name 'Label' is created in DataFrame
# Categorizing Age into 4 Categories
# Baby/Toddler: (0,3], 0 is excluded & 3 is included
# Child: (3,17], 3 is excluded & 17 is included
# Adult: (17,63], 17 is excluded & 63 is included
# Elderly: (63,99], 63 is excluded & 99 is included
df['Label'] = pd.cut(x=df['Age'], bins=[0, 3, 17, 63, 99],
labels=['Baby/Toddler', 'Child', 'Adult',
'Elderly'])
# Printing the DataFrame after binning the continuous
# values into categories
print("After: ")
print(df)
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())
Output:
Before:
Age
0 42
1 15
2 67
3 55
4 1
5 29
6 75
7 89
8 4
9 10
10 15
11 38
12 22
13 77
After:
Age Label
0 42 Adult
1 15 Child
2 67 Elderly
3 55 Adult
4 1 Baby/Toddler
5 29 Adult
6 75 Elderly
7 89 Elderly
8 4 Child
9 10 Child
10 15 Child
11 38 Adult
12 22 Adult
13 77 Elderly
Categories:
Adult 5
Elderly 4
Child 4
Baby/Toddler 1
Name: Label, dtype: int64
Example 2: Suppose we have an array 'Height' with the heights of 12 people, ranging from 150 cm to 180 cm, and we want to split the data into 3 category bins:
'Short' :- greater than 150cm up to 157cm
'Average' :- greater than 157cm up to 169cm
'Tall' :- greater than 169cm up to 180cm
# Importing the pandas library
import pandas as pd
# Creating a dummy DataFrame of 12 numbers randomly
# ranging from 150-180 for height
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
159.2, 175, 162.4, 176, 153, 170.9]})
# Printing the DataFrame before binning the continuous values into categories
print("Before: ")
print(df)
# A column of name 'Label' is created in DataFrame
# Categorizing Height into 3 Categories
# Short: (150,157], 150 is excluded & 157 is included
# Average: (157,169], 157 is excluded & 169 is included
# Tall: (169,180], 169 is excluded & 180 is included
df['Label'] = pd.cut(x=df['Height'],
bins=[150, 157, 169, 180],
labels=['Short', 'Average', 'Tall'])
# Printing the DataFrame after binning the continuous values into categories
print("After: ")
print(df)
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())
Output:
Before:
Height
0 150.4
1 157.6
2 170.0
3 176.0
4 164.2
5 155.0
6 159.2
7 175.0
8 162.4
9 176.0
10 153.0
11 170.9
After:
Height Label
0 150.4 Short
1 157.6 Average
2 170.0 Tall
3 176.0 Tall
4 164.2 Average
5 155.0 Short
6 159.2 Average
7 175.0 Tall
8 162.4 Average
9 176.0 Tall
10 153.0 Short
11 170.9 Tall
Categories:
Tall 5
Average 4
Short 3
Name: Label, dtype: int64