Pandas Cut: Continuous to Categorical

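Numeric data in data analysis is often continuous and highly skewed. Analysis sometimes becomes much easier once continuous data has been converted to discrete categories. There are many ways to do this conversion, and one of them is the built-in cut function in pandas. pandas.cut is a convenient way to turn continuous numeric data into categorical data. It has three main parts: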

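1. The first is the input, which must be one-dimensional: an array or a single DataFrame column (a Series).
2. The second is bins. The bins argument gives the edges of the separate bins for the continuous data. The first number is where the first bin starts, and each following number is where a bin ends and the next one begins. Passing the edges explicitly gives precise control over the binning.
3. The last is labels. When labels are given, there must be exactly one fewer label than bin edges, that is, one label per bin.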
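The relationship between bin edges and labels can be sketched as follows. The ages, edges, and label names here are made up for illustration and are not taken from the examples in this article.

```python
import pandas as pd

# Hypothetical data: four ages and four bin edges (which define three bins)
ages = pd.Series([2, 10, 35, 70])
edges = [0, 17, 63, 99]
labels = ["Minor", "Adult", "Elderly"]  # one label per bin

binned = pd.cut(ages, bins=edges, labels=labels)
print(binned.tolist())

# Supplying the wrong number of labels is rejected with a ValueError
try:
    pd.cut(ages, bins=edges, labels=["Minor", "Adult"])
    mismatch_raises = False
except ValueError as err:
    mismatch_raises = True
    print("ValueError:", err)
```

Four edges produce three bins, so three labels are needed. Two labels for the same edges are refused rather than silently truncated.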

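Note: any NA value in the input stays NA in the result. Values that fall outside the range covered by the bins also appear as NA in the resulting categories.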
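A small sketch of this behaviour, using made-up values. Note that with the default right=True the lowest edge itself is excluded, so a value equal to the first edge is also treated as out of range unless include_lowest=True is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one missing value, one value above the last edge,
# and one value exactly on the first edge
values = pd.Series([5, np.nan, 150, 0, 99])
edges = [0, 3, 17, 63, 99]

result = pd.cut(values, bins=edges)
print(result)
na_mask = result.isna().tolist()

# include_lowest=True makes the first bin include its left edge
with_lowest = pd.cut(values, bins=edges, include_lowest=True)
na_mask_lowest = with_lowest.isna().tolist()
print(na_mask, na_mask_lowest)
```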

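pandas cut makes no guarantee about how the values are distributed across the bins. In fact, the bins can be defined in such a way that some of them contain no values at all.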
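As an illustration, the sketch below bins a made-up set of ages in which nobody is under 18. The empty bin is still a category of the result and is reported with a count of zero.

```python
import pandas as pd

# Hypothetical ages: none fall into the first bin (0, 17]
ages = pd.Series([20, 25, 31, 70, 82])
groups = pd.cut(ages, bins=[0, 17, 63, 99],
                labels=["Minor", "Adult", "Elderly"])

# value_counts on a categorical result also lists unobserved categories
counts = groups.value_counts(sort=False)
print(counts)
```

Checking value_counts after binning is a quick way to spot bins that are empty or nearly empty before using the categories further.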

Syntax:

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

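Parameters:

x: The input array. It must be one-dimensional.
bins: The bin edges used for the segmentation. An integer can be passed instead to request that many equal-width bins.
right: Whether each bin includes its rightmost edge. Boolean. Defaults to True.
labels: The labels for the returned bins. An array or False. With False, only the integer position of each bin is returned.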
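The effect of right and labels is easiest to see on values that sit exactly on a bin edge. The following is a minimal sketch with made-up ages. The bin positions are read through .cat.codes, which numbers the bins from 0.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([3, 17, 18, 63])
edges = [0, 3, 17, 63, 99]

# Default right=True: bins are (0, 3], (3, 17], (17, 63], (63, 99]
right_closed = pd.cut(ages, bins=edges)
# right=False: bins are [0, 3), [3, 17), [17, 63), [63, 99)
left_closed = pd.cut(ages, bins=edges, right=False)
# labels=False: return the integer bin position instead of a category
codes = pd.cut(ages, bins=edges, labels=False)

print(right_closed.cat.codes.tolist())
print(left_closed.cat.codes.tolist())
print(codes.tolist())
```

With right=False the values 3, 17 and 63 each move up into the next bin, because the edge now belongs to the bin that starts there.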

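Returns: a Categorical, a Series, or a NumPy array, depending on the input type and on labels. With retbins=True the bins used are returned as well, as a NumPy array or an IntervalIndex.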
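Which type comes back depends on the input and on the arguments. The sketch below checks this on a few made-up heights. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm
heights = [150.4, 157.6, 170.0, 176.0]
edges = [150, 157, 169, 180]

# Series in, Series (category dtype) out
from_series = pd.cut(pd.Series(heights), bins=edges)
# NumPy array in, pandas Categorical out
from_array = pd.cut(np.array(heights), bins=edges)
# labels=False gives a NumPy array of integer bin positions
as_codes = pd.cut(np.array(heights), bins=edges, labels=False)
# retbins=True additionally returns the edges that were used;
# bins=3 asks for three equal-width bins, so four edges come back
binned, used_edges = pd.cut(np.array(heights), bins=3, retbins=True)

for obj in (from_series, from_array, as_codes, used_edges):
    print(type(obj).__name__)
```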

Example 1: Suppose we have an array 'Age' of 14 random numbers between 1 and 100, and we want to split the data into 4 category bins:

'Baby/Toddler' :- 0 to 3 years
'Child' :- 4 to 17 years
'Adult' :- 18 to 63 years
'Elderly' :- 64 to 99 years

# Importing pandas and numpy libraries
import pandas as pd
import numpy as np
  
# Creating a dummy DataFrame of 14 numbers randomly
# ranging from 1-100 for age
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4,
                           10, 15, 38, 22, 77]})
  
# Printing DataFrame Before sorting Continuous 
# to Categories
print("Before: ")
print(df)
  
# A column of name 'Label' is created in DataFrame
# Categorizing Age into 4 Categories
# Baby/Toddler: (0,3], 0 is excluded & 3 is included
# Child: (3,17], 3 is excluded & 17 is included
# Adult: (17,63], 17 is excluded & 63 is included
# Elderly: (63,99], 63 is excluded & 99 is included
df['Label'] = pd.cut(x=df['Age'], bins=[0, 3, 17, 63, 99],
                     labels=['Baby/Toddler', 'Child', 'Adult',
                             'Elderly'])
  
# Printing DataFrame after sorting Continuous to
# Categories
print("After: ")
print(df)
  
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())

Output:

Before: 
    Age
0    42
1    15
2    67
3    55
4     1
5    29
6    75
7    89
8     4
9    10
10   15
11   38
12   22
13   77
After: 
    Age         Label
0    42         Adult
1    15         Child
2    67       Elderly
3    55         Adult
4     1  Baby/Toddler
5    29         Adult
6    75       Elderly
7    89       Elderly
8     4         Child
9    10         Child
10   15         Child
11   38         Adult
12   22         Adult
13   77       Elderly
Categories: 
Adult           5
Elderly         4
Child           4
Baby/Toddler    1
Name: Label, dtype: int64

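Example 2: Suppose we have an array 'Height' with the heights of 12 random people, ranging from 150 cm to 180 cm, and we want to split the data into 3 category bins.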

'Short' :- greater than 150cm up to 157cm
'Average' :- greater than 157cm up to 169cm
'Tall' :- greater than 169cm up to 180cm

# Importing pandas and numpy libraries
import pandas as pd
import numpy as np
  
# Creating a dummy DataFrame of 12 numbers randomly
# ranging from 150-180 for height
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
                              159.2, 175, 162.4, 176, 153, 170.9]})
  
# Printing DataFrame Before Sorting Continuous to Categories
print("Before: ")
print(df)
  
# A column of name 'Label' is created in DataFrame
# Categorizing Height into 3 Categories
# Short: (150,157], 150 is excluded & 157 is included
# Average: (157,169], 157 is excluded & 169 is included
# Tall: (169,180], 169 is excluded & 180 is included
df['Label'] = pd.cut(x=df['Height'],
                     bins=[150, 157, 169, 180],
                     labels=['Short', 'Average', 'Tall'])
  
# Printing DataFrame After Sorting Continuous to Categories
print("After: ")
print(df)
  
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())

Output:

Before: 
    Height
0    150.4
1    157.6
2    170.0
3    176.0
4    164.2
5    155.0
6    159.2
7    175.0
8    162.4
9    176.0
10   153.0
11   170.9
After: 
    Height    Label
0    150.4    Short
1    157.6  Average
2    170.0     Tall
3    176.0     Tall
4    164.2  Average
5    155.0    Short
6    159.2  Average
7    175.0     Tall
8    162.4  Average
9    176.0     Tall
10   153.0    Short
11   170.9     Tall
Categories: 
Tall       5
Average    4
Short      3
Name: Label, dtype: int64