Generating All Combinations in Python

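Besides permutations, the itertools module also provides a function for computing the combinations of a set's elements. In a combination, order does not matter, so a given set has far fewer combinations than permutations. For a set of p elements, the number of combinations of r elements is:

$$\binom{p}{r} = \frac{p!}{r!\,(p-r)!}$$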

For example, there are 2,598,960 possible 5-card poker hands. The following code lists all of them:

from itertools import combinations, product
hands = list(
    combinations(tuple(product(range(13), '♠♥♦♣')), 5))
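
Building the full list of hands takes a noticeable amount of memory, so it can be worth confirming the count first. The sketch below is an illustration, not part of the original example. It assumes Python 3.8 or later for math.comb, and the names deck, hand_count and first_hands are made up for it.

```python
from itertools import combinations, islice, product
from math import comb

# A 52-card deck as (rank, suit) pairs, built the same way as above.
deck = tuple(product(range(13), '♠♥♦♣'))

# Number of 5-card hands: 52! / (5! * 47!)
hand_count = comb(len(deck), 5)

# combinations() is lazy, so a few hands can be inspected
# without materializing all of them.
first_hands = list(islice(combinations(deck, 5), 3))

print(hand_count)
print(first_hands[0])
```

Because combinations() returns an iterator, the same loop-and-filter approach works on the whole space when the list itself is not needed.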

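In practice, exploratory analysis of a data set with many variables often calls for the correlation between every pair of variables. With v variables, the following expression enumerates all the pairs that need to be compared: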

combinations(range(v), 2)
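
To make the shape of the result concrete, here is a small sketch with a hypothetical v = 4. It also checks the pair count against v(v-1)/2, which is the formula for r = 2.

```python
from itertools import combinations

v = 4  # hypothetical number of variables, chosen only for illustration

pairs = list(combinations(range(v), 2))
print(pairs)
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

# Each unordered pair appears exactly once, so the count is v*(v-1)/2.
print(len(pairs) == v * (v - 1) // 2)  # True
```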

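The sample data below, taken from http://www.tylervigen.com, illustrates the complete workflow. We first pick three samples that cover the same time range (numbers 7, 43, and 3890) and place them side by side in a single table, each keeping its own "year" column.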

The header row of the table and the data rows that follow, ordered by year, look like this:

[('year', 'Per capita consumption of cheese (US)Pounds (USDA)',
 'Number of people who died by becoming tangled in their bedsheetsDeaths (US) (CDC)',
 'year', 'Per capita consumption of mozzarella cheese (US)Pounds (USDA)',
 'Civil engineering doctorates awarded (US)Degrees awarded (National Science Foundation)',
 'year', 'US crude oil imports from VenezuelaMillions of barrels (Dept. of Energy)',
 'Per capita consumption of high fructose corn syrup (US)Pounds (USDA)'),
 (2000, 29.8, 327, 2000, 9.3, 480, 2000, 446, 62.6),
 (2001, 30.1, 456, 2001, 9.7, 501, 2001, 471, 62.5),
 (2002, 30.5, 509, 2002, 9.7, 540, 2002, 438, 62.8),
 (2003, 30.6, 497, 2003, 9.7, 552, 2003, 436, 60.9),
 (2004, 31.3, 596, 2004, 9.9, 547, 2004, 473, 59.8),
 (2005, 31.7, 573, 2005, 10.2, 622, 2005, 449, 59.1),
 (2006, 32.6, 661, 2006, 10.5, 655, 2006, 416, 58.2),
 (2007, 33.1, 741, 2007, 11, 701, 2007, 420, 56.1),
 (2008, 32.7, 809, 2008, 10.6, 712, 2008, 381, 53),
 (2009, 32.8, 717, 2009, 10.6, 708, 2009, 352, 50.1)]

Use the combinations() function to generate every pairwise comparison among the nine variables.

combinations(range(9), 2)

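This yields 36 combinations. The ones that pair two year columns should be discarded, since their correlation coefficient is trivially 1.00.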
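
One way to see how many pairs remain is to filter by column position. This is only a sketch: it assumes the three "year" columns sit at positions 0, 3 and 6, as in the header row shown above, and the names year_columns and useful_pairs are hypothetical.

```python
from itertools import combinations

# Assumption: positions of the "year" columns in the table above.
year_columns = {0, 3, 6}

all_pairs = list(combinations(range(9), 2))

# Drop only the pairs in which both columns are year columns.
useful_pairs = [
    (p, q) for p, q in all_pairs
    if not (p in year_columns and q in year_columns)
]

print(len(all_pairs), len(useful_pairs))  # 36 33
```

Filtering by position depends on the column layout. Comparing header text instead avoids that dependency.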

The following function extracts a column from the data set:

from typing import TypeVar, Iterator, Iterable, List
T_ = TypeVar("T_")
def column(source: Iterable[List[T_]], x: int) -> Iterator[T_]:
    for row in source:
        yield row[x]

The corr() function introduced earlier then compares two columns of data.
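
The corr() function is defined elsewhere and is not reproduced here. For readers who want to run the example on its own, the following is a minimal stand-in that computes the Pearson correlation coefficient. The name pearson_corr and its signature are assumptions made for this sketch, and it may differ from the original corr() in detail.

```python
from math import sqrt
from typing import Sequence


def pearson_corr(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples.

    Hypothetical stand-in for corr(); not the original implementation.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant column makes the denominator zero (ZeroDivisionError).
    return cov / sqrt(var_x * var_y)


print(round(pearson_corr([1, 2, 3], [2, 4, 6]), 2))  # 1.0
print(round(pearson_corr([1, 2, 3], [3, 2, 1]), 2))  # -1.0
```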

The correlation coefficients for all combinations are computed as follows:

from itertools import combinations
from Chapter_4.ch04_ex4 import corr
for p, q in combinations(range(9), 2):
    header_p, *data_p = list(column(source, p))
    header_q, *data_q = list(column(source, q))
    if header_p == header_q:
        continue
    r_pq = corr(data_p, data_q)
    print("{2: 4.2f}: {0} vs {1}".format(header_p, header_q, r_pq))

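Each pair of columns is first extracted from the data set. The header_p, *data_p = statement uses multiple assignment to separate the first item of the sequence (the header) from the data that follows. If the two headers match, the same variable is being compared with itself. Because this data set contains three duplicate year columns, such pairs have to be skipped.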
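
The starred assignment can be tried in isolation. In this small sketch the list col is a hypothetical stand-in for one extracted column, using the first three cheese values from the table above.

```python
# One column as it comes out of column(): header first, then values.
col = ['Per capita consumption of cheese (US)Pounds (USDA)',
       29.8, 30.1, 30.5]

# The first item binds to header; the starred name collects the rest
# into a list.
header, *data = col

print(header)
print(data)  # [29.8, 30.1, 30.5]
```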

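The correlation function is then applied to the two columns to obtain the coefficient, which is printed together with the column headers. The data sets were chosen deliberately: they follow unrelated patterns yet show high, spurious correlations.
The results are as follows: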

0.96: year vs Per capita consumption of cheese (US) Pounds (USDA)
0.95: year vs Number of people who died by becoming tangled in their
bedsheetsDeaths (US) (CDC)
0.92: year vs Per capita consumption of mozzarella cheese (US) Pounds (USDA)
0.98: year vs Civil engineering doctorates awarded (US) Degrees awarded (National
Science Foundation)
-0.80: year vs US crude oil imports from Venezuela Millions of barrels
(Dept. of Energy)
-0.95: year vs Per capita consumption of high fructose corn syrup (US) Pounds (USDA)
0.95: Per capita consumption of cheese (US) Pounds (USDA) vs Number of people who
died by becoming tangled in their bedsheetsDeaths (US) (CDC)
0.96: Per capita consumption of cheese (US) Pounds (USDA) vs year
0.98: Per capita consumption of cheese (US) Pounds (USDA) vs Per capita
consumption of mozzarella cheese (US) Pounds (USDA)
...
0.88: US crude oil imports from VenezuelaMillions of barrels (Dept. of Energy)
vs Per capita consumption of high fructose corn syrup (US) Pounds (USDA)

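It is not at all clear what these patterns mean. Why do these values correlate? Ambiguous correlations with no evident meaning can confuse a statistical analysis. What we have located here is data with strikingly high correlations and no plausible factor linking the variables.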

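The key point is that the simple expression combinations(range(9), 2) generated every possible pairing of the data. Simple, convenient techniques like this let us concentrate on the data analysis problem itself instead of on building a combinatorial algorithm.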
