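Clustering Dow Jones stocks with scikit-learn
Clustering is a family of machine learning algorithms that group items by similarity. In this example, we cluster the component stocks of the Dow Jones Industrial Average using their log returns.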
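How to do it
First, we download end-of-day price data for these stocks from Yahoo Finance. Next, we compute a square affinity matrix. Finally, we cluster the stocks with the AffinityPropagation class.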
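- Download the price data.
Using the ticker symbols of the Dow Jones Industrial Average components, we download price data for 2011. In this example, we are only interested in the closing prices.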
# 2011 to 2012
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)
# Ticker symbols of the Dow Jones Industrial Average components
symbols = ["AA", "AXP", "BA", "BAC", "CAT",
    "CSCO", "CVX", "DD", "DIS", "GE", "HD",
    "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT",
    "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",
    "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]
# Note: matplotlib.finance was deprecated in matplotlib 2.0 and is absent from
# recent releases; the Yahoo Finance endpoint it calls may also no longer respond
quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
    for symbol in symbols]
# One row of closing prices per stock
close = numpy.array([q.close for q in quotes]).astype(float)
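Because the download step depends on an external service, it can help to try the remaining steps offline. The sketch below is not part of the original recipe: it builds a stand-in `close` array of the same shape, (30, 252), from random walks. The seed, the 1% daily volatility, the starting price of 100, and the name `fake_logreturns` are all arbitrary assumptions chosen for illustration.

```python
import numpy

# Hypothetical stand-in for the downloaded data: 30 stocks, 252 trading days.
rng = numpy.random.default_rng(42)  # arbitrary seed, for repeatability
n_stocks, n_days = 30, 252

# Daily log returns drawn from a normal distribution (assumed 1% volatility).
fake_logreturns = rng.normal(loc=0.0, scale=0.01, size=(n_stocks, n_days - 1))

# Turn log returns into prices: start every stock at 100 and accumulate returns.
start_log_price = numpy.full((n_stocks, 1), numpy.log(100.0))
log_prices = numpy.hstack(
    [start_log_price, start_log_price + numpy.cumsum(fake_logreturns, axis=1)])
close = numpy.exp(log_prices)

print(close.shape)
```

Random walks carry no real sector structure, so any clusters found on this data are meaningless; the array is only useful for checking that the later steps run.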
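- Compute the affinity matrix.
Using the log returns as features, we compute the similarity between stocks. Here the similarity is the negative squared Euclidean distance between data points, so more similar stocks get larger (less negative) values.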
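# Log returns along the time axis, one row per stock
logreturns = numpy.diff(numpy.log(close))
print(logreturns.shape)
# Squared norm of each stock's log-return vector
logreturns_norms = numpy.sum(logreturns ** 2, axis=1)
# S[i, j] = -||x_i - x_j||^2, expanded as -||x_i||^2 - ||x_j||^2 + 2 x_i . x_j
S = - logreturns_norms[:, numpy.newaxis] - logreturns_norms[numpy.newaxis, :] + 2 * numpy.dot(logreturns, logreturns.T)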
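The one-line formula for S is the expansion of the negative squared Euclidean distance, since ||a - b||² = ||a||² + ||b||² - 2 a·b. As a quick sanity check, the sketch below compares the vectorized form with a direct pairwise computation on three made-up return vectors; the values in `x` are illustrative, not market data.

```python
import numpy

# Three toy "stocks" with four made-up log returns each (illustrative values only).
x = numpy.array([[0.01, -0.02, 0.00, 0.03],
                 [0.01, -0.01, 0.00, 0.02],
                 [-0.05, 0.04, 0.02, -0.01]])

# The vectorized form used in the text.
norms = numpy.sum(x ** 2, axis=1)
S = -norms[:, numpy.newaxis] - norms[numpy.newaxis, :] + 2 * numpy.dot(x, x.T)

# Direct definition: negative squared Euclidean distance for every pair of rows.
S_direct = numpy.array([[-numpy.sum((a - b) ** 2) for b in x] for a in x])

print(numpy.allclose(S, S_direct))  # True
```

The diagonal of S is zero up to rounding (each stock is at distance zero from itself), and the first two rows, which move together, are more similar to each other than either is to the third.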
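- Cluster the stocks.
We pass the result of the previous step to the AffinityPropagation class. This class labels the data points; in this example, it assigns each stock a cluster number.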
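# affinity="precomputed" tells recent scikit-learn releases that S is already an affinity matrix
aff_pro = sklearn.cluster.AffinityPropagation(affinity="precomputed").fit(S)
labels = aff_pro.labels_
for i in range(len(labels)):
    print('%s in Cluster %d' % (symbols[i], labels[i]))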
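To see the clustering step in isolation, here is a minimal sketch on six made-up 2D points that form two well-separated groups. It assumes a reasonably recent scikit-learn in which `AffinityPropagation` accepts `affinity="precomputed"`; the points and the names `points`, `S_toy`, and `model` are illustrative and not part of the recipe.

```python
import numpy
from sklearn.cluster import AffinityPropagation

# Six made-up 2D points forming two well-separated groups (illustrative only).
points = numpy.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])

# Negative squared Euclidean distances, built the same way as S in the text.
norms = numpy.sum(points ** 2, axis=1)
S_toy = (-norms[:, numpy.newaxis] - norms[numpy.newaxis, :]
         + 2 * numpy.dot(points, points.T))

# affinity="precomputed" means fit() receives similarities, not raw features.
model = AffinityPropagation(affinity="precomputed").fit(S_toy)

print(model.labels_)                   # one cluster label per point
print(model.cluster_centers_indices_)  # row indices of the chosen exemplars
```

Unlike k-means, affinity propagation does not take the number of clusters as input; it picks exemplars from the data itself, which is why the recipe never specifies how many clusters to expect.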
The complete clustering program follows.
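import datetime
import numpy
import sklearn.cluster
# Note: matplotlib.finance was deprecated in matplotlib 2.0 and is absent from
# recent releases; the Yahoo Finance endpoint it calls may also no longer respond
from matplotlib import finance

# 1. Download the price data
# 2011 to 2012
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)
# Ticker symbols of the Dow Jones Industrial Average components
symbols = ["AA", "AXP", "BA", "BAC", "CAT",
    "CSCO", "CVX", "DD", "DIS", "GE", "HD",
    "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT",
    "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",
    "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]
quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
    for symbol in symbols]
# One row of closing prices per stock
close = numpy.array([q.close for q in quotes]).astype(float)
print(close.shape)

# 2. Compute the affinity matrix
# Log returns along the time axis, one row per stock
logreturns = numpy.diff(numpy.log(close))
print(logreturns.shape)
logreturns_norms = numpy.sum(logreturns ** 2, axis=1)
# S[i, j] = -||x_i - x_j||^2, the negative squared Euclidean distance
S = - logreturns_norms[:, numpy.newaxis] - logreturns_norms[numpy.newaxis, :] + 2 * numpy.dot(logreturns, logreturns.T)

# 3. Affinity propagation clustering
# affinity="precomputed" tells recent scikit-learn releases that S is already an affinity matrix
aff_pro = sklearn.cluster.AffinityPropagation(affinity="precomputed").fit(S)
labels = aff_pro.labels_
for i in range(len(labels)):
    print('%s in Cluster %d' % (symbols[i], labels[i]))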
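The program first prints the shapes of the closing-price and log-return arrays, then lists each ticker symbol with its cluster number, as shown below.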
(30, 252)
(30, 251)
AA in Cluster 0
AXP in Cluster 6
BA in Cluster 6
BAC in Cluster 1
CAT in Cluster 6
CSCO in Cluster 2
CVX in Cluster 7
DD in Cluster 6
DIS in Cluster 6
GE in Cluster 6
HD in Cluster 5
HPQ in Cluster 3
IBM in Cluster 5
INTC in Cluster 6
JNJ in Cluster 5
JPM in Cluster 4
KFT in Cluster 5
KO in Cluster 5
MCD in Cluster 5
MMM in Cluster 6
MRK in Cluster 5
MSFT in Cluster 5
PFE in Cluster 7
PG in Cluster 5
T in Cluster 5
TRV in Cluster 5
UTX in Cluster 6
VZ in Cluster 5
WMT in Cluster 5
XOM in Cluster 7
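A per-stock listing is hard to scan for structure. As one possible follow-up, the sketch below regroups the labels by cluster; the `symbols` and `labels` lists are copied from the output above rather than recomputed, and the `clusters` dictionary is an illustrative helper, not part of the recipe.

```python
from collections import defaultdict

# Symbols and cluster labels copied from the printed output above.
symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO", "CVX", "DD", "DIS", "GE",
           "HD", "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT", "KO", "MCD", "MMM",
           "MRK", "MSFT", "PFE", "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]
labels = [0, 6, 6, 1, 6, 2, 7, 6, 6, 6,
          5, 3, 5, 6, 5, 4, 5, 5, 5, 6,
          5, 5, 7, 5, 5, 5, 6, 5, 5, 7]

# Collect the symbols that share a cluster label.
clusters = defaultdict(list)
for symbol, label in zip(symbols, labels):
    clusters[label].append(symbol)

# Print one line per cluster, in label order.
for label in sorted(clusters):
    print('Cluster %d: %s' % (label, ', '.join(clusters[label])))
```

Grouped this way, the output shows eight clusters: five contain a single stock (AA, BAC, CSCO, HPQ, JPM), cluster 7 holds CVX, PFE, and XOM, and the remaining stocks split between clusters 5 and 6.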
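How it works
The following table summarizes the functions used in this recipe:
Function | Description |
---|---|
sklearn.cluster.AffinityPropagation() | Creates an AffinityPropagation object |
sklearn.cluster.AffinityPropagation.fit | Applies the affinity propagation clustering algorithm. By default it first builds the affinity matrix from negative squared Euclidean distances; with affinity="precomputed" it uses the matrix passed in |
diff | Computes the differences between consecutive elements of a NumPy array. Unless specified otherwise, it computes first-order differences |
log | Computes the natural logarithm of each element of a NumPy array |
sum | Sums the elements of a NumPy array |
dot | Performs matrix multiplication for 2D arrays and the inner product for 1D arrays |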