NumPy: Clustering Dow Jones component stocks with scikit-learn

Clustering is a family of machine learning algorithms that group objects by similarity. This example clusters the component stocks of the Dow Jones Industrial Average using their log returns.

How to do it

First, download end-of-day price data for these stocks from Yahoo Finance. Then compute a square affinity matrix. Finally, cluster the stocks with the AffinityPropagation class.

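  1. Download the stock price data.

Download the 2011 price data using the ticker symbols of the Dow Jones Industrial Average components. In this example we are only interested in the closing price.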

# 2011 to 2012
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)

# Ticker symbols of the Dow Jones Industrial Average components
symbols = ["AA", "AXP", "BA", "BAC", "CAT",
           "CSCO", "CVX", "DD", "DIS", "GE", "HD",
           "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT",
           "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",
           "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]

# Note: matplotlib.finance has been removed from current matplotlib releases,
# so this call needs a matplotlib version that still ships the module.
quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
          for symbol in symbols]

# One row of closing prices per stock
close = numpy.array([q.close for q in quotes]).astype(float)

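  2. Compute the affinity matrix.

We use log returns as the measure of similarity between stocks. Specifically, the affinity of two stocks is the negative squared Euclidean distance between their log-return vectors, so more similar stocks get a larger (less negative) value.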

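# Log returns along the time axis (one value fewer per stock than the prices)
logreturns = numpy.diff(numpy.log(close))
print(logreturns.shape)

# S[i, j] = -||r_i - r_j||^2 = -||r_i||^2 - ||r_j||^2 + 2 * (r_i . r_j)
logreturns_norms = numpy.sum(logreturns ** 2, axis=1)
S = (- logreturns_norms[:, numpy.newaxis]
     - logreturns_norms[numpy.newaxis, :]
     + 2 * numpy.dot(logreturns, logreturns.T))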
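
The expression for S relies on the identity -||x - y||^2 = -||x||^2 - ||y||^2 + 2 x.y. As a quick check, the following sketch compares it with a direct pairwise computation. The prices are made-up numbers for three hypothetical stocks, not real quotes, and the names S_direct and diffs are introduced here only for illustration.

```python
import numpy as np

# Made-up closing prices: three hypothetical stocks (rows) over five days (columns).
close = np.array([
    [10.0, 10.5, 10.2, 10.8, 11.0],
    [20.0, 19.5, 19.8, 20.4, 20.1],
    [5.0, 5.1, 5.3, 5.2, 5.6],
])

# Log returns along the time axis: one column fewer than the prices.
logreturns = np.diff(np.log(close))

# Affinity via the expanded form used in the recipe.
norms = np.sum(logreturns ** 2, axis=1)
S = -norms[:, np.newaxis] - norms[np.newaxis, :] + 2 * np.dot(logreturns, logreturns.T)

# The same quantity computed directly from pairwise differences.
diffs = logreturns[:, np.newaxis, :] - logreturns[np.newaxis, :, :]
S_direct = -np.sum(diffs ** 2, axis=2)

print(logreturns.shape)
print(np.allclose(S, S_direct))
```

The expanded form avoids building the three-dimensional diffs array, which matters little for 30 stocks but grows quickly with more rows.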

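  3. Cluster the stocks.

Pass the result of the previous step to the AffinityPropagation class. This class labels the data points; in this example, it assigns each stock a cluster number.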

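# S is already a similarity matrix, so tell scikit-learn not to recompute it
aff_pro = sklearn.cluster.AffinityPropagation(affinity='precomputed').fit(S)
labels = aff_pro.labels_

for symbol, label in zip(symbols, labels):
    print('%s in Cluster %d' % (symbol, label))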
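
AffinityPropagation can be handed a ready-made similarity matrix by passing affinity='precomputed'. The following is a minimal sketch on six made-up 2-D points rather than stock returns, so it runs without any downloaded data. It assumes a scikit-learn release whose AffinityPropagation accepts the affinity and random_state arguments; the points and variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Six made-up points forming two separate groups (not market data).
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# Similarity = negative squared Euclidean distance between every pair of rows.
norms = np.sum(X ** 2, axis=1)
S = -norms[:, np.newaxis] - norms[np.newaxis, :] + 2 * np.dot(X, X.T)

# affinity='precomputed' makes fit() use S directly as the similarity matrix.
# random_state only fixes the tiny tie-breaking noise for reproducibility.
model = AffinityPropagation(affinity='precomputed', random_state=0).fit(S)
labels = model.labels_

for point, label in zip(X, labels):
    print('%s in Cluster %d' % (point, label))
```

The number of clusters is not specified anywhere; affinity propagation chooses it from the similarities and the preference value, which defaults to the median of the input similarities.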

The complete clustering program follows.

import datetime
import numpy
import sklearn.cluster
# Note: matplotlib.finance has been removed from current matplotlib releases,
# so this import needs a matplotlib version that still ships the module.
from matplotlib import finance

# 1. Download the stock price data

# 2011 to 2012
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)

# Ticker symbols of the Dow Jones Industrial Average components
symbols = ["AA", "AXP", "BA", "BAC", "CAT",
           "CSCO", "CVX", "DD", "DIS", "GE", "HD",
           "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT",
           "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",
           "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]

quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
          for symbol in symbols]

# One row of closing prices per stock
close = numpy.array([q.close for q in quotes]).astype(float)
print(close.shape)

# 2. Compute the affinity matrix
logreturns = numpy.diff(numpy.log(close))
print(logreturns.shape)

# S[i, j] = -||r_i - r_j||^2 = -||r_i||^2 - ||r_j||^2 + 2 * (r_i . r_j)
logreturns_norms = numpy.sum(logreturns ** 2, axis=1)
S = (- logreturns_norms[:, numpy.newaxis]
     - logreturns_norms[numpy.newaxis, :]
     + 2 * numpy.dot(logreturns, logreturns.T))

# 3. Affinity propagation clustering
# S is already a similarity matrix, so pass affinity='precomputed'
aff_pro = sklearn.cluster.AffinityPropagation(affinity='precomputed').fit(S)
labels = aff_pro.labels_

for symbol, label in zip(symbols, labels):
    print('%s in Cluster %d' % (symbol, label))

The program's output lists each ticker symbol with its cluster number, as shown below.

(30, 252)
(30, 251)
AA in Cluster 0
AXP in Cluster 6
BA in Cluster 6
BAC in Cluster 1
CAT in Cluster 6
CSCO in Cluster 2
CVX in Cluster 7
DD in Cluster 6
DIS in Cluster 6
GE in Cluster 6
HD in Cluster 5
HPQ in Cluster 3
IBM in Cluster 5
INTC in Cluster 6
JNJ in Cluster 5
JPM in Cluster 4
KFT in Cluster 5
KO in Cluster 5
MCD in Cluster 5
MMM in Cluster 6
MRK in Cluster 5
MSFT in Cluster 5
PFE in Cluster 7
PG in Cluster 5
T in Cluster 5
TRV in Cluster 5
UTX in Cluster 6
VZ in Cluster 5
WMT in Cluster 5
XOM in Cluster 7
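
The per-stock listing is easier to read when regrouped by cluster. The sketch below does that with the (symbol, cluster) pairs copied from the output above; the names assignments and clusters are introduced here for illustration and are not part of the original program.

```python
from collections import defaultdict

# (symbol, cluster) pairs copied from the program output listed above.
assignments = [
    ("AA", 0), ("AXP", 6), ("BA", 6), ("BAC", 1), ("CAT", 6),
    ("CSCO", 2), ("CVX", 7), ("DD", 6), ("DIS", 6), ("GE", 6),
    ("HD", 5), ("HPQ", 3), ("IBM", 5), ("INTC", 6), ("JNJ", 5),
    ("JPM", 4), ("KFT", 5), ("KO", 5), ("MCD", 5), ("MMM", 6),
    ("MRK", 5), ("MSFT", 5), ("PFE", 7), ("PG", 5), ("T", 5),
    ("TRV", 5), ("UTX", 6), ("VZ", 5), ("WMT", 5), ("XOM", 7),
]

# Collect the symbols that share a cluster number.
clusters = defaultdict(list)
for symbol, label in assignments:
    clusters[label].append(symbol)

for label in sorted(clusters):
    print('Cluster %d: %s' % (label, ', '.join(clusters[label])))
```

Inside the clustering program itself, the same loop could start from zip(symbols, labels) instead of a hard-coded list.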

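Recipe summary

The following table summarizes the functions used in this chapter:

Function                                 Description
sklearn.cluster.AffinityPropagation()    Creates an AffinityPropagation object
sklearn.cluster.AffinityPropagation.fit  Runs affinity propagation clustering. By default the affinity matrix is built from negative squared Euclidean distances; with affinity='precomputed' the supplied matrix is used as is
diff                                     Computes the differences between adjacent elements of a NumPy array; first-order differences unless specified otherwise
log                                      Computes the natural logarithm of each element of a NumPy array
sum                                      Sums the elements of a NumPy array
dot                                      Matrix multiplication for 2-D arrays; inner product for 1-D arrays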
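
The NumPy functions in the table can be tried on a few small arrays. This is a minimal sketch with made-up numbers, chosen only so the results are easy to check by hand.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0, 16.0])

# diff: first-order differences by default, higher orders via n.
first = np.diff(a)          # differences between neighbours
second = np.diff(a, n=2)    # differences of those differences

# log: element-wise natural logarithm.
logs = np.log(np.array([1.0, np.e]))

# sum: total of all elements, or along one axis of a 2-D array.
m = np.array([[1, 2], [3, 4]])
total = np.sum(m)
row_sums = np.sum(m, axis=1)

# dot: inner product for 1-D inputs, matrix product for 2-D inputs.
inner = np.dot(np.array([1, 2, 3]), np.array([4, 5, 6]))
product = np.dot(m, np.array([[5, 6], [7, 8]]))

print(first, second, logs, total, row_sums, inner, product)
```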
