<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>超群.com's Blog &#187; Python</title>
	<atom:link href="http://www.fuchaoqun.com/category/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.fuchaoqun.com</link>
	<description></description>
	<lastBuildDate>Thu, 08 Sep 2011 15:08:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>A Memory-Efficient Sparse Matrix Storage Scheme in Python</title>
		<link>http://www.fuchaoqun.com/2010/03/python-sparse-matrix/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=python-sparse-matrix</link>
		<comments>http://www.fuchaoqun.com/2010/03/python-sparse-matrix/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 08:15:44 +0000</pubDate>
		<dc:creator>超群.com</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[sparse matrix]]></category>
		<guid isPermaLink="false">http://www.fuchaoqun.com/?p=333</guid>
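		<description><![CDATA[Recommender systems often have to process data of the form user_id, item_id, rating, which is mathematically a sparse matrix. scipy provides the sparse module for this, but scipy.sparse has several problems that make it a poor fit: 1. it does not support fast slicing of data[i, ...], data[..., j] and data[i, j] all at once; 2. because the data lives in memory, it does not handle very large datasets well. To support fast slicing of data[i, ...] and data[..., j], the data for each i or j must be stored together; and to hold very large datasets, part of the data must live on disk, with memory acting as a buffer. The solution here is fairly simple: store the data in a dict-like object. For a given i (say 9527), its data is stored in dict['i9527']; likewise, for a given j (say 3306), all of its data is stored in dict['j3306']. To fetch data[9527, ...], just read dict['i9527']. dict['i9527'] would naturally be a dict mapping each j to its value, but to save memory we store that dict as a binary string. Here is the code: ''' Sparse Matrix ''' import struct import numpy as np import bsddb from cStringIO import StringIO &#160; class DictMatrix&#40;&#41;: def __init__&#40;self, container = &#123;&#125;, dft = 0.0&#41;: self._data = container self._dft = dft self._nums = 0 &#160; def __setitem__&#40;self, index, value&#41;: try: i, j = [...]]]></description>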
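			<content:encoded><![CDATA[<p>Recommender systems often have to process data of the form user_id, item_id, rating, which is mathematically a sparse matrix. scipy provides the sparse module for this, but scipy.sparse has several problems that make it a poor fit: 1. it does not support fast slicing of data[i, ...], data[..., j] and data[i, j] all at once; 2. because the data lives in memory, it does not handle very large datasets well.</p>
<p>To support fast slicing of data[i, ...] and data[..., j], the data for each i or j must be stored together; and to hold very large datasets, part of the data must live on disk, with memory acting as a buffer. The solution here is fairly simple: store the data in a dict-like object. For a given i (say 9527), its data is stored in dict['i9527']; likewise, for a given j (say 3306), all of its data is stored in dict['j3306']. To fetch data[9527, ...], just read dict['i9527']. dict['i9527'] would naturally be a dict mapping each j to its value, but to save memory we store that dict as a binary string. Here is the code:</p>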
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
Sparse Matrix
'</span><span style="color: #483d8b;">''</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">struct</span>
<span style="color: #ff7700;font-weight:bold;">import</span> numpy <span style="color: #ff7700;font-weight:bold;">as</span> np
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">bsddb</span>
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">cStringIO</span> <span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">StringIO</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> DictMatrix<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
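    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, container = <span style="color: #008000;">None</span>, dft = <span style="color: #ff4500;">0.0</span><span style="color: black;">&#41;</span>:
        <span style="color: #808080; font-style: italic;"># Avoid a shared mutable default: create a fresh dict per instance</span>
        <span style="color: #008000;">self</span>._data  = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span> <span style="color: #ff7700;font-weight:bold;">if</span> container <span style="color: #ff7700;font-weight:bold;">is</span> <span style="color: #008000;">None</span> <span style="color: #ff7700;font-weight:bold;">else</span> container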
        <span style="color: #008000;">self</span>._dft   = dft
        <span style="color: #008000;">self</span>._nums  = <span style="color: #ff4500;">0</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__setitem__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, index, value<span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">try</span>:
            i, j = index
        <span style="color: #ff7700;font-weight:bold;">except</span>:
            <span style="color: #ff7700;font-weight:bold;">raise</span> <span style="color: #008000;">IndexError</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'invalid index'</span><span style="color: black;">&#41;</span>
&nbsp;
        ik = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'i%d'</span> <span style="color: #66cc66;">%</span> i<span style="color: black;">&#41;</span>
        <span style="color: #808080; font-style: italic;"># To save memory, pack j and value into a binary string</span>
        ib = <span style="color: #dc143c;">struct</span>.<span style="color: black;">pack</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'if'</span>, j, value<span style="color: black;">&#41;</span>
        jk = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'j%d'</span> <span style="color: #66cc66;">%</span> j<span style="color: black;">&#41;</span>
        jb = <span style="color: #dc143c;">struct</span>.<span style="color: black;">pack</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'if'</span>, i, value<span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">try</span>:
            <span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>ik<span style="color: black;">&#93;</span> += ib
        <span style="color: #ff7700;font-weight:bold;">except</span>:
            <span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>ik<span style="color: black;">&#93;</span> = ib
        <span style="color: #ff7700;font-weight:bold;">try</span>:
            <span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>jk<span style="color: black;">&#93;</span> += jb
        <span style="color: #ff7700;font-weight:bold;">except</span>:
            <span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>jk<span style="color: black;">&#93;</span> = jb
        <span style="color: #008000;">self</span>._nums += <span style="color: #ff4500;">1</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__getitem__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, index<span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">try</span>:
            i, j = index
        <span style="color: #ff7700;font-weight:bold;">except</span>:
            <span style="color: #ff7700;font-weight:bold;">raise</span> <span style="color: #008000;">IndexError</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'invalid index'</span><span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: black;">&#40;</span><span style="color: #008000;">isinstance</span><span style="color: black;">&#40;</span>i, <span style="color: #008000;">int</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:
            ik = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'i%d'</span> <span style="color: #66cc66;">%</span> i<span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">self</span>._data.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>ik<span style="color: black;">&#41;</span>: <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>._dft
            ret = <span style="color: #008000;">dict</span><span style="color: black;">&#40;</span>np.<span style="color: black;">fromstring</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>ik<span style="color: black;">&#93;</span>, dtype = <span style="color: #483d8b;">'i4,f4'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: black;">&#40;</span><span style="color: #008000;">isinstance</span><span style="color: black;">&#40;</span>j, <span style="color: #008000;">int</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>: <span style="color: #ff7700;font-weight:bold;">return</span> ret.<span style="color: black;">get</span><span style="color: black;">&#40;</span>j, <span style="color: #008000;">self</span>._dft<span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: black;">&#40;</span><span style="color: #008000;">isinstance</span><span style="color: black;">&#40;</span>j, <span style="color: #008000;">int</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>:
            jk = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'j%d'</span> <span style="color: #66cc66;">%</span> j<span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">self</span>._data.<span style="color: black;">has_key</span><span style="color: black;">&#40;</span>jk<span style="color: black;">&#41;</span>: <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>._dft
            ret = <span style="color: #008000;">dict</span><span style="color: black;">&#40;</span>np.<span style="color: black;">fromstring</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>jk<span style="color: black;">&#93;</span>, dtype = <span style="color: #483d8b;">'i4,f4'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">return</span> ret
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__len__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>._nums
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__iter__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">pass</span>
&nbsp;
    <span style="color: #483d8b;">''</span><span style="color: #483d8b;">'
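    Build the matrix from a file.
    Since dbm reads and writes are slower than memory, entries are cached and written out in batches of 10 million.
    Since repeated string concatenation is slow, StringIO is used to assemble the data.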
    '</span><span style="color: #483d8b;">''</span>
    <span style="color: #ff7700;font-weight:bold;">def</span> from_file<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, fp, sep = <span style="color: #483d8b;">'<span style="color: #000099; font-weight: bold;">\t</span>'</span><span style="color: black;">&#41;</span>:
        cnt = <span style="color: #ff4500;">0</span>
        cache = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> l <span style="color: #ff7700;font-weight:bold;">in</span> fp:
            <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff4500;">10000000</span> == cnt:
                <span style="color: #008000;">self</span>._flush<span style="color: black;">&#40;</span>cache<span style="color: black;">&#41;</span>
                cnt = <span style="color: #ff4500;">0</span>
                cache = <span style="color: black;">&#123;</span><span style="color: black;">&#125;</span>
            i, j, v = l.<span style="color: black;">split</span><span style="color: black;">&#40;</span>sep<span style="color: black;">&#41;</span>
            <span style="color: #808080; font-style: italic;"># i and j are integer ids for struct's 'i' format; only the rating is a float</span>
            i, j, v = <span style="color: #008000;">int</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>, <span style="color: #008000;">int</span><span style="color: black;">&#40;</span>j<span style="color: black;">&#41;</span>, <span style="color: #008000;">float</span><span style="color: black;">&#40;</span>v<span style="color: black;">&#41;</span>
&nbsp;
            ik = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'i%d'</span> <span style="color: #66cc66;">%</span> i<span style="color: black;">&#41;</span>
            ib = <span style="color: #dc143c;">struct</span>.<span style="color: black;">pack</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'if'</span>, j, v<span style="color: black;">&#41;</span>
            jk = <span style="color: black;">&#40;</span><span style="color: #483d8b;">'j%d'</span> <span style="color: #66cc66;">%</span> j<span style="color: black;">&#41;</span>
            jb = <span style="color: #dc143c;">struct</span>.<span style="color: black;">pack</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'if'</span>, i, v<span style="color: black;">&#41;</span>
&nbsp;
            <span style="color: #ff7700;font-weight:bold;">try</span>:
                cache<span style="color: black;">&#91;</span>ik<span style="color: black;">&#93;</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>ib<span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">except</span>:
                cache<span style="color: black;">&#91;</span>ik<span style="color: black;">&#93;</span> = <span style="color: #dc143c;">StringIO</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                cache<span style="color: black;">&#91;</span>ik<span style="color: black;">&#93;</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>ib<span style="color: black;">&#41;</span>
&nbsp;
            <span style="color: #ff7700;font-weight:bold;">try</span>:
                cache<span style="color: black;">&#91;</span>jk<span style="color: black;">&#93;</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>jb<span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">except</span>:
                cache<span style="color: black;">&#91;</span>jk<span style="color: black;">&#93;</span> = <span style="color: #dc143c;">StringIO</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                cache<span style="color: black;">&#91;</span>jk<span style="color: black;">&#93;</span>.<span style="color: black;">write</span><span style="color: black;">&#40;</span>jb<span style="color: black;">&#41;</span>
&nbsp;
            cnt += <span style="color: #ff4500;">1</span>
            <span style="color: #008000;">self</span>._nums += <span style="color: #ff4500;">1</span>
&nbsp;
        <span style="color: #008000;">self</span>._flush<span style="color: black;">&#40;</span>cache<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">self</span>._nums
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> _flush<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, cache<span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">for</span> k,v <span style="color: #ff7700;font-weight:bold;">in</span> cache.<span style="color: black;">items</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
            v.<span style="color: black;">seek</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>
            s = v.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">try</span>:
                <span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>k<span style="color: black;">&#93;</span> += s
            <span style="color: #ff7700;font-weight:bold;">except</span>:
                <span style="color: #008000;">self</span>._data<span style="color: black;">&#91;</span>k<span style="color: black;">&#93;</span> = s
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> __name__ == <span style="color: #483d8b;">'__main__'</span>:
    db = <span style="color: #dc143c;">bsddb</span>.<span style="color: black;">btopen</span><span style="color: black;">&#40;</span><span style="color: #008000;">None</span>, cachesize = <span style="color: #ff4500;">268435456</span><span style="color: black;">&#41;</span>
    data = DictMatrix<span style="color: black;">&#40;</span>db<span style="color: black;">&#41;</span>
    data.<span style="color: black;">from_file</span><span style="color: black;">&#40;</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/path/to/log.txt'</span>, <span style="color: #483d8b;">'r'</span><span style="color: black;">&#41;</span>, <span style="color: #483d8b;">','</span><span style="color: black;">&#41;</span></pre></div></div>
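<p>The storage layout used by the class can be tried on its own. The following is a minimal sketch of it in Python 3, not the code benchmarked here: append_entry, read_slice and RECORD are illustrative names, and the "=if" format (fixed 8-byte records, no padding) is an assumption standing in for the native 'if' format used above.</p>

```python
import struct

# Each record is one C int (the other index) plus one C float (the value).
# "=" selects standard sizes with no padding, so a record is always 8 bytes.
# This fixed layout is an assumption of the sketch.
RECORD = struct.Struct("=if")


def append_entry(store, i, j, value):
    """Append (i, j, value) under both the row key and the column key."""
    for key, other in (("i%d" % i, j), ("j%d" % j, i)):
        store[key] = store.get(key, b"") + RECORD.pack(other, value)


def read_slice(store, key):
    """Decode the packed string stored under key back into a dict."""
    return dict(RECORD.iter_unpack(store.get(key, b"")))


store = {}
append_entry(store, 9527, 3306, 4.5)
append_entry(store, 9527, 8080, 2.0)
row = read_slice(store, "i9527")  # everything stored for i = 9527
col = read_slice(store, "j3306")  # everything stored for j = 3306
```

<p>Each entry costs 8 bytes per direction, which is where the memory saving over a nested dict comes from.</p>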
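<p>Test: importing 45 million rating records (int, int, float format) from a 922 MB text file. With an in-memory dict as storage, the build takes 12 minutes and uses 1.2 GB of memory. With the bdb storage from the sample code, the build takes 20 minutes and uses roughly 300 to 400 MB of memory, not much more than cachesize. Read test:</p>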
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">timeit</span>
<span style="color: #dc143c;">timeit</span>.<span style="color: black;">Timer</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'foo = __main__.data[9527, ...]'</span>, <span style="color: #483d8b;">'import __main__'</span><span style="color: black;">&#41;</span>.<span style="color: #dc143c;">timeit</span><span style="color: black;">&#40;</span>number = <span style="color: #ff4500;">1000</span><span style="color: black;">&#41;</span></pre></div></div>
<p>This takes 1.4788 seconds for 1000 reads, or roughly 1.5 ms per read.</p>
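<p>Another benefit of storing the data in a dict-like object is that you can freely swap in an in-memory dict or any other kind of DBM, even the legendary Tokyo Cabinet&#8230;</p>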
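<p>To illustrate the swap, here is a small Python 3 sketch in which one append helper works on both an in-memory dict and an on-disk store. The helper name append_bytes is hypothetical, and dbm.dumb is only a stand-in from the standard library, not the bsddb backend used in the test above.</p>

```python
import dbm.dumb
import os
import tempfile


def append_bytes(store, key, chunk):
    """Append chunk to store[key], creating the entry if it is missing."""
    try:
        store[key] = store[key] + chunk
    except KeyError:
        store[key] = chunk


# In-memory backend: a plain dict.
memory_store = {}
append_bytes(memory_store, "i9527", b"abcd")
append_bytes(memory_store, "i9527", b"efgh")

# On-disk backend: dbm.dumb is only a stand-in for bsddb or another DBM.
path = os.path.join(tempfile.mkdtemp(), "matrix")
disk_store = dbm.dumb.open(path, "c")
append_bytes(disk_store, "i9527", b"abcd")
append_bytes(disk_store, "i9527", b"efgh")
disk_value = disk_store["i9527"]
disk_store.close()
```

<p>Catching KeyError specifically, rather than using a bare except, keeps unrelated I/O errors from being silently treated as a missing key.</p>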
<p>OK, the code is written; time to call it a day.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.fuchaoqun.com/2010/03/python-sparse-matrix/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Optimizing from 140 Seconds to 2 Seconds</title>
		<link>http://www.fuchaoqun.com/2009/11/from-140s-to-2s/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=from-140s-to-2s</link>
		<comments>http://www.fuchaoqun.com/2009/11/from-140s-to-2s/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 10:50:58 +0000</pubDate>
		<dc:creator>超群.com</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[bitarray]]></category>
		<guid isPermaLink="false">http://www.fuchaoqun.com/?p=259</guid>
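		<description><![CDATA[The task: count the distinct values in a sample of 200 million numbers in the range 0 to 2,000,000,000. My first thought was a bloom filter, but since the values are all numbers a bloom filter seemed too heavyweight and bitarray looked more efficient. That gave the first version; part of the code: ba = bitarray&#40;212**4&#41; cnt = 0 for i in data: if &#40;not ba&#91;i&#93;&#41;: cnt += 1 ba&#91;i&#93; = True print cnt This takes about 140 s. The if (not ba[i]): check seemed expensive, so I wrote a second version: for i in data: ba&#91;i&#93; = True print ba.count&#40;&#41; That was somewhat faster, about 120 s. Next I started eyeing multiple cores and knocked together a homemade map-reduce. First, maper splits the data into 4 parts by remainder modulo 4: def maper&#40;data&#41;: map_data = &#40;array&#40;'I'&#41;,array&#40;'I'&#41;,array&#40;'I'&#41;,array&#40;'I'&#41;&#41; for i in data: m = i % 4 map_data&#91;m&#93;.append&#40;i&#41; return map_data Then I started a worker pool of 4 processes [...]]]></description>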
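			<content:encoded><![CDATA[<p>The task: count the distinct values in a sample of 200 million numbers in the range 0 to 2,000,000,000. My first thought was a bloom filter, but since the values are all numbers a bloom filter seemed too heavyweight and <a href="http://pypi.python.org/pypi/bitarray" target="_blank">bitarray</a> looked more efficient. That gave the first version; part of the code:</p>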
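<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">ba = bitarray<span style="color: black;">&#40;</span><span style="color: #ff4500;">212</span><span style="color: #66cc66;">**</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;"># bitarray(n) may be uninitialized, so clear every bit before counting</span>
ba.<span style="color: black;">setall</span><span style="color: black;">&#40;</span><span style="color: #008000;">False</span><span style="color: black;">&#41;</span>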
cnt = <span style="color: #ff4500;">0</span>
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:
    <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: black;">&#40;</span><span style="color: #ff7700;font-weight:bold;">not</span> ba<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>:
        cnt += <span style="color: #ff4500;">1</span>
        ba<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = <span style="color: #008000;">True</span>
<span style="color: #ff7700;font-weight:bold;">print</span> cnt</pre></div></div>
<p>This takes about 140 s. The if (not ba[i]): check seemed expensive, so I wrote a second version:</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:
    ba<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = <span style="color: #008000;">True</span>
<span style="color: #ff7700;font-weight:bold;">print</span> ba.<span style="color: black;">count</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
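<p>The idea behind this second version, setting one bit per value and counting the set bits at the end, can be sketched without the bitarray package. The following uses a plain bytearray as the bitmap; count_distinct and its upper_bound parameter are illustrative names, and this pure-Python loop is for understanding the technique, not for speed.</p>

```python
def count_distinct(values, upper_bound):
    """Count distinct ints in [0, upper_bound) using one bit per possible value."""
    bits = bytearray((upper_bound + 7) // 8)  # zero-initialized bitmap
    for v in values:
        bits[v // 8] |= 1 << (v % 8)  # set the bit for v; repeats change nothing
    # Count the set bits, byte by byte.
    return sum(bin(b).count("1") for b in bits)


sample = [3, 7, 3, 42, 7, 0, 99]
distinct = count_distinct(sample, 100)
```

<p>Because setting an already-set bit is harmless, no membership test is needed inside the loop.</p>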
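<p>That was somewhat faster, about 120 s. Next I started eyeing multiple cores and knocked together a homemade map-reduce. First, maper splits the data into 4 parts by remainder modulo 4:</p>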
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> maper<span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>:
    map_data = <span style="color: black;">&#40;</span><span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span>,<span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span>,<span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span>,<span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:
        m = i <span style="color: #66cc66;">%</span> <span style="color: #ff4500;">4</span>
        map_data<span style="color: black;">&#91;</span>m<span style="color: black;">&#93;</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> map_data</pre></div></div>
<p>Then I started a worker pool of 4 processes to compute each part separately, and combined the results at the end:</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> worker<span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>:
    counter = bitarray<span style="color: black;">&#40;</span><span style="color: #ff4500;">256</span><span style="color: #66cc66;">**</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:counter<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = <span style="color: #008000;">True</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> counter.<span style="color: black;">count</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
p = Pool<span style="color: black;">&#40;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
result = p.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>worker, data<span style="color: black;">&#41;</span></pre></div></div>
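<p>The speedup was substantial, down to about 50 s. The problem with this approach is that the data is traversed twice: once in the map step and once more in the reduce step. So I looked for a fix: split the file directly and process the pieces without a map step, then bitwise-OR the final results together and count (for this, each worker has to return its bitarray rather than a count):</p>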
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">p = Pool<span style="color: black;">&#40;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
result = p.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>worker, data<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: black;">&#40;</span>result<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> | result<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> | result<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span> | result<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>.<span style="color: black;">count</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
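<p>That brought it to about 26 s. Python is probably not very efficient at moving large amounts of data between processes, so there was little room left to optimize along this path. Then I remembered having used Python's scientific computing libraries for data mining and wondered whether they could help, which led to the NumPy version:</p>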
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> numpy <span style="color: #ff7700;font-weight:bold;">as</span> np
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>np.<span style="color: black;">unique</span><span style="color: black;">&#40;</span>np.<span style="color: black;">fromfile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/path/to/data.dat'</span>, np.<span style="color: black;">uint32</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></pre></div></div>
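<p>The whole program is just these two lines, and it runs in 12 s, which is enough to make you despair. Most of NumPy's internals are implemented in C. Profiling the code showed that NumPy uses a sort, which is somewhat wasteful here, so implementing part of the work in C myself should pay off. The earlier code has for i in data, and with 200 million items in data the loop body runs 200 million times. I tried moving the whole loop into C, and extended the bitarray package with a C function:</p>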
<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">static</span> PyObject <span style="color: #339933;">*</span>
bitarray_fromarray<span style="color: #009900;">&#40;</span>bitarrayobject <span style="color: #339933;">*</span>self<span style="color: #339933;">,</span> PyObject <span style="color: #339933;">*</span>pyo<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
    <span style="color: #993333;">unsigned</span> <span style="color: #993333;">int</span> <span style="color: #339933;">*</span>l<span style="color: #339933;">;</span>
    idx_t n1<span style="color: #339933;">;</span>
    Py_ssize_t nbytes<span style="color: #339933;">,</span> nitems<span style="color: #339933;">,</span> i<span style="color: #339933;">;</span>
    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>PyObject_AsReadBuffer<span style="color: #009900;">&#40;</span>pyo<span style="color: #339933;">,</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">**</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">&amp;</span>l<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>nbytes<span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>
        <span style="color: #b1b100;">return</span> NULL<span style="color: #339933;">;</span>
    nitems <span style="color: #339933;">=</span> nbytes<span style="color: #339933;">/</span><span style="color: #993333;">sizeof</span><span style="color: #009900;">&#40;</span><span style="color: #993333;">unsigned</span> <span style="color: #993333;">int</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i<span style="color: #339933;">=</span><span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i<span style="color: #339933;">&lt;</span>nitems<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span>self<span style="color: #339933;">-&gt;</span>ob_item <span style="color: #339933;">+</span> l<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">/</span> <span style="color: #0000dd;">8</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">|=</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #993333;">char</span><span style="color: #009900;">&#41;</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&lt;&lt;</span> <span style="color: #009900;">&#40;</span>l<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">%</span><span style="color:#800080;">8</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    n1 <span style="color: #339933;">=</span> count<span style="color: #009900;">&#40;</span>self<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">return</span> PyLong_FromLongLong<span style="color: #009900;">&#40;</span>n1<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
<p>This reads the file buffer straight into the bitarray, and the Python program becomes:</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> bitarray <span style="color: #ff7700;font-weight:bold;">import</span> bitarray
counter = bitarray<span style="color: black;">&#40;</span><span style="color: #ff4500;">212</span> <span style="color: #66cc66;">**</span> <span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
fp = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/path/to/data.datbk'</span>, <span style="color: #483d8b;">'rb'</span><span style="color: black;">&#41;</span>
un = counter.<span style="color: black;">fromarray</span><span style="color: black;">&#40;</span>fp.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">print</span> un</pre></div></div>
<p>Five lines of code in total, and the run time drops to under 2 seconds. Done.</p>
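<p>To make the bitmap idea behind the C loop easier to follow, here is a minimal pure-Python sketch of the same technique: set one bit per value, then count the set bits. It is written for Python 3 and is far slower than the C extension; the function name <code>count_distinct</code> and its <code>max_value</code> parameter are illustrative assumptions, not part of bitarray.</p>

```python
def count_distinct(values, max_value):
    """Count distinct non-negative integers using one bit per possible value.

    Hypothetical helper: `values` is any iterable of ints in [0, max_value].
    """
    # One byte holds the flags for 8 consecutive values.
    bits = bytearray(max_value // 8 + 1)
    for value in values:
        # Same addressing as the C loop: byte index value / 8, bit index value % 8.
        bits[value // 8] |= 1 << (value % 8)
    # The number of set bits is the number of distinct values seen.
    return sum(bin(byte).count("1") for byte in bits)


if __name__ == "__main__":
    print(count_distinct([3, 3, 9, 17, 9, 0], max_value=32))
```

<p>Duplicates cost nothing extra, because setting a bit that is already set leaves the byte unchanged.</p>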
]]></content:encoded>
			<wfw:commentRss>http://www.fuchaoqun.com/2009/11/from-140s-to-2s/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Converting Office Word files to HTML with Python</title>
		<link>http://www.fuchaoqun.com/2009/03/use-python-convert-word-to-html-with-win32com/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=use-python-convert-word-to-html-with-win32com</link>
		<comments>http://www.fuchaoqun.com/2009/03/use-python-convert-word-to-html-with-win32com/#comments</comments>
		<pubDate>Thu, 12 Mar 2009 05:57:38 +0000</pubDate>
		<dc:creator>超群.com</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[pywin32]]></category>
		<category><![CDATA[win32com]]></category>
		<guid isPermaLink="false">http://chaoqun.17348.com/?p=182</guid>
		<description><![CDATA[The test environment here is Windows XP, Office 2007, Python 2.5.2 and pywin32 build 213. The idea is to call the Office API directly through the win32com interface. The advantages are simplicity and good compatibility: anything Office can handle, Python can handle too, and the output is identical to what “Save As” in Word produces. #!/usr/bin/env python &#160; #coding=utf-8 &#160; from win32com import client as wc &#160; word = wc.Dispatch&#40;'Word.Application'&#41; &#160; doc = word.Documents.Open&#40;'d:/labs/math.doc'&#41; &#160; doc.SaveAs&#40;'d:/labs/math.html', 8&#41; &#160; doc.Close&#40;&#41; &#160; word.Quit&#40;&#41; The key line is doc.SaveAs(&#8216;d:/labs/math.html&#8217;, 8). Many articles online write it as doc.SaveAs(&#8216;d:/labs/math.html&#8217;, win32com.client.constants.wdFormatHTML), which fails immediately with: AttributeError: class Constants has no attribute &#8216;wdFormatHTML&#8217; Of course, the same code can convert a Word file to any other format that Office 2007 supports; to produce a PDF, for example, just change the 8 to 17. Below is the full table of file format constants supported by Office 2007: wdFormatDocument = 0 wdFormatDocument97 = 0 wdFormatDocumentDefault = [...]]]></description>
			<content:encoded><![CDATA[<p>The test environment here is Windows XP, Office 2007, Python 2.5.2 and pywin32 build 213. The idea is to call the Office API directly through the win32com interface. The advantages are simplicity and good compatibility: anything Office can handle, Python can handle too, and the output is identical to what “Save As” in Word produces.</p>
<blockquote>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span>
&nbsp;
<span style="color: #808080; font-style: italic;">#coding=utf-8</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">from</span> win32com <span style="color: #ff7700;font-weight:bold;">import</span> client <span style="color: #ff7700;font-weight:bold;">as</span> wc
&nbsp;
word = wc.<span style="color: black;">Dispatch</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'Word.Application'</span><span style="color: black;">&#41;</span>
&nbsp;
doc = word.<span style="color: black;">Documents</span>.<span style="color: black;">Open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'d:/labs/math.doc'</span><span style="color: black;">&#41;</span>
&nbsp;
doc.<span style="color: black;">SaveAs</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'d:/labs/math.html'</span>, <span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>
&nbsp;
doc.<span style="color: black;">Close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
word.<span style="color: black;">Quit</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
</blockquote>
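<p>The key line is doc.SaveAs(&#8216;d:/labs/math.html&#8217;, 8). Many articles online write it as doc.SaveAs(&#8216;d:/labs/math.html&#8217;, win32com.client.constants.wdFormatHTML), which fails immediately in this setup (presumably because the constants are only populated once the Word type library wrappers have been generated, which a plain Dispatch call does not guarantee):</p>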
<blockquote><p>AttributeError: class Constants has no attribute &#8216;wdFormatHTML&#8217;</p></blockquote>
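<p>Of course, the same code can convert a Word file to any other format that Office 2007 supports; to produce a PDF, for example, just change the 8 to 17. Below is the full table of file format constants supported by Office 2007:</p>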
<pre>wdFormatDocument                    =  0
wdFormatDocument97                  =  0
wdFormatDocumentDefault             = 16
wdFormatDOSText                     =  4
wdFormatDOSTextLineBreaks           =  5
wdFormatEncodedText                 =  7
wdFormatFilteredHTML                = 10
wdFormatFlatXML                     = 19
wdFormatFlatXMLMacroEnabled         = 20
wdFormatFlatXMLTemplate             = 21
wdFormatFlatXMLTemplateMacroEnabled = 22
wdFormatHTML                        =  8
wdFormatPDF                         = 17
wdFormatRTF                         =  6
wdFormatTemplate                    =  1
wdFormatTemplate97                  =  1
wdFormatText                        =  2
wdFormatTextLineBreaks              =  3
wdFormatUnicodeText                 =  7
wdFormatWebArchive                  =  9
wdFormatXML                         = 11
wdFormatXMLDocument                 = 12
wdFormatXMLDocumentMacroEnabled     = 13
wdFormatXMLTemplate                 = 14
wdFormatXMLTemplateMacroEnabled     = 15
wdFormatXPS                         = 18</pre>
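<p>The names should make it clear which file format each constant corresponds to; Office 2003 may not support all of them. For Word-to-HTML conversion there are two choices, wdFormatHTML and wdFormatFilteredHTML (values 8 and 10). With wdFormatHTML, formulas and other OLE objects in the Word file are saved as WMF images, whereas with wdFormatFilteredHTML the formula images are saved as GIF. The HTML generated with wdFormatFilteredHTML is also visibly much cleaner than the output of wdFormatHTML.</p>
<p>Of course, you can also call the Office API through COM from any other language, PHP for example.</p>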
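<p>Since the numeric codes are easy to mix up, one option is to look them up by output file extension. The sketch below (Python 3) is only an illustration: the helper name <code>save_format_for</code> and the choice of which extension maps to which format are assumptions, while the numbers themselves come from the table in this post.</p>

```python
import os

# WdSaveFormat values copied from the table in this post.
# The extension-to-format mapping itself is an illustrative assumption.
WD_FORMAT_BY_EXTENSION = {
    ".doc": 0,    # wdFormatDocument
    ".txt": 2,    # wdFormatText
    ".rtf": 6,    # wdFormatRTF
    ".mht": 9,    # wdFormatWebArchive
    ".htm": 10,   # wdFormatFilteredHTML
    ".html": 10,  # wdFormatFilteredHTML
    ".docx": 12,  # wdFormatXMLDocument
    ".pdf": 17,   # wdFormatPDF
    ".xps": 18,   # wdFormatXPS
}


def save_format_for(path):
    """Return the numeric save format matching the extension of `path`."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in WD_FORMAT_BY_EXTENSION:
        raise ValueError("no save format known for %r" % extension)
    return WD_FORMAT_BY_EXTENSION[extension]


if __name__ == "__main__":
    print(save_format_for("d:/labs/math.html"))
```

<p>With the script above this would be used as doc.SaveAs(target, save_format_for(target)), so changing the target file name is enough to switch formats.</p>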
]]></content:encoded>
			<wfw:commentRss>http://www.fuchaoqun.com/2009/03/use-python-convert-word-to-html-with-win32com/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Creating a thread pool in Python</title>
		<link>http://www.fuchaoqun.com/2008/12/python-thread-pool/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=python-thread-pool</link>
		<comments>http://www.fuchaoqun.com/2008/12/python-thread-pool/#comments</comments>
		<pubDate>Tue, 23 Dec 2008 07:43:22 +0000</pubDate>
		<dc:creator>超群.com</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[thread pool]]></category>
		<guid isPermaLink="false">http://chaoqun.17348.com/?p=115</guid>
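		<description><![CDATA[All original posts on this blog are licensed under Creative Commons Attribution-NonCommercial-ShareAlike; when reposting, please keep the link http://chaoqun.17348.com/2008/12/python-thread-pool/ I have recently been doing some data mining work that involves processing massive amounts of data. On current hardware, combining multiple processes with multiple threads cuts the running time considerably. I use Python, which is productive to develop in and not slow at run time either. For a mature thread pool implementation in Python, see http://www.chrisarndt.de/projects/threadpool/. Judging from the API documentation it looks fairly complete. What I want to present here is the principle behind a Python thread pool, together with a concise code example. I implement it with the Queue module. The queue is global: we put the tasks into the queue and start N threads; each thread takes a task from the queue, reports that it is done once it has finished, then goes back to the queue for the next task, until the queue has been emptied and the thread exits. That is how a thread pool generally works. Here is some actual code: import time import threading import Queue class Worker(threading.Thread):     def __init__(self, name, queue):         threading.Thread.__init__(self)         self.queue = queue         self.start()     def run(self): # Loop forever so the thread keeps picking up the next task         while True:             # Exit the thread when the queue is empty             if self.queue.empty():                 break             # Fetch one item             foo = self.queue.get()             # Sleep for 1s to simulate the real work             time.sleep(1)             [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>All original posts on this blog are licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/2.5/cn/" target="_blank"><span style="color: #356aa0;">Creative Commons Attribution-NonCommercial-ShareAlike</span></a>; when reposting, please keep the link <a href="http://chaoqun.17348.com/2008/12/python-thread-pool/">http://chaoqun.17348.com/2008/12/python-thread-pool/</a></p></blockquote>
<p>I have recently been doing some data mining work that involves processing massive amounts of data. On current hardware, combining multiple processes with multiple threads cuts the running time considerably. I use Python, which is productive to develop in and not slow at run time either.</p>
<p>For a mature thread pool implementation in Python, see <a href="http://www.chrisarndt.de/projects/threadpool/" target="_blank">http://www.chrisarndt.de/projects/threadpool/</a>. Judging from the API documentation it looks fairly complete. What I want to present here is the principle behind a Python thread pool, together with a concise code example.</p>
<p>I implement it with the Queue module. The queue is global: <strong>we put the tasks into the queue and start N threads; each thread takes a task from the queue, reports that it is done once it has finished, then goes back to the queue for the next task, until the queue has been emptied and the thread exits.</strong></p>
<p>That is how a thread pool generally works. Here is some actual code:</p>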
<blockquote>
<pre>import time
import threading
import Queue

class Worker(threading.Thread):
    def __init__(self, name, queue):
        # Pass the name on so that getName() returns it
        threading.Thread.__init__(self, name=name)
        self.queue = queue
        self.start()

    def run(self):
        # Loop forever so the thread keeps picking up the next task
        while True:
            # Fetch one item without blocking and exit the thread once the
            # queue is empty. (Checking empty() and then calling get() can
            # block forever if another thread takes the last item in between.)
            try:
                foo = self.queue.get_nowait()
            except Queue.Empty:
                break

            # Sleep for 1s to simulate the real work
            time.sleep(1)

            # Print the result
            print self.getName(), ':', foo

            # Tell the queue that this task is done
            self.queue.task_done()

# The queue
queue = Queue.Queue()

# Put 100 tasks into the queue
for i in range(100):
    queue.put(i)

# Start 10 threads
for i in range(10):
    threadName = 'Thread' + str(i)
    Worker(threadName, queue)

# Block until every task has been marked as done
queue.join()</pre>
</blockquote>
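<p>The mature implementations are all quite complex, which makes it hard to work out the underlying principle. I hope this short post helps clear the fog.</p>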
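<p>For readers on Python 3, where the module is called <code>queue</code>, the same principle can be sketched as a small function that also collects the results. This is an illustration rather than a full-featured pool; the names <code>run_pool</code>, <code>handler</code> and <code>num_workers</code> are assumptions made for this example.</p>

```python
import queue
import threading


def run_pool(tasks, handler, num_workers=4):
    """Run handler(task) for every task on num_workers threads.

    Returns a list of (task, result) pairs in completion order.
    """
    todo = queue.Queue()
    for task in tasks:
        todo.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            # Take the next task without blocking; stop when none are left.
            try:
                task = todo.get_nowait()
            except queue.Empty:
                return
            try:
                value = handler(task)
                with lock:
                    results.append((task, value))
            finally:
                # Always mark the task as done so that join() can return.
                todo.task_done()

    threads = [
        threading.Thread(target=worker, name="Thread%d" % i)
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    todo.join()           # wait until every task has been processed
    for thread in threads:
        thread.join()     # then wait for the workers themselves to exit
    return results


if __name__ == "__main__":
    squares = run_pool(range(10), lambda n: n * n, num_workers=3)
    print(sorted(squares))
```

<p>All tasks are queued before the workers start, so an empty queue really does mean there is nothing left to do. If tasks were added while the workers run, a sentinel value per worker would be a safer way to signal shutdown.</p>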
]]></content:encoded>
			<wfw:commentRss>http://www.fuchaoqun.com/2008/12/python-thread-pool/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Association rule mining with Orange</title>
		<link>http://www.fuchaoqun.com/2008/08/data-mining-with-python-orange-association_rule/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=data-mining-with-python-orange-association_rule</link>
		<comments>http://www.fuchaoqun.com/2008/08/data-mining-with-python-orange-association_rule/#comments</comments>
		<pubDate>Tue, 26 Aug 2008 09:26:38 +0000</pubDate>
		<dc:creator>超群.com</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[association]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[mining]]></category>
		<category><![CDATA[orange]]></category>
		<guid isPermaLink="false">http://chaoqun.17348.com/?p=54</guid>
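		<description><![CDATA[All original posts on this blog are licensed under Creative Commons Attribution-NonCommercial-ShareAlike; when reposting, please keep the link http://chaoqun.17348.com/2008/08/data-mining-with-python-orange-association_rule/ Recently, in a gap between projects, I spent some time tinkering with data mining. With help from colleagues, I ran a simple mining exercise on the listening history of Sina Music users, hoping to recommend other songs a user might like based on what they have listened to before. Orange: a modular C++ data mining package with a Python interface (apparently the only interface it provides); its website is http://www.ailab.si/orange/ Association analysis: what I use here is similar to market basket analysis. The song IDs each user listened to form one transaction. Readers unfamiliar with association analysis can look up some background material. Data preparation: after a simple cleanup of “dirty” data (logically impossible records, such as a user listening to 200 songs in 5 seconds), the data looks like this: 15615,355029,750367,762147,803787,805014,999712,999712,999712,1013641,1024215,1028429 871029,952779,962769 1023040,1024077,1024215,1025600 757946,873801,873801,873801 862257,873479 286056,286056,286056,286056,286056,286056,286056,286056,286056,286056 873801,873801,873801,873801,873801,947750,947750 473221,473537,504206,504206,504206,504206,504206,504206 947750,1005430,1005430 974748,1024215 873479,873479,873801,873801,947750,965748,999721,1024215,1024215,1024215,1024215,1024215 873801,873801,873801 Each line is one user's listening history, with no deduplication (the Orange examples do not deduplicate either; whether repeats increase a song's weight I do not know, as I have not read the Orange source). Note that the file name must have the .basket extension; the file path used in the program is d:/datamining/sample.basket. Program: # Import the orange package import orange &#160; # Load the data; note that the file extension is omitted data = orange.ExampleTable&#40;&#34;d:/datamining/sample&#34;&#41; &#160; # Mine association rules; arguments are minimum support, minimum confidence and maximum number of item sets rules = orange.AssociationRulesSparseInducer&#40;data, support = 0.5, confidence = 0.6, maxItemSets = 1000000&#41; &#160; # Print the rules for r in rules: print [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>All original posts on this blog are licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/2.5/cn/" target="_blank">Creative Commons Attribution-NonCommercial-ShareAlike</a>; when reposting, please keep the link <a href="http://chaoqun.17348.com/2008/08/data-mining-with-python-orange-association_rule/">http://chaoqun.17348.com/2008/08/data-mining-with-python-orange-association_rule/</a></p></blockquote>
<p>Recently, in a gap between projects, I spent some time tinkering with data mining. With help from colleagues, I ran a simple mining exercise on the listening history of Sina Music users, hoping to recommend other songs a user might like based on what they have listened to before.</p>
<p><strong>Orange</strong>:</p>
<p>A modular C++ data mining package with a Python interface (apparently the only interface it provides); its website is <a href="http://www.ailab.si/orange/" target="_blank">http://www.ailab.si/orange/</a></p>
<p><strong>Association analysis</strong>:</p>
<p>What I use here is similar to market basket analysis: the song IDs each user listened to form one transaction. Readers unfamiliar with association analysis can look up some background material.</p>
<p><strong>Data preparation</strong>:</p>
<p>After a simple cleanup of “dirty” data (logically impossible records, such as a user listening to 200 songs in 5 seconds), the data looks like this:</p>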
<blockquote><p>15615,355029,750367,762147,803787,805014,999712,999712,999712,1013641,1024215,1028429<br />
871029,952779,962769<br />
1023040,1024077,1024215,1025600<br />
757946,873801,873801,873801<br />
862257,873479<br />
286056,286056,286056,286056,286056,286056,286056,286056,286056,286056<br />
873801,873801,873801,873801,873801,947750,947750<br />
473221,473537,504206,504206,504206,504206,504206,504206<br />
947750,1005430,1005430<br />
974748,1024215<br />
873479,873479,873801,873801,947750,965748,999721,1024215,1024215,1024215,1024215,1024215<br />
873801,873801,873801</p></blockquote>
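<p>Each line is one user's listening history, with no deduplication (the Orange examples do not deduplicate either; whether repeats increase a song's weight I do not know, as I have not read the Orange source). Note that the file name must have the <strong>.basket</strong> extension; the file path used in the program is d:/datamining/sample.basket.</p>
<p><strong>Program</strong>:</p>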
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;"># Import the orange package</span>
<span style="color: #ff7700;font-weight:bold;">import</span> orange
&nbsp;
<span style="color: #808080; font-style: italic;"># Load the data; note that the file extension is omitted</span>
data = orange.<span style="color: black;">ExampleTable</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;d:/datamining/sample&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Mine association rules; arguments are minimum support, minimum confidence and maximum number of item sets</span>
rules = orange.<span style="color: black;">AssociationRulesSparseInducer</span><span style="color: black;">&#40;</span>data, support = <span style="color: #ff4500;">0.5</span>, confidence = <span style="color: #ff4500;">0.6</span>, maxItemSets = <span style="color: #ff4500;">1000000</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Print the rules</span>
<span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> rules:
    <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;%5.3f   %5.3f   %s&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>r.<span style="color: black;">support</span>, r.<span style="color: black;">confidence</span>, r<span style="color: black;">&#41;</span></pre></div></div>
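<p>Simple, isn't it? Orange implements the <a href="http://en.wikipedia.org/wiki/Apriori_algorithm" target="_blank">Apriori algorithm</a>. Because of the way Apriori works, memory is quickly exhausted once the data set gets very large; on my side, loading all the data needs far more than my laptop's 1.5 GB of RAM. The <a href="http://www.guwendong.cn/post/2008/fpgrowth_algorithm.html" target="_blank">FP-tree algorithm</a> is worth trying instead. Following the article <a href="http://203.72.2.115/Ejournal/3012000602.pdf" target="_blank">on improving FP-tree construction with SQL</a>, I have already extracted all the prefix paths of the FP-tree; contact me privately if you need them. Mining frequent item sets from the FP prefix paths requires recursion, which is very painful to do in SQL, so the rest of the algorithm still needs to be worked out.</p>
<p>That wraps up the association-rule-based mining for now, because the algorithm is computationally very expensive and the results are not that good. (With music, a user may listen to the same song many times, so the implied rating differs; association rules may work better on movie data, since a movie is usually watched at most once.) I am now looking into <a href="http://en.wikipedia.org/wiki/Collaborative_filtering" target="_blank">collaborative filtering</a>. If all goes well, an improved PHP+MySQL implementation of the <a href="http://en.wikipedia.org/wiki/Slope_One" target="_blank">slope one</a> algorithm will be ready in a few days, and I will open-source it then.</p>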
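<p>To make the support and confidence thresholds passed to the inducer more concrete, here is a minimal pure-Python sketch (Python 3) that computes both measures for single-item rules of the form A -> B. It is not how Orange works internally; the function name <code>pair_rules</code> is hypothetical, and the sketch assumes that repeated songs within one user's line are counted once.</p>

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.25, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for rules lhs -> rhs.

    support    = share of transactions containing both items
    confidence = share of transactions containing lhs that also contain rhs
    """
    # Assumption: duplicates inside one basket are ignored.
    transactions = [set(basket) for basket in baskets]
    total = len(transactions)

    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        item_counts.update(transaction)
        pair_counts.update(combinations(sorted(transaction), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / total
        if support >= min_support:
            # Each frequent pair yields two candidate rules, one per direction.
            for lhs, rhs in ((a, b), (b, a)):
                confidence = count / item_counts[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, rhs, support, confidence))
    return rules


if __name__ == "__main__":
    # Four of the sample lines from the data shown earlier in this post.
    sample = """757946,873801,873801,873801
873801,873801,873801,873801,873801,947750,947750
947750,1005430,1005430
873801,873801,873801"""
    baskets = [[int(x) for x in line.split(",")] for line in sample.splitlines()]
    for lhs, rhs, support, confidence in sorted(pair_rules(baskets)):
        print("%5.3f   %5.3f   %d -> %d" % (support, confidence, lhs, rhs))
```

<p>Counting every pair explicitly like this is exactly what becomes expensive on large data, which is the memory problem described above.</p>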
]]></content:encoded>
			<wfw:commentRss>http://www.fuchaoqun.com/2008/08/data-mining-with-python-orange-association_rule/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
