<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>
<channel>
	<title>超群.com's Blog &#187; bitarray</title>
	<atom:link href="http://www.fuchaoqun.com/tag/bitarray/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.fuchaoqun.com</link>
	<description></description>
	<lastBuildDate>Thu, 08 Sep 2011 15:08:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Optimizing from 140 Seconds Down to 2 Seconds</title>
		<link>http://www.fuchaoqun.com/2009/11/from-140s-to-2s/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=from-140s-to-2s</link>
		<comments>http://www.fuchaoqun.com/2009/11/from-140s-to-2s/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 10:50:58 +0000</pubDate>
		<dc:creator>超群.com</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[bitarray]]></category>
		<guid isPermaLink="false">http://www.fuchaoqun.com/?p=259</guid>
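		<description><![CDATA[The task: count the distinct values in a sample of 200 million numbers between 0 and 2,000,000,000. My first thought was a bloom filter, but since everything is a number, a bloom filter seemed too heavy and a bitarray might be more efficient. So the first version came out; part of the code: ba = bitarray&#40;212**4&#41; cnt = 0 for i in data: if &#40;not ba&#91;i&#93;&#41;: cnt += 1 ba&#91;i&#93; = True print cnt This took about 140s. I suspected that the if (not ba[i]): check was the expensive part, so I wrote a second version: for i in data: ba&#91;i&#93; = True print ba.count&#40;&#41; That was somewhat faster, about 120s. Then I started thinking about using multiple cores and knocked together a home-made map-reduce. First, maper splits the data into 4 parts by remainder modulo 4: def maper&#40;data&#41;: map_data = &#40;array&#40;'I'&#41;,array&#40;'I'&#41;,array&#40;'I'&#41;,array&#40;'I'&#41;&#41; for i in data: m = i % 4 map_data&#91;m&#93;.append&#40;i&#41; return map_data Then I started a 4-process worker [...]]]></description>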
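			<content:encoded><![CDATA[<p>The task: count the distinct values in a sample of 200 million numbers between 0 and 2,000,000,000. My first thought was a bloom filter, but since everything is a number, a bloom filter seemed too heavy and <a href="http://pypi.python.org/pypi/bitarray" target="_blank">bitarray</a> might be more efficient. So the first version came out; part of the code:</p>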
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">ba = bitarray<span style="color: black;">&#40;</span><span style="color: #ff4500;">212</span><span style="color: #66cc66;">**</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
cnt = <span style="color: #ff4500;">0</span>
<span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:
    <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: black;">&#40;</span><span style="color: #ff7700;font-weight:bold;">not</span> ba<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>:
        cnt += <span style="color: #ff4500;">1</span>
        ba<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = <span style="color: #008000;">True</span>
<span style="color: #ff7700;font-weight:bold;">print</span> cnt</pre></div></div>
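<p>For reference, the size 212**4 used above evaluates to 2,019,963,136, just above the 2,000,000,000 upper bound of the input range, so every possible value gets its own bit. The arithmetic below is a quick check of that figure and of the memory a bit array of that size needs at one bit per value. The names are illustrative and not from the original program, and the snippet is written for Python 3.</p>

```python
# Illustrative arithmetic only; names are not from the original program.
MAX_VALUE = 2000000000        # upper bound of the input range
NUM_BITS = 212 ** 4           # size passed to bitarray() in the post

bytes_needed = NUM_BITS // 8  # one bit per possible value
mib_needed = bytes_needed / (1024 * 1024)

print(NUM_BITS)               # 2019963136
print(NUM_BITS - MAX_VALUE)   # 19963136 spare bits above the range
print(round(mib_needed, 1))   # 240.8 (MiB)
```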
<p>This took about 140s. I suspected that the if (not ba[i]): check was the expensive part, so I wrote a second version:</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:
    ba<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = <span style="color: #008000;">True</span>
<span style="color: #ff7700;font-weight:bold;">print</span> ba.<span style="color: black;">count</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
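<p>The second version boils down to "set every bit, then count the set bits once". The sketch below shows that idea with the standard library only, using a bytearray in place of the bitarray package so it runs without extra dependencies. count_distinct and its parameters are hypothetical names; the code is written for Python 3 and will be far slower than bitarray.</p>

```python
# Hypothetical helper, standard library only (Python 3).
BIT_MASKS = (1, 2, 4, 8, 16, 32, 64, 128)  # mask for bit k of a byte


def count_distinct(values, num_bits):
    """Count distinct integers in [0, num_bits) using a bitmap."""
    bitmap = bytearray((num_bits + 7) // 8)
    for v in values:
        # Setting an already-set bit is harmless, so no test is needed.
        bitmap[v // 8] |= BIT_MASKS[v % 8]
    # Count the set bits once at the end, like ba.count().
    return sum(bin(byte).count("1") for byte in bitmap)


print(count_distinct([3, 7, 3, 0, 7, 15], 16))  # 4
```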
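<p>That was somewhat faster, about 120s. Then I started thinking about using multiple cores and knocked together a home-made map-reduce. First, maper splits the data into 4 parts by remainder modulo 4:</p>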
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> maper<span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>:
    map_data = <span style="color: black;">&#40;</span><span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span>,<span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span>,<span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span>,<span style="color: #dc143c;">array</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'I'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:
        m = i <span style="color: #66cc66;">%</span> <span style="color: #ff4500;">4</span>
        map_data<span style="color: black;">&#91;</span>m<span style="color: black;">&#93;</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> map_data</pre></div></div>
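<p>Splitting by remainder works because equal values always land in the same part, so the parts share no values and their distinct counts can simply be added. A small check of that property, using plain lists and sets instead of array and bitarray (partition_by_remainder is a hypothetical name; Python 3):</p>

```python
# Hypothetical illustration of why remainder-based partitioning is safe.
def partition_by_remainder(values, parts=4):
    """Split values into `parts` lists keyed by value % parts."""
    buckets = [[] for _ in range(parts)]
    for v in values:
        buckets[v % parts].append(v)
    return buckets


data = [5, 8, 5, 13, 2, 8, 21, 2, 40]
buckets = partition_by_remainder(data)

# Equal values always share a bucket, so per-bucket distinct counts add up.
per_bucket = [len(set(b)) for b in buckets]
print(per_bucket)                         # [2, 3, 1, 0]
print(sum(per_bucket) == len(set(data)))  # True
```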
<p>Then I started a worker pool of 4 processes to compute each part separately, and combined the results at the end:</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">def</span> worker<span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>:
    counter = bitarray<span style="color: black;">&#40;</span><span style="color: #ff4500;">256</span><span style="color: #66cc66;">**</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> data:counter<span style="color: black;">&#91;</span>i<span style="color: black;">&#93;</span> = <span style="color: #008000;">True</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> counter.<span style="color: black;">count</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
p = Pool<span style="color: black;">&#40;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
result = p.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>worker, data<span style="color: black;">&#41;</span></pre></div></div>
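<p>The speedup was noticeable, down to about 50s. The problem with this approach is that the data is traversed twice: once in map and once more in reduce. So I looked for a way around that: split the file directly for computation with no map step, then bitwise-OR the results together and count:</p>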
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">p = Pool<span style="color: black;">&#40;</span><span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
result = p.<span style="color: #008000;">map</span><span style="color: black;">&#40;</span>worker, data<span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: black;">&#40;</span>result<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> | result<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> | result<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span> | result<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>.<span style="color: black;">count</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>
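<p>When the input is split into arbitrary chunks rather than by remainder, the same value can appear in several chunks, so adding per-chunk counts would overcount. Merging the per-chunk bitmaps with a bitwise OR before counting avoids that. The sketch below shows the merge with standard-library bytearrays and, for brevity, processes the chunks one after another instead of in a process pool. All names are hypothetical (Python 3).</p>

```python
# Hypothetical sketch: chunks are handled sequentially here, not in a Pool.
BIT_MASKS = (1, 2, 4, 8, 16, 32, 64, 128)


def build_bitmap(chunk, num_bits):
    """Return a bytearray with one bit set per value seen in chunk."""
    bitmap = bytearray((num_bits + 7) // 8)
    for v in chunk:
        bitmap[v // 8] |= BIT_MASKS[v % 8]
    return bitmap


def merge_and_count(bitmaps):
    """OR the bitmaps together byte by byte, then count the set bits."""
    merged = bytearray(len(bitmaps[0]))
    for bm in bitmaps:
        for idx, byte in enumerate(bm):
            merged[idx] |= byte
    return sum(bin(byte).count("1") for byte in merged)


chunks = [[1, 2, 3], [3, 4], [1, 9], [9, 9, 30]]
bitmaps = [build_bitmap(c, 32) for c in chunks]
print(merge_and_count(bitmaps))  # 6
```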
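<p>That got it to about 26s. Python is probably not very efficient at exchanging large amounts of data between processes, so there was little room left to optimize this way. I remembered having used Python's scientific computing library for data mining before and wondered whether it could help, which led to the NumPy version:</p>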
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> numpy <span style="color: #ff7700;font-weight:bold;">as</span> np
<span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>np.<span style="color: black;">unique</span><span style="color: black;">&#40;</span>np.<span style="color: black;">fromfile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/path/to/data.dat'</span>, np.<span style="color: black;">uint32</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span></pre></div></div>
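<p>A sort-based approach to counting distinct values, which is how np.unique has traditionally handled a flat array, sorts the values and then counts the positions where a value differs from its predecessor. That costs O(n log n), compared with the single O(n) pass a bitmap needs. A pure-Python sketch of the idea (count_unique_sorted is a hypothetical name, not NumPy's code; Python 3):</p>

```python
# Hypothetical pure-Python rendering of sort-based distinct counting.
def count_unique_sorted(values):
    """Sort, then count positions where a value differs from the previous."""
    ordered = sorted(values)              # the O(n log n) step
    if not ordered:
        return 0
    changes = sum(1 for prev, cur in zip(ordered, ordered[1:]) if prev != cur)
    return changes + 1                    # the first element is always new


print(count_unique_sorted([7, 1, 7, 3, 1, 1, 9]))  # 4
```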
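<p>The whole program is just these two lines, and it ran in 12s, which was maddening. Most of NumPy's internals are implemented in C. I profiled the code and found that NumPy used a sort, which is somewhat wasteful. If I implemented part of the work in C myself, the result should be good. I noticed the for i in data in my code: data holds 200 million entries, so the loop body runs 200 million times. I tried moving that whole loop into C, using a C-level loop, and so I extended the bitarray package in C:</p>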
<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;"><span style="color: #993333;">static</span> PyObject <span style="color: #339933;">*</span>
bitarray_fromarray<span style="color: #009900;">&#40;</span>bitarrayobject <span style="color: #339933;">*</span>self<span style="color: #339933;">,</span> PyObject <span style="color: #339933;">*</span>pyo<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
    <span style="color: #993333;">unsigned</span> <span style="color: #993333;">int</span> <span style="color: #339933;">*</span>l<span style="color: #339933;">;</span>
    idx_t n1<span style="color: #339933;">;</span>
    Py_ssize_t nbytes<span style="color: #339933;">,</span> nitems<span style="color: #339933;">,</span> i<span style="color: #339933;">;</span>
    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>PyObject_AsReadBuffer<span style="color: #009900;">&#40;</span>pyo<span style="color: #339933;">,</span> <span style="color: #009900;">&#40;</span><span style="color: #993333;">const</span> <span style="color: #993333;">void</span> <span style="color: #339933;">**</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">&amp;</span>l<span style="color: #339933;">,</span> <span style="color: #339933;">&amp;</span>nbytes<span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span>
        <span style="color: #b1b100;">return</span> NULL<span style="color: #339933;">;</span>
    nitems <span style="color: #339933;">=</span> nbytes<span style="color: #339933;">/</span><span style="color: #993333;">sizeof</span><span style="color: #009900;">&#40;</span><span style="color: #993333;">unsigned</span> <span style="color: #993333;">int</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i<span style="color: #339933;">=</span><span style="color: #0000dd;">0</span><span style="color: #339933;">;</span> i<span style="color: #339933;">&lt;</span>nitems<span style="color: #339933;">;</span> i<span style="color: #339933;">++</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span>self<span style="color: #339933;">-&gt;</span>ob_item <span style="color: #339933;">+</span> l<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">/</span> <span style="color: #0000dd;">8</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">|=</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #993333;">char</span><span style="color: #009900;">&#41;</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&lt;&lt;</span> <span style="color: #009900;">&#40;</span>l<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">%</span><span style="color:#800080;">8</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
    n1 <span style="color: #339933;">=</span> count<span style="color: #009900;">&#40;</span>self<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">return</span> PyLong_FromLongLong<span style="color: #009900;">&#40;</span>n1<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
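<p>In Python terms, the C loop above treats the buffer as a packed sequence of native unsigned ints and, for each value v, sets bit v % 8 of the byte at index v / 8 in the bitarray's storage. The sketch below mirrors that logic with the standard library only. set_bits_from_buffer is a hypothetical name, it adds a bounds check that the C code does not have, and it assumes the data was written with the same native unsigned int layout it is read with (Python 3).</p>

```python
from array import array

# Hypothetical Python mirror of the C loop; not part of the bitarray package.
BIT_MASKS = (1, 2, 4, 8, 16, 32, 64, 128)


def set_bits_from_buffer(bitmap, raw):
    """Set one bit in bitmap per native unsigned int packed in raw."""
    values = array("I")
    item_size = values.itemsize
    # Ignore trailing bytes that do not form a whole item, as nbytes/sizeof does.
    usable = len(raw) - len(raw) % item_size
    values.frombytes(raw[:usable])
    for v in values:
        if v // 8 >= len(bitmap):
            raise ValueError("value %d does not fit in the bitmap" % v)
        bitmap[v // 8] |= BIT_MASKS[v % 8]
    # Return the number of set bits, as the C function does via count().
    return sum(bin(byte).count("1") for byte in bitmap)


raw = array("I", [5, 1, 5, 12, 1]).tobytes()
print(set_bits_from_buffer(bytearray(2), raw))  # 3
```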
<p>This reads the file buffer directly into the bitarray, and the Python program becomes:</p>
<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> bitarray <span style="color: #ff7700;font-weight:bold;">import</span> bitarray
counter = bitarray<span style="color: black;">&#40;</span><span style="color: #ff4500;">212</span> <span style="color: #66cc66;">**</span> <span style="color: #ff4500;">4</span><span style="color: black;">&#41;</span>
fp = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/path/to/data.datbk'</span>, <span style="color: #483d8b;">'rb'</span><span style="color: black;">&#41;</span>
un = counter.<span style="color: black;">fromarray</span><span style="color: black;">&#40;</span>fp.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #ff7700;font-weight:bold;">print</span> un</pre></div></div>
<p>Five lines of code in total, running in under 2s. Done.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.fuchaoqun.com/2009/11/from-140s-to-2s/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>
