<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Metamarkets</title>
	<atom:link href="http://metamarkets.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://metamarkets.com</link>
	<description>Fast Insight for Big Data</description>
	<lastBuildDate>Tue, 21 Feb 2012 23:45:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>Analyzing the Wikipedia SOPA Blackout</title>
		<link>http://metamarkets.com/2012/analyzing-the-wikipedia-sopa-blackout/</link>
		<comments>http://metamarkets.com/2012/analyzing-the-wikipedia-sopa-blackout/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 18:37:51 +0000</pubDate>
		<dc:creator>repass</dc:creator>
				<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[fun]]></category>

		<guid isPermaLink="false">http://metamarkets.com/?p=804</guid>
		<description><![CDATA[Here at Metamarkets, we help our customers quickly make sense of big data sets. To put our solution to the test, I thought it would be interesting to analyze events surrounding the Wikipedia blackout on January 18th in protest of SOPA. &#8230; <a href="http://metamarkets.com/2012/analyzing-the-wikipedia-sopa-blackout/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Here at Metamarkets, we help our customers quickly make sense of big data sets. To put our solution to the test, I thought it would be interesting to analyze events surrounding the Wikipedia blackout on January 18th in protest of <a title="Stop Online Piracy Act" href="http://en.wikipedia.org/wiki/Stop_Online_Piracy_Act" target="_blank">SOPA</a>.</p>
<p>Let's look at edit activity before and after the blackout:</p>
<div><span id="more-804"></span></div>
<div><a href="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia1.png"><img class="aligncenter size-full wp-image-813" title="SOPA - edits" src="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia1.png" alt="" width="640" height="708" /></a></div>
<p>As expected, edits dropped off significantly on January 18th, but they did not fall all the way to zero. This is because the blackout only affected English (en) articles.</p>
<p>Edits in other languages were not affected and stayed fairly consistent, as you can see below:</p>
<div><a href="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia2.png"><img class="aligncenter size-full wp-image-814" title="SOPA - languages" src="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia2.png" alt="" width="873" height="772" /></a></div>
<p>Next, I was curious how different geographies would come back online after the blackout, so I looked at edits by city:</p>
<div><a href="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia3.png"><img class="aligncenter size-full wp-image-815" title="SOPA - cities" src="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia3.png" alt="" width="853" height="755" /></a></div>
<p>Sure enough, London and Bangalore were most active once the English-language articles were available, followed by cities like New York.</p>
<div>Finally, I wanted to investigate which articles were most heavily edited once the blackout ended:</div>
<div><a href="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia4.png"><img class="aligncenter size-full wp-image-816" title="SOPA - articles" src="http://metamarkets.com/wp-content/uploads/2012/01/wikipedia4.png" alt="" width="957" height="795" /></a></div>
<p>It makes sense that recent news articles such as the Costa Concordia disaster as well as events surrounding SOPA would be the first to get updated.</p>
<p>&nbsp;</p>
<p><strong>So why is this interesting?</strong></p>
<p>First, I did this analysis (setup, data loading, processing, exploration, etc.) all in one afternoon using Metamarkets.</p>
<p>Second, I wanted very granular access to the metrics to truly understand what was going on. In this case, I viewed data at the hourly level, but we have some customers that analyze data in minute increments given the dynamic nature of their businesses.</p>
<p>Third, I was able to quickly pivot and re-orient my analysis based on the specific questions I had at hand. With Metamarkets, I can arbitrarily slice, dice, and drill into data without being constrained by pre-defined navigation paths.</p>
<p>Finally, with live data feeds, I can immediately analyze new data without waiting for pre-processing or re-calculation (not apparent from the screenshots above).</p>
<p><strong>If you would like to play around with this demo yourself, you can sign up <a title="Metamarkets Wikipedia Demo" href="https://dash.metamx.com/wikipedia_editstream/signup" target="_blank">here</a>.</strong></p>
<p>While Metamarkets is initially focused on analyzing online advertising events, we think that our solution is broadly applicable to other industries and problem areas. We would love to hear your thoughts.</p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2012/analyzing-the-wikipedia-sopa-blackout/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Munging, Modeling and Visualizing Data with R</title>
		<link>http://metamarkets.com/2012/munging-and-visualizing-data-with-r/</link>
		<comments>http://metamarkets.com/2012/munging-and-visualizing-data-with-r/#comments</comments>
		<pubDate>Thu, 26 Jan 2012 22:29:32 +0000</pubDate>
		<dc:creator>Xavier Léauté</dc:creator>
				<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://metamarkets.com/?p=778</guid>
		<description><![CDATA[Yesterday evening Romy Misra from visual.ly invited us to teach an introductory workshop to R for the San Francisco Data Mining meetup. Todd Holloway was kind enough to host the event at Trulia headquarters. R can be a little daunting &#8230; <a href="http://metamarkets.com/2012/munging-and-visualizing-data-with-r/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;" dir="ltr">Yesterday evening Romy Misra from <a href="http://visual.ly/">visual.ly</a> invited us to teach an introductory workshop to <a href="http://www.r-project.org/">R</a> for the San Francisco Data Mining meetup. Todd Holloway was kind enough to host the event at <a href="http://www.trulia.com/">Trulia</a> headquarters.</p>
<p style="text-align: justify;">R can be a little daunting for beginners, so I wanted to give everyone a quick overview of its capabilities and enough material to get people started. Most importantly, the objective of this interactive session was to give everyone some time to try out some simple examples that would be useful in the future.<span id="more-778"></span></p>
<p style="text-align: justify;">I hope everyone enjoyed learning some fun and easy ways to slice, model and visualize data, and that I piqued their interest enough to start exploring datasets on their own.</p>
<p style="text-align: justify;">Thanks again to Romy and Todd for organizing, as well as Trulia and the O’Reilly Strata Conference for sponsoring this event.</p>
<p style="text-align: justify;">As promised, I am posting the <a href="http://speakerdeck.com/u/metamx/p/r-workshop-for-beginners">slides</a> and the <a href="http://metamx-mdriscol-adhoc.s3.amazonaws.com/gameday/README.R">sample code</a> below, so that everyone can try the examples for themselves.</p>
<p><a href="http://metamx-mdriscol-adhoc.s3.amazonaws.com/gameday/README.R">http://metamx-mdriscol-adhoc.s3.amazonaws.com/gameday/README.R</a></p>
<p>&nbsp;</p>
<p><script src="http://speakerdeck.com/embed/4f21f96821e6f80022010797.js"></script></p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2012/munging-and-visualizing-data-with-r/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Scaling the Druid Data Store</title>
		<link>http://metamarkets.com/2012/scaling-druid/</link>
		<comments>http://metamarkets.com/2012/scaling-druid/#comments</comments>
		<pubDate>Thu, 19 Jan 2012 19:28:33 +0000</pubDate>
		<dc:creator>Eric Tschetter</dc:creator>
				<category><![CDATA[Druid]]></category>

		<guid isPermaLink="false">http://metamarkets.com/?p=737</guid>
		<description><![CDATA["Give me a lever long enough... and I shall move the world" &#8212; Archimedes Parallelism is computing’s leverage, a force multiplier acting against the weight of big data.  Cloud-hosted, horizontally scalable systems have the power to move even planetary sized &#8230; <a href="http://metamarkets.com/2012/scaling-druid/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<blockquote><p><em>"Give me a lever long enough... and I shall move the world"</em> &mdash; Archimedes
</p></blockquote>
<p>Parallelism is computing’s leverage, a force multiplier acting against the weight of big data.  Cloud-hosted, horizontally scalable systems have the power to move even planetary sized data sets with speed.</p>
<p>This blog post discusses our efforts to lift one such data set, achieving a <strong>scan rate of 26 billion records per second</strong>, with our distributed, in-memory data store called <a href="http://metamarkets.com/2011/druid-part-deux-three-principles-for-fast-distributed-olap/">Druid</a>.  Our main conclusions are:</p>
<ol>
<li>Horizontally-scalable architectures are an ideal fit for the Cloud</li>
<li>Our data store’s performance scales up well to a 6TB in-memory cluster and degrades gracefully under memory pressure</li>
<li>The flexibility of a Cloud environment enables pain-free tuning of cost versus performance</li>
</ol>
<p>Benchmarking our infrastructure against a big data set in the wild provides validation of the power achievable on a Cloud computing fabric of commodity hardware.</p>
<p>For those who are curious as to what our infrastructure powers, Metamarkets offers a SaaS analytics solution to gaming, social, and digital media firms.  A public example is <a href="http://dash.metamx.com/wikipedia_editstream/signup/">our dashboard for exploring Wikipedia edits</a>.</p>
<p><strong>I) The Data</strong></p>
<p>We began our experiment with 6TB of uncompressed data, representing tens of billions of fact rows, which we aimed to host and make fully explorable through our dashboard.  By way of comparison, the Wikipedia edit feed we host consists of 6GB of uncompressed data, representing ~36 million fact rows.<br />
<span id="more-737"></span><br />
The first hurdle to overcome with a data set of this scale is co-locating the data with the compute power.  Most of the trillions of events we’ve analyzed on our platform have been delivered over months of parallel, continuous feeds.  In rare cases, we have had to transform the data locally and sneaker-net the disks to our data center.  Pushing terabytes over a standard office uplink can take weeks.</p>
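<p>To get a feel for why shipping disks can beat the network, here is a back-of-the-envelope sketch.  The uplink speeds are hypothetical and protocol overhead is ignored; only the 6TB figure comes from this post.</p>

```python
def transfer_days(size_bytes: float, uplink_bits_per_sec: float) -> float:
    """Days needed to push size_bytes over a link, ignoring protocol overhead."""
    seconds = size_bytes * 8 / uplink_bits_per_sec
    return seconds / 86_400  # seconds per day


# 6 TB is the data set size quoted in this post; the link speeds below are
# assumptions chosen only for illustration.
SIX_TB = 6e12

if __name__ == "__main__":
    for mbps in (10, 50, 100):
        days = transfer_days(SIX_TB, mbps * 1e6)
        print(f"{mbps:>4} Mbit/s uplink: {days:.1f} days")
```

<p>A sustained 10 Mbit/s uplink lands in the "weeks" range, and real transfers rarely sustain the nominal line rate.</p>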
<p>Once on the cloud, we performed some cardinality analysis to make sure we understood the parameters of the data.  There were more than a dozen dimensions, with cardinalities ranging from tens of millions, to hundreds of thousands, all the way down to tens.  This kind of Zipfian distribution in cardinalities is common in naturally occurring data.  We then computed four metrics for each row (consisting of counts, sums, and averages) and loaded the data up into Druid.</p>
<p>We sharded the data into chunks and then sub-sharded those chunks by the dimension with cardinality &gt;&gt; 1M, creating thousands of shards of roughly 8M fact rows apiece.</p>
<p><strong>II) The Cluster</strong></p>
<p>We then spun up a cluster of compute nodes to load the data up and keep it in memory for querying.  The cluster consisted of 100 nodes, each with 16 cores, 60GB of RAM, 10 GigE ethernet, and 1TB of disk space.  So, collectively the cluster comprised 1600 cores, 6TB of RAM, 10 GigE networking and more than enough disk space.</p>
<p>With this first cluster, we were successful in delivering an interactive experience on our front-end dashboard, scanning billions of records per second, as the benchmarks below attest.</p>
<p>During our testing, we also reconfigured the cluster in several ways, switching from purely in-memory storage to memory mapping and scaling back the number of servers to see how performance degraded as we changed the ratio of data served to available RAM.</p>
<p><strong>III) The Benchmarks</strong></p>
<p>First, we’ll provide some benchmarks for our 100-node configuration on simple aggregation queries.  SQL is included to describe what the query is doing.</p>
<p><small><code>"Select count(*) from _table_ where timestamp &gt;= ? and timestamp &lt; ?"</p>
<p>cluster                                            cluster scan rate (rows/sec)           core scan rate<br />
15-core, 100 nodes, in-memory                      26,610,386,635                         17,740,258<br />
15-core,  75 nodes, mmap                           25,224,873,928                         22,422,110<br />
15-core,  50 nodes, mmap                           20,387,152,160                         27,182,870<br />
15-core,  25 nodes, mmap                           11,910,388,894                         31,761,037<br />
4-core,   131 nodes, in-memory                      10,008,730,163                         19,100,630<br />
4-core,   131 nodes, mmap                            10,129,695,120                         19,331,479<br />
4-core,    50 nodes, mmap                            6,626,570,688                         33,132,853</p>
<p></small></code></p>
<p><small>
<ul>
<li><em>The timestamp range encompasses all data.</em></li>
<li><em>15-core is a 16-core machine with 60GB RAM and 1TB of local disk.  The machine was configured to only use 15 threads for processing queries.</em></li>
<li><em>4-core is a 4-core machine with 32GB RAM and 1TB of local disk.</em></li>
<li><em>"in-memory" means that the machine was configured to load all data up into the Java heap and have it available for querying</em></li>
<li><em>"mmap" means that the machine was configured to mmap the data instead of loading it into the Java heap</em></li>
</ul>
<p></small></p>
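<p>The "core scan rate" column is not defined explicitly, but the published numbers are consistent with dividing the cluster scan rate by the number of nodes times the query threads per node (15 for the 16-core machines, 4 for the 4-core machines).  The sketch below checks that reading against three rows of the table above; the formula is our inference, not a stated definition.</p>

```python
def core_scan_rate(cluster_rows_per_sec: int, nodes: int, threads_per_node: int) -> int:
    """Rows scanned per second per query thread, rounded to the nearest row."""
    return round(cluster_rows_per_sec / (nodes * threads_per_node))


# (label, cluster scan rate, nodes, query threads per node, published core scan rate)
# All figures are copied from the count(*) benchmark table.
ROWS = [
    ("15-core, 100 nodes, in-memory", 26_610_386_635, 100, 15, 17_740_258),
    ("15-core,  25 nodes, mmap", 11_910_388_894, 25, 15, 31_761_037),
    ("4-core,  131 nodes, in-memory", 10_008_730_163, 131, 4, 19_100_630),
]

if __name__ == "__main__":
    for label, cluster_rate, nodes, threads, published in ROWS:
        derived = core_scan_rate(cluster_rate, nodes, threads)
        print(f"{label}: derived={derived:,} published={published:,}")
```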
<p><small><code><br />
"Select count(*), sum(metric1) from _table_ where timestamp &gt;= ? and timestamp &lt; ?"</p>
<p>cluster                                            cluster scan rate (rows/sec)           core scan rate<br />
15-core, 100 nodes, in-memory                      16,223,081,703                         10,815,388<br />
15-core,  75 nodes, mmap                            9,860,968,285                          8,765,305<br />
15-core,  50 nodes, mmap                            8,093,611,909                         10,791,483<br />
15-core,  25 nodes, mmap                            4,126,502,352                         11,004,006<br />
4-core, 131 nodes, in-memory                        5,755,274,389                         10,983,348<br />
4-core, 131 nodes, mmap                             5,032,185,657                          9,603,408<br />
4-core, 50 nodes, mmap                              1,720,238,609                          8,601,193<br />
</small></code></p>
<p><small><code><br />
"Select count(*), sum(metric1), sum(metric2), sum(metric3), sum(metric4) from _table_ where timestamp &gt;= ? and timestamp &lt; ?"</p>
<p>cluster                                            cluster scan rate (rows/sec)           core scan rate<br />
15-core, 100 nodes, in-memory                       7,591,604,822                          5,061,070<br />
15-core,  75 nodes, mmap                            4,319,179,995                          3,839,271<br />
15-core,  50 nodes, mmap                            3,406,554,102                          4,542,072<br />
15-core,  25 nodes, mmap                            1,826,451,888                          4,870,538<br />
4-core,  131 nodes, in-memory                       1,936,648,601                          3,695,894<br />
4-core,  131 nodes, mmap                            2,210,367,152                          4,218,258<br />
4-core,   50 nodes, mmap                            1,002,291,562                          5,011,458<br />
</small></code></p>
<p>The first query is just a count, and it gives the best performance we see from the system: scan rates of up to 33M rows/second/core.  At first glance it looks like fewer nodes might actually be outperforming more nodes in the rows/sec/core metric, but that's just because 100 nodes is overprovisioned for the data set.  Druid's concurrency model is based on shards: one thread scans one shard.  If a node has 15 cores, for example, and handles a query that requires scanning 16 shards, and we assume each shard takes 1 second to process, the total time to finish the query will be 2 seconds (1 second for the first 15 shards and 1 second for the 16th shard).  This decreases the global scan rate because a number of cores sit idle during the second pass.</p>
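<p>The 16-shards-on-15-cores example can be sketched as a simple "waves" calculation.  This is a toy model that assumes every shard takes the same time and each thread scans exactly one shard at a time; the function names are ours and this is not Druid's scheduler.</p>

```python
import math


def query_seconds(shards: int, threads: int, seconds_per_shard: float = 1.0) -> float:
    """Wall-clock time when each thread scans one shard at a time."""
    waves = math.ceil(shards / threads)  # the last wave may be only partly full
    return waves * seconds_per_shard


def utilization(shards: int, threads: int) -> float:
    """Fraction of thread-time spent scanning rather than idling."""
    waves = math.ceil(shards / threads)
    return shards / (waves * threads)


if __name__ == "__main__":
    for shards in (15, 16, 30):
        print(
            f"{shards} shards on 15 threads: "
            f"{query_seconds(shards, 15):.0f}s, "
            f"utilization {utilization(shards, 15):.0%}"
        )
```

<p>With 16 shards the query takes two waves and the threads are busy only about half the time, which is why per-core scan rates look lower on an overprovisioned cluster.</p>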
<p>As we move on to include more aggregations we see performance degrade.  This is because of the column-oriented storage format Druid employs.  For the count(*) queries, it only has to check the timestamp column to satisfy the where clause.  As we add metrics, it has to also load those metric values and scan over them, increasing the amount of memory scanned.  Next, we'll do a top 100 query on our high cardinality dimension:</p>
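<p>A minimal sketch of why extra aggregations cost more in a column-oriented layout: each column is stored separately, so a count only touches the timestamp column, while every added sum has to read another column.  The table layout and names here are hypothetical and far simpler than Druid's actual storage format.</p>

```python
from typing import Dict, List, Sequence, Tuple

# A toy column store: one list per column, all of the same length.
Table = Dict[str, List[int]]


def aggregate(
    table: Table, start: int, end: int, sum_columns: Sequence[str] = ()
) -> Tuple[int, Dict[str, int], int]:
    """Count rows with start <= timestamp < end and sum the requested columns.

    Returns (row_count, sums, values_read) so the scan cost is visible.
    """
    timestamps = table["timestamp"]
    selected = [i for i, ts in enumerate(timestamps) if start <= ts < end]
    values_read = len(timestamps)  # the where clause scans the timestamp column
    sums: Dict[str, int] = {}
    for name in sum_columns:
        column = table[name]
        sums[name] = sum(column[i] for i in selected)
        values_read += len(selected)  # each metric adds one more column read
    return len(selected), sums, values_read


if __name__ == "__main__":
    table: Table = {
        "timestamp": [1, 2, 3, 4],
        "metric1": [10, 20, 30, 40],
        "metric2": [1, 1, 1, 1],
    }
    print(aggregate(table, 1, 5))
    print(aggregate(table, 1, 5, ("metric1", "metric2")))
```

<p>In this toy model the second call reads three times as many values as the first, mirroring the drop in scan rate as metrics are added.</p>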
<p><small><code><br />
"Select high_card_dimension, count(*) AS cnt from _table_ where timestamp &gt;= ? and timestamp &lt; ? group by high_card_dimension order by cnt desc limit 100;"</p>
<p>cluster                                            cluster scan rate (rows/sec)           core scan rate<br />
15-core, 100 nodes, in-memory                      10,241,183,745                          6,827,456<br />
15-core,  75 nodes, mmap                            4,891,097,559                          4,347,642<br />
15-core,  50 nodes, mmap                            3,616,707,511                          4,822,277<br />
15-core,  25 nodes, mmap                            1,665,053,263                          4,440,142<br />
4-core,   131 nodes, in-memory                      4,388,159,569                          8,374,350<br />
4-core,   131 nodes, mmap                           2,444,344,232                          4,664,779<br />
4-core,    50 nodes, mmap                           1,215,737,558                          6,078,688<br />
</small></code></p>
<p><small><code><br />
"Select high_card_dimension, count(*) AS cnt, sum(metric1) from _table_ where timestamp &gt;= ? and timestamp &lt; ? group by high_card_dimension order by cnt desc limit 100;"</p>
<p>cluster                                            cluster scan rate (rows/sec)           core scan rate<br />
15-core, 100 nodes, in-memory                       7,309,984,688                          4,873,323<br />
15-core,  75 nodes, mmap                            3,333,628,777                          2,963,226<br />
15-core,  50 nodes, mmap                            2,555,300,237                          3,407,067<br />
15-core,  25 nodes, mmap                            1,384,674,717                          3,692,466<br />
4-core,   131 nodes, in-memory                      3,237,907,984                          6,179,214<br />
4-core,   131 nodes, mmap                           1,740,481,380                          3,321,529<br />
4-core,    50 nodes, mmap                             863,170,420                          4,315,852<br />
</small></code></p>
<p><small><code><br />
"Select high_card_dimension, count(*) AS cnt, sum(metric1), sum(metric2), sum(metric3), sum(metric4) from _table_ where timestamp &gt;= ? and timestamp &lt; ? group by high_card_dimension order by cnt desc limit 100;"</p>
<p>cluster                                            cluster scan rate (rows/sec)           core scan rate<br />
15-core, 100 nodes, in-memory                       4,064,424,274                          2,709,616<br />
15-core,  75 nodes, mmap                            2,014,067,386                          1,790,282<br />
15-core,  50 nodes, mmap                            1,499,452,617                          1,999,270<br />
15-core,  25 nodes, mmap                              810,143,518                          2,160,383<br />
4-core,   131 nodes, in-memory                      1,670,214,695                          3,187,433<br />
4-core,   131 nodes, mmap                           1,116,635,690                          2,130,984<br />
4-core,    50 nodes, mmap                             531,389,163                          2,656,946<br />
</small></code></p>
<p>Here we see the superior performance of the in-memory representation when doing top lists versus when doing simple time-based aggregations.  This is an implementation detail, but it's largely because of the differences in accessing simple in-memory pointers, versus scanning and seeking through a flattened data structure (even though it is already largely paged into memory).</p>
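<p>For readers unfamiliar with top-list queries over sharded data, the general shape is: count per shard, merge the partial counts, then keep the N largest.  The sketch below shows that pattern with plain Python collections.  It is an illustration under our own assumptions, not Druid's implementation, and it says nothing about the in-memory versus mmap representations discussed above.</p>

```python
from collections import Counter
from typing import Iterable, List, Tuple


def shard_counts(rows: Iterable[str]) -> Counter:
    """Per-shard partial result: occurrences of each dimension value."""
    return Counter(rows)


def top_n(shards: Iterable[Iterable[str]], n: int) -> List[Tuple[str, int]]:
    """Merge per-shard counts and return the n most frequent values."""
    total: Counter = Counter()
    for rows in shards:
        total.update(shard_counts(rows))  # in a real cluster this runs per thread
    return total.most_common(n)


if __name__ == "__main__":
    # Hypothetical dimension values spread across three shards.
    shards = [["a", "b", "a"], ["b", "c", "a"], ["a"]]
    print(top_n(shards, 2))
```

<p>Merging full per-shard counts gives exact results; a system that ships only each shard's local top N to the merger would trade some accuracy for less data movement.</p>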
<p><strong>IV) Conclusions</strong></p>
<p>Our conclusions are three-fold.  First, we demonstrate that it is possible to provide real-time, fully interactive exploration of 6TB of data with distributed, cloud-hosted commodity hardware.</p>
<p>Second, we highlight the flexibility offered by the cloud.  Letting us stick to our core engineering competencies and having someone else deal with the overhead of running an actual data center is huge.  The fact that we were able to spin up 100 machines, run our benchmarks, kill 25, wait a bit, run benchmarks, kill another 25, wait a bit, run benchmarks, rinse and repeat was just awesome.</p>
<p>Finally, designing an architecture that horizontally scales for performance opens up a set of knobs for trading cost against performance.  If we can tolerate response times of 10 seconds instead of 1 second, we can pay less for our processing.  If we can tolerate response times of 1 minute, we pay even less.  Conversely, if we need answers in milliseconds, this is achievable at a higher price point.</p>
<p><strong>V) Using Druid</strong></p>
<p>We currently offer Druid as a hosted service, but are exploring steps to open up the platform to a developer community.  If you would like to explore either using our hosted service or being part of a developer community, please <a href="http://metamarkets.com/contact/">drop us a note</a>.</p>
<p><em>(If you'd like to be part of the Druid Team, <a href="http://metamarkets.com/jobs/">we're currently hiring</a>.)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2012/scaling-druid/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Salon #1: Visualization</title>
		<link>http://metamarkets.com/2012/data-salon-1-visualization/</link>
		<comments>http://metamarkets.com/2012/data-salon-1-visualization/#comments</comments>
		<pubDate>Wed, 18 Jan 2012 01:54:10 +0000</pubDate>
		<dc:creator>Xavier Léauté</dc:creator>
				<category><![CDATA[Data Visualization]]></category>

		<guid isPermaLink="false">http://metamarkets.com/?p=718</guid>
		<description><![CDATA[Last week kicked off the first in a series of “Data Salons” we are holding here at Metamarkets. The goal, as Michael Driscoll put it, is “to bring people together and talk about cool stuff, and keep it small”. This &#8230; <a href="http://metamarkets.com/2012/data-salon-1-visualization/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<div>
<p dir="ltr">Last week kicked off the first in a series of “Data Salons” we are holding here at Metamarkets. The goal, as Michael Driscoll put it, is “to bring people together and talk about cool stuff, and keep it small”. This is something we had been thinking about doing for a while and thanks to the overwhelming response from everyone involved, it was a real success.</p>
<p>We had a great lineup of speakers for the first topic in the series: data visualization. Following our post on <a href="http://metamarkets.com/2011/the-rise-of-dynamic-data-visualization/">the rise of interactive data visualization</a>, we decided to bring together some of the people designing visualizations as well as the people behind the frameworks used to build them, so they could share with us some of the projects they are working on and how they approach the problems they are trying to solve.</p>
<h2 dir="ltr">At the crossroads of art &amp; science, design &amp; engineering</h2>
<p dir="ltr">People working in data visualization come from varied backgrounds, and it is interesting to see how they embrace the engineering and design challenges involved. We see engineers becoming designers as well as designers embracing the engineering side of data visualization. At many levels it is both an art and a science, and the variety of people who attended the salon is a great example of that.</p>
<p dir="ltr"><a href="http://www.mbecicadesign.com/">Mary Becica</a> described how her architecture background influences how she approaches data visualization problems. She starts by putting the data into context, and letting that context inform the visual representation to give the data. Too often people start with a preconceived visual without giving extra thought to what form the data should take.</p>
<p dir="ltr"><span id="more-718"></span></p>
<p dir="ltr"><a href="http://mike.teczno.com/">Mike Migurski</a> gave us a glimpse into <a href="http://www.openstreetmap.org/">OpenStreetMap</a> and the various projects he is working on to help create maps that go beyond a standard Google Maps overlay. His objective is to design maps that are suitable to many types of overlays without distracting from or interfering with the information that is being surfaced. Again, setting the context is fundamental and the work that goes into the base map layer plays an important role in defining that context. From the subtle color palettes used to enhance topographic relief all the way to labeling heuristics, striking the right balance is very much an art, and scaling that to a wide range of resolutions and sizes requires quite a bit of engineering wizardry.</p>
<h2 dir="ltr">Telling a story</h2>
<p dir="ltr">Whether designing an interactive book or building dynamic infographics, we ultimately want to tell a story through the data, and visualization is the key to how we expose those insights.</p>
<p dir="ltr"><a href="http://worrydream.com/">Bret Victor</a> walked us through the playful interactive graphics he created for <a href="http://itunes.apple.com/us/app/id432753658?mt=8">“Our Choice”</a>, Al Gore’s interactive book on climate change. Bret shared with us his insights on how to engage the audience and guide them through the data with limited effort, reducing the friction of discovery and learning.</p>
<p dir="ltr">With the ability to create ever richer interactive experiences it becomes critical not to drown the user in gratuitous forms of interaction, but focus on essential mechanisms that help to understand and explore the data. This becomes even more important when designing for mobile interfaces. Mobile devices allow for richer types of interactions, but they are often less obvious to the user. It is critical to provide hints and visual cues so the user can maximize their experience of interactive graphics.</p>
<p dir="ltr">Expanding on this idea, <a href="http://www.nickbilton.com/">Nick Bilton</a> recounted some of his experiences at the New York Times R&amp;D labs. There he was able to leverage the trove of data available at the New York Times to visualize how information spreads through various (social) networks. Through his work Nick has uncovered many interesting and surprising patterns, but sifting through such large volumes of data can be quite a challenge. Recent improvements in frameworks and visualization tools have started to make this easier.</p>
<h2 dir="ltr">Evolution of visualization frameworks</h2>
<p dir="ltr">There is no one better to talk about these frameworks than the authors who created them. We brought together <a href="http://vis.stanford.edu/jheer/">Jeff Heer</a>, <a href="http://bost.ocks.org/mike/">Mike Bostock</a> and <a href="http://philogb.github.com/">Nicolas Garcia Belmonte</a> for a freeform panel to discuss everything from <a href="http://prefuse.org/">prefuse</a>, <a href="http://mbostock.github.com/protovis/">protovis</a> and <a href="http://mbostock.github.com/d3/">d3</a> to <a href="http://thejit.org/">infovis</a> and <a href="http://philogb.github.com/philogl/">philogl</a>. It was a great occasion to learn about what motivated them, the evolution of the different frameworks and the type of problems they were trying to solve along the way.</p>
<p dir="ltr">Building on this line of toolkits, Vadim Ogievetsky has developed <a href="https://github.com/vogievetsky/DVL">DVL</a>, a reactive visualization framework. He demonstrated how easy it can be to build multi-faceted visualizations for high-dimensional real-time data feeds. The toolkit abstracts many of the dependencies between data and representation, without losing the flexibility of the underlying graphical framework. His work is a key element of the Metamarkets platform.</p>
<h2 dir="ltr">Looking forward to data salon #2</h2>
<p dir="ltr">All said, this event was a very enriching experience, thanks to the impressive list of attendees and fantastic presenters. We also want to thank all of you who helped out, including Nisha Pathak for handling the logistics. We had an excellent response to this salon and are looking forward to hosting more events like this.</p>
<p dir="ltr">For more information about the presenters’ work check out their sites here:</p>
<ul>
<li>Bret Victor <a href="http://worrydream.com/">http://worrydream.com/</a></li>
<li>Nick Bilton <a href="http://www.nickbilton.com/">http://www.nickbilton.com/</a></li>
<li>Mike Migurski <a href="http://mike.teczno.com/">http://mike.teczno.com/</a></li>
<li>Mary Becica <a href="http://www.mbecicadesign.com/">http://www.mbecicadesign.com/</a></li>
<li>Jeff Heer <a href="http://vis.stanford.edu/jheer/">http://vis.stanford.edu/jheer/</a></li>
<li>Mike Bostock <a href="http://bost.ocks.org/mike/">http://bost.ocks.org/mike/</a></li>
<li>Nicolas Garcia Belmonte <a href="http://philogb.github.com/">http://philogb.github.com/</a></li>
<li>Vadim Ogievetsky <a href="https://github.com/vogievetsky/DVL">https://github.com/vogievetsky/DVL</a></li>
</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2012/data-salon-1-visualization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Milestones and Transitions</title>
		<link>http://metamarkets.com/2012/milestones-and-transitions/</link>
		<comments>http://metamarkets.com/2012/milestones-and-transitions/#comments</comments>
		<pubDate>Thu, 12 Jan 2012 15:47:46 +0000</pubDate>
		<dc:creator>David Soloff</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://metamarkets.com/?p=651</guid>
		<description><![CDATA[Milestones Mike Driscoll and I founded Metamarkets in 2009 to provide large scale analytics to media companies, and I’m very gratified we have reached our objectives. Over the past 18 months, Metamarkets has built and shipped some of the most &#8230; <a href="http://metamarkets.com/2012/milestones-and-transitions/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><strong>Milestones</strong></p>
<p>Mike Driscoll and I founded Metamarkets in 2009 to provide large scale analytics to media companies, and I’m very gratified we have reached our objectives. Over the past 18 months, Metamarkets has built and shipped some of the most scalable, cutting-edge data analytics infrastructure in the marketplace. Our engineering and product teams have designed, built and deployed from scratch a cloud-hosted integrated analytics stack on behalf of market-leading, web-scale media businesses.</p>
<p>As we have pushed into the market, we have seen our product scale to accommodate the biggest, fastest-moving event sets on the internet. The core value proposition for our product has been validated through our customer set: the Metamarkets stack eliminates the need for a business to integrate multiple disparate software solutions at the data ingestion, database, analytics and visualization layers. This is something revolutionary for the data analytics industry.</p>
<p>The speed, scalability and usability advantages of our integrated platform are apparent to our customers. Our goal is to deliver up-to-the-minute, quantitative, operational intelligence about a business’s transaction streams, at a scale previously incomprehensible, at a cost-effective price point, and for a broad customer set. In short, it’s been our goal to enable our partners to interact with their critical transaction event data when, where and how they choose: as the first thing they check when they wake up, and as the last thing they check before going to sleep. Our mobile and tablet products now enable our customers to analyze their data in the middle of the night.</p>
<p>Without a doubt, our team’s stunning product achievements in 2011 are the accomplishment of which I’m most proud. We have been true to our original vision, and now lead the market for hosted, web-scale Business Intelligence for the global media and advertising industries. Now that Metamarkets has established its offer in an initial set of verticals, we can turn to address the broader emergent market opportunity, one that both Mike and I can honestly state is far larger than our initial conception when we launched the company a couple of years ago.</p>
<p><span id="more-651"></span><br />
<strong>The Big Swing</strong></p>
<p>Metamarkets sits at the intersection of three megatrends: Big Data, analytics, and cloud computing. We believe this opportunity must be seized quickly and aggressively, and so in 2012, Metamarkets will extend our platform into adjacent, data-intensive verticals that we know are hungry for our brand of cost-effective analytics at scale.</p>
<p>A world leading company in big data analytics requires appropriate leadership given the specialized technical nature of the opportunity. I’m incredibly grateful and proud to announce that this is the time for Mike Driscoll, my Metamarkets co-founder and CTO, to take over as Metamarkets’ CEO. Mike is a world-class technical talent and rising star in data software, one of the most intensely brilliant technical product minds I’ve encountered in over 15 years of financial and data software work, a colleague whom I’ve had the great good fortune to call co-founder, and with whom I’ve worked shoulder-to-shoulder to build a great foundation over the past two years. I have the utmost confidence in Mike’s abilities, vision and instincts to take Metamarkets to the next level (or two, or three). Mike will take our initial vision of a distributed, fast, analytics service, and grow Metamarkets into something far more transformative than anybody could have imagined a couple of years ago.</p>
<p><strong>Transitions</strong></p>
<p>Nurturing and launching a company is very different from scaling one. I have spent three years getting Metamarkets off the ground, the past two in bringing our product into the market. In this time, we’ve built a company and put the foundational product and engineering teams in place. And so the time is right for me to return to my passion: the early stages of product ideation and new venture formation. I’m extremely fortunate to be joining a leading early-stage VC fund as Venture Partner and EIR. I have a number of ideas I am considering, though there is one I am particularly passionate about launching in 2012 – more about my next steps at a later date.</p>
<p>Most importantly, despite stepping away from the CEO’s chair, I will continue to serve actively on Metamarkets’ Board of Directors, and I’ve coordinated with Mike to continue to advise the Metamarkets product and sales teams on their approach to the media and advertising markets. Our current partners and customers will experience no change in their service or interactions with the company. We continue to aim high and service our partners at the same exceptional levels we established in 2011. I’m very much looking forward to helping Metamarkets continue to grow, while also advancing some of my own personal career objectives. The time is right for this transition, and Mike and I have embraced it.</p>
<p>It’s been great to get Metamarkets out of the garage, moving along local streets, following access roads, and now onto the highway. We can see the wide, fast road ahead, and I for one can feel the car accelerating…</p>
<p>DS</p>
<p><strong>The Road Ahead</strong></p>
<p>When David and I began our start-up journey together in 2009, in a windowless office on Townsend Street, I doubt if either of us could have conceived of the successes (nor of the sleepless nights) that the subsequent years would bring.  We are fortunate to have assembled a team of extraordinary engineers, whose dedicated efforts have yielded a revenue-generating product serving a growing cast of engaged customers.</p>
<p>I am proud of what we have accomplished to date under David's leadership as CEO.  I look forward to having his unparalleled intellect and instincts at work for us as we move forward.  And I'm humbled by the opportunity to build on his achievements and begin scaling our organization to match the scope of our market opportunity.</p>
<p>To that end, today I am announcing that Ken Chestnut, a former executive at MarkLogic and Siebel Systems, will be joining Metamarkets to lead our marketing efforts.  Ken brings over 15 years of experience in the management, strategy, and marketing of business software tools.  Ken joins Charlene Son Rigby, a former Oracle executive, who leads our sales and operations and joined us two months ago.</p>
<p>Working with our engineering leadership, Ken and Charlene will help us capitalize on our core technology: our in-memory data store, analytics engine, and real-time, interactive dashboard.</p>
<p>As we will describe in an upcoming blog post, we have scaled our cloud-based data and compute infrastructure by a factor of 100 since bringing our first customer onboard over a year ago.  As of last week, we clocked a new performance record: a query that processed 26 billion records in under a second.</p>
<p>It is time for Metamarkets to double down on our technology, address verticals beyond the digital media markets, and lengthen our lead as a pioneer in big data solutions.</p>
<p>As I shift from my role as CTO, I aim not only to maintain but to strengthen the engineering-driven culture that has gotten us to this stage.</p>
<p>MD</p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2012/milestones-and-transitions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Beyond Hadoop:  Fast Queries from Big Data</title>
		<link>http://metamarkets.com/2011/hadoops-secret-shortcoming-speed-and-how-to-fix-it/</link>
		<comments>http://metamarkets.com/2011/hadoops-secret-shortcoming-speed-and-how-to-fix-it/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 10:18:30 +0000</pubDate>
		<dc:creator>mike</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://metamarketsgroup.com/blog/?p=584</guid>
		<description><![CDATA[There's an unspoken truth lurking behind the scourge of Big Data and the heralding of Hadoop as its savior: while Hadoop shines as a processing platform, it is painfully slow as a query tool. Hive was developed by the folks &#8230; <a href="http://metamarkets.com/2011/hadoops-secret-shortcoming-speed-and-how-to-fix-it/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>There's an unspoken truth lurking behind the scourge of Big Data and the heralding of Hadoop as its savior:  while Hadoop shines as a processing platform, it is painfully slow as a query tool.</p>
<p>Hive was developed by the folks at Facebook in 2008, as a means of providing an easy-to-use, SQL-like query language that would compile to MapReduce code.  A year later, Hive was responsible for <a href="http://borthakur.com/ftp/hadoopworld.pdf">95% of the Hadoop jobs</a> run on Facebook's servers.  This is consistent with another observation made by Cloudera's Jeff Hammerbacher: when Hive is installed on a client's Hadoop cluster, <a href="http://www.dataspora.com/2009/11/sql-is-dead-long-live-sql/"> its overall usage increases tenfold.</a></p>
<p>That data-heavy businesses can achieve visibility into the terabytes of logs that they generate is, at a primary level, a major step forward. Before the Hadoop era, this was difficult to impossible without a major engineering investment.  Thus Hadoop has solved the challenge of economically processing data at scale.  Hive has solved the challenge of hand-writing Hadoop queries.</p>
<p>But there remains a painful challenge that Hive and Hadoop do not solve: speed.</p>
<p><strong>A Powerful But Lumbering Elephant</strong><br />
<span id="more-584"></span><br />
Hadoop does not respond anywhere close to "human time", a term <a href="http://radar.oreilly.com/2011/09/evolution-of-data-products.html"> that describes response thresholds </a> acceptable to a human user, typically on the order of seconds.  Larry Ellison and his marketing mavens invoke a similar theme when pitching their wares as "analytics at the speed of thought."</p>
<p>Nonetheless, this sluggishness is not the fault of Hive or Hadoop per se.  If a business user asks a question about a year's worth of data with Hive, a set of MapReduce jobs will dutifully scan and process, in parallel, terabytes of data to obtain the answer.  It's neither the commodity hardware that most Hadoop clusters use nor <a href="http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/"> some of its IO indulgences </a> while executing processes, that are to blame. These are the low-order performance bits.</p>
<p>And while Hadoop jobs do have a fairly constant overhead -- with a lower bound in the range of 15 seconds -- this is often considered trivial within the context of the minutes or hours that most full jobs are expected to take.</p>
<p>The higher-order bits affecting query performance are: (i) the size of the data being scanned, (ii) the nature of storage, e.g. whether it is kept on disk or in memory, and (iii) the degree of parallelization.</p>
<p><strong>An Emerging Design Pattern:  Distill, then Store </strong></p>
<p>As a result, a common design pattern is emerging among data-heavy firms: Hadoop is used as a pre-processing tool to generate summarized data cubes, which are then loaded into an in-memory, parallelized database -- be it Oracle Exalytics, Netezza, Greenplum or even <a href="http://corp.klout.com/blog/2011/11/big-data-bigger-brains/">Microsoft SQL Server</a>.  Occasionally, a traditional database query layer can be bypassed altogether, and summary data cubes can be loaded directly into a desktop analytics tool such as Qlikview, Spotfire, or Tableau.</p>
<p>At Metamarkets, we have embraced this design pattern and the role that Hadoop plays in preparing data for fast queries.  Our particular bag of tricks is best described by the <a href="http://metamarketsgroup.com/blog/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/">three principles of Druid</a>:</p>
<ul>
<li><strong>Distill:</strong>  We roll data up to the coarsest grain at which a user might have reasonable interest.  Put simply, it is rare that one is concerned with individual events at one-second time frames.  Rolling up to groups of events, with a select set of dimensions and at minute-level or hourly granularity, can distill raw data's footprint down to 1/100th of its original size.
</li>
<li><strong>Distribute:</strong> While this summarized data is spread across multiple nodes in our cluster, the queries against this data are also distributed and parallelized.  In our quest to break into the "human time" threshold, we have increased this parallelization to as many as 1000 cores, allowing each query to hit a large percentage of nodes on our cluster.  In our experience, CPUs are rarely the bottleneck for systems serving human clients, even for a cluster serving hundreds of users concurrently.</li>
<li><strong>DRAM:</strong> We share Curt Monash's sentiment that <a href="http://www.dbms2.com/2011/05/23/databases-ram/"> traditional databases will eventually end up in RAM </a>, as memory costs continue to fall.  In-memory analytics are popular because they are fast, often 100x to 1000x faster than disk.  This dramatic performance kick is what makes Qlikview such a popular desktop tool.
</li>
</ul>
<p>The end result of these three techniques, each of which independently delivers between a 10 and 1000-fold improvement, is a platform that can run in seconds what previously took minutes or even hours in Hive.</p>
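<p>To make the Distill step concrete, here is a minimal sketch of an hourly roll-up. It is an illustration only, not our production code: the event fields (timestamp, publisher, country, impressions), the sample values and the function name roll_up_hourly are all assumptions made for this example.</p>

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (ISO timestamp, publisher, country, impressions).
# Field names and values are illustrative, not taken from a real data set.
RAW_EVENTS = [
    ("2011-11-04T10:00:03", "site-a", "US", 1),
    ("2011-11-04T10:00:07", "site-a", "US", 1),
    ("2011-11-04T10:17:45", "site-a", "US", 1),
    ("2011-11-04T10:59:59", "site-b", "GB", 1),
    ("2011-11-04T11:00:00", "site-a", "US", 1),
    ("2011-11-04T11:30:12", "site-b", "GB", 1),
]


def roll_up_hourly(events):
    """Sum impressions per (hour, publisher, country)."""
    totals = defaultdict(int)
    for timestamp, publisher, country, impressions in events:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0)
        totals[(hour.isoformat(), publisher, country)] += impressions
    return dict(totals)


# Six raw events collapse into four summary rows.
ROLLED_UP = roll_up_hourly(RAW_EVENTS)
```

<p>The reduction here is tiny because the sample is tiny. The ratio improves as more events share the same hour and dimension values, which is the effect the Distill principle relies on.</p>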
<p>This approach, which <a href="http://blog.aggregateknowledge.com/2011/09/08/our-approach/">we know we are not alone</a> in pursuing, achieves performance that matches or exceeds that of any of the <a href="http://gigaom.com/cloud/why-oracles-big-boxes-are-on-the-wrong-side-of-history/">big box retailers</a> at a considerably lower price point.</p>
<p>The commoditization wave that began with massive data processing, initiated by Hadoop, is migrating upwards towards query architectures. Thus the competitive differentiators are shifting away from large-scale data management and towards what might be called Big Analytics, where the next battle for profits will be fought.</p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2011/hadoops-secret-shortcoming-speed-and-how-to-fix-it/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>&quot;Designing Futures Where Nothing Will Occur&quot;:  The Art of Forecasting</title>
		<link>http://metamarkets.com/2011/designing-futures-where-nothing-will-occur/</link>
		<comments>http://metamarkets.com/2011/designing-futures-where-nothing-will-occur/#comments</comments>
		<pubDate>Tue, 02 Aug 2011 06:00:25 +0000</pubDate>
		<dc:creator>Joe Reisinger</dc:creator>
				<category><![CDATA[machine learning]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://metamarketsgroup.com/blog/?p=470</guid>
		<description><![CDATA[One of the key analytics products we provide is robust time-series forecasting over multidimensional faceted data, which requires giving users access to a number of predictions exponential in the base data dimensionality. For online publishers, such forecasting includes simple metrics &#8230; <a href="http://metamarkets.com/2011/designing-futures-where-nothing-will-occur/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><span style="font-weight: normal;">One of the key analytics products we provide is robust time-series forecasting over multidimensional faceted data, which requires giving users access to a number of predictions exponential in the base data dimensionality. For online publishers, such forecasting includes simple metrics like impressions, revenue and actions, and more generally we might want to model CTR or more complex metrics like the opportunity cost of showing house ads versus sponsorship ads. </span></p>
<p>Existing forecasting systems tend to exist inside of traditional OLAP storage offerings that embed a limited set of statistical functions, many of which are poorly optimized and slow to execute. The machine learning renaissance of the last decade has been tempered by powerful additions from the algorithms and distributed systems communities: <a href="http://hunch.net/?p=273">compressed sensing</a>, <a title="hash representations" href="http://hunch.net/~jl/projects/hash_reps/index.html">feature hashing</a> and <a href="http://www.quora.com/What-are-some-examples-of-the-use-of-machine-learning-in-distributed-systems">feature sharding</a> across multiple cores have significantly lowered computational costs, while advances in regularization and <a title="structured sparsity" href="https://sites.google.com/site/icml2011sparsity/">structured sparsity</a> from applied statistics have likewise increased model robustness, allowing unprecedented modeling of high-degree feature interactions.</p>
<p>In this post I'll outline some of the engineering challenges behind putting a large-scale machine learning system into production, focusing less on actual learning algorithm implementations (as these are fairly commoditized at this point: <a title="mahout" href="http://mahout.apache.org/">mahout</a>, <a title="vowpal wabbit" href="https://github.com/JohnLangford/vowpal_wabbit/wiki">vowpal wabbit</a>, <a title="scikits.learn" href="http://scikit-learn.sourceforge.net/stable/">scikits.learn</a>, etc) and instead addressing our engineering architecture. In particular we'll look at time-series forecasting, separate from more generic prediction, where our end goal is to create forecasts <a href="http://www.blackbird.vcu.edu/v5n2/poetry/plath_s/ennui.htm">where nothing interesting occurs</a>.*<br />
<span id="more-470"></span></p>
<h2>Forecasting system</h2>
<p>For forecasting, we combine predictions from two separate subsystems: a top-down "structural" model and a bottom-up "cross-correlation" model.</p>
<p><strong>Structural model</strong><br />
Top-down models capture long-term trends and other temporal correlations:</p>
<p style="text-align: center;"><a href="/wp-content/uploads/2011/08/11.png"><img class="size-large wp-image-490 aligncenter" style="margin: 0px;" src="/wp-content/uploads/2011/08/11-1024x188.png" alt="" /></a></p>
<p style="text-align: center;"><a href="/wp-content/uploads/2011/08/21.png"><img class="size-large wp-image-497 aligncenter" style="margin: 0px;" title="iPad" src="/wp-content/uploads/2011/08/21-1024x188.png" alt="" /></a></p>
<p>For example, some of our clients tend to see more traffic on certain sites during the Academy Awards, but less on Easter Sunday (unless they're living in Japan or Hong Kong, etc). Through several variants of this model, we can also address transient data anomalies:</p>
<p style="text-align: center;"><a href="/wp-content/uploads/2011/08/31.png"><img class="size-large wp-image-494 aligncenter" style="margin: 0px;" src="/wp-content/uploads/2011/08/31-1024x188.png" alt="" /></a></p>
<p>and structural changes in baseline and trend, such as new site layouts that affect aggregate traffic flows:</p>
<p style="text-align: center;"><a href="/wp-content/uploads/2011/08/41.png"><img class="size-large wp-image-499 aligncenter" style="margin: 0px;" src="/wp-content/uploads/2011/08/41-1024x188.png" alt="" /></a><a href="/wp-content/uploads/2011/08/51.png"><img class="size-large wp-image-500 aligncenter" style="margin: 0px;" src="/wp-content/uploads/2011/08/51-1024x188.png" alt="" /></a></p>
<p>How these effects are integrated into forecasts depends on the semantics of the underlying metric: for online advertising impressions we may want to simply predict "baseline" guaranteed traffic; when forecasting price, we may be more interested in the structure of transient changes.</p>
<p><strong>Cross-correlation model</strong></p>
<p>Bottom-up large-scale models capture fine-grained atemporal feature interactions. This approach trades model flexibility for brute force data mining power, capturing high-order correlations between surface data, such as city x site effects and audience x time-of-day effects. An example of what this model captures is that visitors with New York City and London IP addresses are more likely to consume financial news content on weekday mornings.</p>
<p>For learning, we use stochastic gradient descent for parameter updates and implicit high-order feature generation to capture cross-correlations, combined with aggressive regularization and low bit-order feature hashing to combat overfitting and keep updates efficient. These choices have become the de facto state of the art for computational efficiency and prediction quality, powering systems such as Google's <a title="sybil" href="http://2010.ladisworkshop.org/node/10#keynote1">Sybil</a> and <a title="vowpal wabbit" href="https://github.com/JohnLangford/vowpal_wabbit/wiki">Vowpal Wabbit</a> (cf. the <a title="lccc" href="http://lccc.eecs.berkeley.edu/">learning on cores, clusters and clouds</a> workshop at NIPS).</p>
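<p>The combination described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not our production learner and not the Vowpal Wabbit API: the 10-bit hash width, the CRC32 hash, the logistic loss, the field names and the four training records are all made up for the example.</p>

```python
import math
import zlib

NUM_BITS = 10                 # assumed hash width for this sketch
NUM_SLOTS = 2 ** NUM_BITS     # number of weight slots


def hashed_features(record):
    """Map raw fields and their pairwise crosses to hashed weight slots."""
    tokens = [f"{key}={value}" for key, value in sorted(record.items())]
    # Implicit second-order crosses, e.g. "city=NYC^site=finance".
    crosses = [f"{a}^{b}" for i, a in enumerate(tokens) for b in tokens[i + 1:]]
    return [zlib.crc32(token.encode("utf-8")) % NUM_SLOTS for token in tokens + crosses]


def predict(weights, record):
    """Logistic prediction from the hashed feature weights."""
    score = sum(weights[slot] for slot in hashed_features(record))
    return 1.0 / (1.0 + math.exp(-score))


def sgd_step(weights, record, label, learning_rate=0.1, l2=1e-4):
    """One stochastic gradient step with L2 regularization."""
    error = predict(weights, record) - label
    for slot in hashed_features(record):
        weights[slot] -= learning_rate * (error + l2 * weights[slot])


# Hypothetical training data: did the visitor read financial news?
TRAINING = [
    ({"city": "NYC", "site": "finance", "daypart": "weekday-am"}, 1),
    ({"city": "NYC", "site": "sports", "daypart": "weekend-pm"}, 0),
    ({"city": "London", "site": "finance", "daypart": "weekday-am"}, 1),
    ({"city": "London", "site": "sports", "daypart": "weekend-pm"}, 0),
]

weights = [0.0] * NUM_SLOTS
for _ in range(200):
    for record, label in TRAINING:
        sgd_step(weights, record, label)
```

<p>Hashing keeps the weight vector at a fixed size no matter how many distinct crosses appear in the data, at the price of occasional collisions. Regularization is what keeps those collisions, and the sheer number of crosses, from dominating the fit.</p>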
<p>We also built an expressive feature specification framework directly into our configuration language; it is well-established that feature engineering often has a much larger impact on predictive quality than choice of learning algorithm (cf. the <a href="http://www2.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf">BellKor</a> Netflix Solution, <a href="http://pslcdatashop.org/KDDCup/workshop/">KDD Cup 2010</a> discussions, the <a href="http://nlp.stanford.edu/IR-book/html/htmledition/features-for-text-1.html">Stanford IR Book</a>).</p>
<h2>System-level overview</h2>
<p>In addition to generating state of the art forecasts, we identified three main engineering desiderata:</p>
<h3><strong>Scalable</strong></h3>
<p>Our underlying model framework must (1) cope with multiple terabytes of data per client, (2) degrade gracefully, automatically prioritizing the most important aspects of prediction over less important ones (maintain high precision / manipulate recall), and (3) be parallelizable across a commodity Hadoop cluster.</p>
<h3><strong>Incremental </strong></h3>
<p>Fresh data arrives daily, hourly, or even in real time, and as such we cannot waste computational resources constantly retraining the predictive models from scratch. However, online incremental learning suffers from convergence effects and drift in the underlying data. Therefore we took a hybrid approach: we couple an incremental model, updated as data arrives, with an introspective changepoint detection system that forces full model rebuilds. This approach also gives us more scaling flexibility: incremental updates run in O(hours) while batch retrains run in O(days).</p>
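<p>A toy version of this hybrid loop is sketched below. It is far simpler than a real changepoint detector and should be read as an assumption-laden illustration: the class name HybridForecaster, the exponentially smoothed level, the run-length rule for declaring a changepoint and all parameter values are invented for the example.</p>

```python
from collections import deque


class HybridForecaster:
    """Incremental level estimate with rebuilds forced by a changepoint rule."""

    def __init__(self, smoothing=0.1, window=5, threshold=3.0):
        self.smoothing = smoothing      # weight given to each new observation
        self.window = window            # surprising points needed to force a rebuild
        self.threshold = threshold      # error ratio that counts as surprising
        self.level = None               # current forecast
        self.typical_error = 1.0        # running mean absolute error (assumed start)
        self.recent = deque(maxlen=window)
        self.surprises = 0
        self.rebuilds = 0

    def update(self, value):
        self.recent.append(value)
        if self.level is None:
            self.level = float(value)
            return
        error = abs(value - self.level)
        if error > self.threshold * self.typical_error:
            # Hold the model steady while the evidence accumulates.
            self.surprises += 1
            if self.surprises >= self.window:
                # Sustained deviation: discard the old state and refit
                # on the most recent window only (the "full rebuild").
                self.level = sum(self.recent) / len(self.recent)
                self.surprises = 0
                self.rebuilds += 1
            return
        # Ordinary observation: cheap incremental update.
        self.surprises = 0
        self.level += self.smoothing * (value - self.level)
        self.typical_error += self.smoothing * (error - self.typical_error)


# A series whose baseline shifts from 100 to 200 partway through.
forecaster = HybridForecaster()
for observation in [100] * 20 + [200] * 10:
    forecaster.update(observation)
```

<p>The design point is the split in cost: the incremental branch touches only a couple of numbers per observation, while the rebuild branch is the expensive path and runs only when the detector insists.</p>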
<h3><strong>Modular</strong></h3>
<p>Generic machine learning efforts <a href="/2011/machine-learning-in-wonderland/">ultimately fail</a>, so we acknowledge the need for continual model development and backtesting. Hence we want a system with simple, modular configuration for tuning model parameters and error metrics, and a clean component architecture. Our current component set includes multiple data ingestion pipelines (plus automated data versioning and incremental caching along the way) and several different high-level forecasting models that take into account structural changepoints, conditional heteroscedasticity, and long-term trends.</p>
<p>We also support additional covariance models as overlays, for example audience tags overlaid with site usage data, via online matrix factorization (think: collaborative filtering / Netflix prize). These models capture structure that is orthogonal to the underlying forecast, but nonetheless important from a user perspective. For instance, during the Women's World Cup / Olympics, gender-based consumption of sports content exhibits strong shifts.</p>
<h2>The Future</h2>
<p>Despite the effort put into "future-proofing" the forecasting system, there will inevitably be features that force us to rethink our design. Currently we're working to expose the underlying prediction stack more cleanly, in order to reduce friction for surfacing atemporal analytics. Further afield, we intend to address:</p>
<ul>
<li>Model integration / combining outputs of multiple models hierarchically. E.g., integrated supply and demand forecasting.</li>
<li>Automated prior estimation for new data facets.</li>
<li>Automated prediction validation scoring framework (independent of prediction source).</li>
</ul>
<p>[*] "<a href="http://www.blackbird.vcu.edu/v5n2/poetry/plath_s/ennui.htm">Ennui</a>" -- with apologies to Sylvia Plath.</p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2011/designing-futures-where-nothing-will-occur/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Rise of Interactive Data Visualization</title>
		<link>http://metamarkets.com/2011/the-rise-of-dynamic-data-visualization/</link>
		<comments>http://metamarkets.com/2011/the-rise-of-dynamic-data-visualization/#comments</comments>
		<pubDate>Tue, 28 Jun 2011 13:37:31 +0000</pubDate>
		<dc:creator>mike</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://metamarketsgroup.com/blog/?p=406</guid>
		<description><![CDATA[The visualization below highlights something only recently possible on the web: a dynamic, interactive canvas. Titled "Disaster Strikes: A World In Sight", it visualizes a century of floods, fires, droughts, and earthquakes around the globe. (Below is a snapshot of &#8230; <a href="http://metamarkets.com/2011/the-rise-of-dynamic-data-visualization/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The visualization below highlights something only recently possible on the web: a dynamic, interactive canvas.  Titled <a href="http://disaster.mmx-dns.com">"Disaster Strikes: A World In Sight"</a>, it visualizes a century of floods, fires, droughts, and earthquakes around the globe.  (Below is a snapshot of 1996, an apparently costly year for disasters).</p>
<p>It's not a passively animated graphic, but one that users can actively engage with, freezing or pivoting dimensions to reveal new views of the data.  It's a harbinger of a new class of documents, which digital publishers are beginning to embrace, to provide a richer information experience for readers.</p>
<p><a href="http://disaster.mmx-dns.com"><img src="/wp-content/uploads/2011/06/Disaster_Strikes-1024x844.png" alt="" title="Disaster_Strikes" width="640" height="527" class="alignleft size-large wp-image-422" /></a></p>
<h2>Meet the Interactive Frameworks</h2>
<p><span id="more-406"></span><br />
That the above graphic could be built in a single weekend (it was part of a larger hackathon called <a href="http://datainsightsf.com/">Data In Sight</a> that <a href="http://www.twitter.com/schloerke"> Barret Schloerke </a> and <a href="http://datainsightsf.com/teams/">his team 13 </a> participated in) is testament to the maturity of tools available.</p>
<p>In the last few years, there has been a blossoming of frameworks for creating rich, dynamic infographics.  These include <a href="http://processing.org/">Processing</a> (and <a href="http://processingjs.org/">Processing.js</a>), <a href="http://www.adobe.com/products/flex/">Adobe Flex</a>, <a href="http://raphaeljs.com/">Raphael</a>, <a href="http://prefuse.org/">Prefuse and Flare</a>, <a href="http://vis.stanford.edu/protovis/">Protovis</a>, and now <a href="https://github.com/mbostock/d3">D3</a>, among others.</p>
<p>These frameworks present new possibilities for data visualization, but also challenges.  Expose too little interaction, and one risks being little different than a static visualization. Expose too much interaction, and the user is overwhelmed by a jumble of buttons and sliders, with no clear narrative path.</p>
<p>Used well, interaction is a means to escape flatland.  In the Disaster Strikes graphic, for example, it is used to encode an additional two dimensions of data (disaster metric and time) beyond the three that are possible with a heatmap (in this case, country and disaster class on the axes, and magnitude at points inside the matrix).  This allows the graphic to express five dimensions of disaster.</p>
<p>The <em>Disasters</em> heatmap leveraged two tools that are also used internally at Metamarkets:</p>
<ul>
<li><a href="https://github.com/mbostock/d3"> D3 (short for "data driven documents") </a> - the sequel to <a href="http://vis.stanford.edu/protovis/"> Protovis </a></li>
<li>DVL (short for "dynamic visualization LEGOs") - a framework for building event-driven web pages, developed by our very own Vadim Ogievetsky</li>
</ul>
<p>The short but painful history of several rich web toolkits offers a lesson for one's choice of interactive visualization tools: choose those that work well with web standards and the DOM.  This favors Javascript frameworks, such as Processing.js and D3, over those that may rely on browser plug-ins (yes, I am <a href="http://metamarketsgroup.com/blog/node-js-and-the-javascript-age/">biased about Javascript</a>).</p>
<p>Next I turn to some of the challenges of designing interactive visualizations, namely working with time, revealing stories, and surfacing state.</p>
<h2>Visualize Time as a Flow, not a Flicker</h2>
<p>Evolving a visualization in time is a powerful technique that should be used with care.  Displaying discrete jumps of data over time can be disconcerting for a viewer, making it hard to follow patterns.  One valuable way to address this challenge is to smear time: let events fade into the past, rather than showing only a fast flicker of the present.  Providing a ghosting of the recent past, where data flows, can provide a wider temporal context for otherwise diffuse events (as is often the case for points on a map).  This has been used with success by Aaron Koblin's <a href="http://www.aaronkoblin.com/work/flightpatterns/">Flight Patterns</a> as well as Stamen Design's <a href="http://cabspotting.org/timelapse.html">Cabspotting</a>.</p>
<p>Visualizing data as a flow is more than just aesthetically pleasing; <a href="http://ccom.unh.edu/vislab/PDFs/Ware_FlowTheory.pdf">recent work by Colin Ware </a> suggests it may be a better way to encode temporal data, given our eyes' natural aptitude for perceiving continuous contours.</p>
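<p>One simple way to smear time is to let each event's opacity decay with its age. The sketch below assumes an exponential fade with a half-life; the function names, the half-life, the cutoff and the sample events are illustrative choices, not how Flight Patterns or Cabspotting are actually built.</p>

```python
def ghost_opacity(event_time, now, half_life=5.0):
    """Opacity in [0, 1] that halves every half_life time units."""
    age = now - event_time
    if age >= 0:
        return 0.5 ** (age / half_life)
    return 0.0  # events from the future are not drawn yet


def visible_events(events, now, min_opacity=0.05):
    """Return (name, opacity) pairs still bright enough to draw."""
    faded = [(name, ghost_opacity(time, now)) for name, time in events]
    return [(name, opacity) for name, opacity in faded if opacity >= min_opacity]


# Hypothetical events as (name, time) pairs, rendered at time 25.
EVENTS = [("flight-1", 0.0), ("flight-2", 20.0), ("flight-3", 25.0)]
FRAME = visible_events(EVENTS, now=25.0)
```

<p>A renderer would call something like visible_events once per frame and draw each point at its returned opacity, so old events dim gradually and then drop out instead of vanishing between frames.</p>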
<h2> The Power of Story, The Joy of Discovery </h2>
<p>John Lasseter of Pixar has said "No amount of great animation will save a bad story."  Likewise, it isn't enough for data visualizations to look beautiful: to succeed, they must tell a compelling story.</p>
<p>In the case of dynamic, interactive visualizations, this can be a challenge.  Most possible states of a visualization are simply uninteresting.  The key for the information designer is to constrain exploration along paths that are most likely to yield insights.  Most data are sparse and long tailed, so curating and narrowing dimensions (don't let outliers warp your axes) can help restore some information density.</p>
<p>For an example of this, witness the New York Times <a href="http://www.nytimes.com/interactive/2010/01/10/nyregion/20100110-netflix-map.html"> visualization of Netflix Queues </a>.  Rather than a full choropleth of the United States, twelve metropolitan areas were preferentially selected, and a set of movies with distinct rental patterns were helpfully linked to at the top of the page.</p>
<p>Applying algorithms to reorder the data, such as placing similar data points together, can also help guide users towards discovering patterns.  The Disaster Strikes visualization includes a "cluster" button which executes a bivariate clustering algorithm in Javascript, revealing countries that have suffered similar kinds of disasters (Korea, Ecuador and Guatemala all suffer from tropical storms).</p>
<p>The upside of a visualization with an exploratory state space is that users experience the joy of discovery.  Jeff Heer's <a href="http://vis.stanford.edu/papers/senseus"> Sense.US </a> allowed users to explore US Census information over the past century and debate the meanings of discovered trends.</p>
<h2>Encoding State with a Stateless Protocol </h2>
<p>One important point in such visualizations is that state be encoded in the browser URL.  When you find something interesting, for example that 1996 was a bad year for bacterial outbreaks, you should be able to share it.  Updating the base URL is not wise, as it would require a painful reload of the page for each interaction.  The recommended approach is using a hashbang fragment (<a href="http://blog.benward.me/post/3231388630">but be careful</a>), which can be detected to set the visualization to the proper state.</p>
<p>The <em>Disaster Strikes</em> visualization does not surface state yet, but I expect it would have been implemented with more time.  We make extensive use of this technique in our internal dashboards.</p>
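<p>As a rough illustration of the idea, the sketch below stores visualization state in the URL fragment and reads it back.  It is written in Python rather than browser JavaScript, it uses a plain <code>#</code> fragment rather than a <code>#!</code> hashbang, and the state keys (<code>year</code>, <code>type</code>) and the example.com URL are hypothetical.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def encode_state(base_url: str, state: dict) -> str:
    """Return base_url with the state stored in the fragment (after '#')."""
    parts = urlsplit(base_url)
    # Sort keys so the same state always yields the same shareable URL.
    fragment = urlencode(sorted(state.items()))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, fragment))


def decode_state(url: str) -> dict:
    """Recover the state dictionary from a URL produced by encode_state."""
    fragment = urlsplit(url).fragment
    # parse_qs returns a list per key; keep the single value we stored.
    return {key: values[0] for key, values in parse_qs(fragment).items()}


# Hypothetical state: the year and disaster type the viewer has selected.
url = encode_state("http://example.com/disasters", {"year": "1996", "type": "bacterial"})
print(url)
print(decode_state(url))
```

<p>Because only the fragment changes, a browser would not reload the page; the visualization just needs to re-read the fragment on load and whenever it changes.</p>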
<h2>Progress, In Sight</h2>
<p>The field of interactive visualization is still nascent, but this weekend's event was testimony that static visualizations are likely to go the way of printed books.   Though "interactive" was just one of the four award categories, every single team I witnessed on Sunday showed what would be considered an interactive, dynamic visualization.</p>
<p><em> Some final notes about the competition:  The world disaster data, as well as other data sets that were eligible for use at the hackathon, are available at <a href="http://www.infochimps.com/tags/datainsightsf">this special InfoChimps page</a>.  One specific winner <a href="http://metamarkets.com.s140948.gridserver.com/wp-content/uploads/2011/06/barret_jambox.jpg">can be seen here</a>, and the full set of winners can be seen at the <a href="http://datainsightsf.com/">Data Insights home</a>.</em></p>
<p><em> Interested in creating data visualizations like this one?  <a href="http://www.metamarkets.com/jobs/"> Come join the Metamarkets Team</a>. </em></p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2011/the-rise-of-dynamic-data-visualization/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Druid, Part Deux: Three Principles for Fast, Distributed OLAP</title>
		<link>http://metamarkets.com/2011/druid-part-deux-three-principles-for-fast-distributed-olap/</link>
		<comments>http://metamarkets.com/2011/druid-part-deux-three-principles-for-fast-distributed-olap/#comments</comments>
		<pubDate>Fri, 20 May 2011 09:18:58 +0000</pubDate>
		<dc:creator>Eric Tschetter</dc:creator>
				<category><![CDATA[Druid]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://metamarketsgroup.com/blog/?p=382</guid>
		<description><![CDATA[In a previous blog post we introduced the distributed indexing and query processing infrastructure we call Druid. In that post, we characterized the performance and scaling challenges that motivated us to build this system in the first place. Here, we &#8230; <a href="http://metamarkets.com/2011/druid-part-deux-three-principles-for-fast-distributed-olap/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In a <a href="/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/"> previous blog post </a> we introduced the distributed indexing and query processing infrastructure we call Druid.  In that post, we characterized the performance and scaling challenges that motivated us to build this system in the first place.  Here, we discuss three design principles underpinning its architecture.</p>
<h2> 1. Partial Aggregates + In-Memory + Indexes => Fast Queries </h2>
<p>We work with two representations of our data: <em>alpha</em> represents the raw, unaggregated event logs, while <em>beta</em> is its partially aggregated derivative.  This <em>beta</em> is the basis against which all further queries are evaluated:</p>
<pre style="font-size: smaller;">
timestamp             publisher          advertiser  gender  country  impressions  clicks  revenue
2011-01-01T01:00:00Z  ultratrimfast.com  google.com  Male    USA      1800         25      15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA      2912         42      29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    UK       1953         17      17.31
2011-01-01T02:00:00Z  bieberfever.com    google.com  Male    UK       3194         170     34.01
</pre>
<p>This is the most compact representation that preserves the finest grain of data, while enabling on-the-fly computation of all O(2^n) possible dimensional roll-ups.</p>
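<p>To make the relationship between <em>alpha</em>, <em>beta</em> and roll-ups concrete, here is a minimal Python sketch.  It is not Druid's implementation: the event layout, the reduced set of dimensions (hour, publisher, country) and the function names are assumptions chosen for illustration.</p>

```python
from collections import defaultdict

# Hypothetical raw events (alpha), one row per impression:
# (hour, publisher, country, clicked, revenue)
alpha = [
    ("2011-01-01T01", "ultratrimfast.com", "USA", 1, 0.50),
    ("2011-01-01T01", "ultratrimfast.com", "USA", 0, 0.25),
    ("2011-01-01T01", "bieberfever.com", "USA", 0, 0.25),
    ("2011-01-01T02", "bieberfever.com", "UK", 1, 0.75),
]


def to_beta(events):
    """Sum metrics for every distinct (hour, publisher, country) combination."""
    beta = defaultdict(lambda: [0, 0, 0.0])  # impressions, clicks, revenue
    for hour, publisher, country, clicked, revenue in events:
        row = beta[(hour, publisher, country)]
        row[0] += 1
        row[1] += clicked
        row[2] += revenue
    return dict(beta)


def roll_up(beta, keep):
    """Re-aggregate beta, keeping only the dimension positions listed in keep."""
    result = defaultdict(lambda: [0, 0, 0.0])
    for dims, (impressions, clicks, revenue) in beta.items():
        row = result[tuple(dims[i] for i in keep)]
        row[0] += impressions
        row[1] += clicks
        row[2] += revenue
    return dict(result)


beta = to_beta(alpha)
by_publisher = roll_up(beta, keep=[1])  # position 1 is the publisher
grand_total = roll_up(beta, keep=[])    # drop every dimension
print(by_publisher)
print(grand_total)
```

<p>Any subset of the dimensions can be passed as <code>keep</code>, which is where the 2^n possible roll-ups come from: each is computed on demand from <em>beta</em> rather than stored.</p>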
<p>The key to Druid’s speed is maintaining the <em>beta</em> data entirely in memory.  Full scans are several orders of magnitude faster in memory than via disk.  What we lose in having to compute roll-ups on the fly, we make up for with speed.</p>
<p>To support drill-downs on specific dimensions (such as results for only ‘bieberfever.com’), we maintain a set of inverted indices. This allows for fast calculation (using AND &#038; OR operations) of rows matching a search query.  The inverted index enables us to scan a limited subset of rows to compute final query results – and these scans are themselves distributed, as we discuss next.<br />
 <span id="more-382"></span></p>
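<p>The inverted index idea can be sketched in a few lines of Python.  This is an illustration under assumptions, not Druid's code: plain Python sets stand in for whatever compact row-id structure a real system would use, and the column names are hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical beta rows, reduced to three dimensions.
columns = ("publisher", "gender", "country")
rows = [
    ("ultratrimfast.com", "Male", "USA"),
    ("bieberfever.com", "Male", "USA"),
    ("ultratrimfast.com", "Male", "UK"),
    ("bieberfever.com", "Male", "UK"),
]


def build_index(rows):
    """Map each (column, value) pair to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for column, value in zip(columns, row):
            index[(column, value)].add(row_id)
    return index


index = build_index(rows)

# AND is set intersection; OR is set union.
bieber_uk = index[("publisher", "bieberfever.com")].intersection(index[("country", "UK")])
usa_or_uk = index[("country", "USA")].union(index[("country", "UK")])
print(sorted(bieber_uk))
print(sorted(usa_or_uk))
```

<p>Only the row ids that survive the set operations need to be scanned to compute the final aggregates.</p>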
<h2> 2. Distributed Data + Parallelizable Queries => Horizontal Scalability </h2>
<p>Druid’s performance depends on having memory -- lots of it.  We achieve the requisite memory scale by dynamically distributing data across a cluster of nodes.  As the data set grows, we can horizontally expand by adding more machines.</p>
<p>To facilitate rebalancing, we take chunks of <em>beta</em> data and index them into segments based on time ranges.  For high-cardinality dimensions, distributing by time isn’t enough (we generally try to keep segments no larger than 20M rows), so we have introduced partitioning.  We store metadata about segments within the query layer and partitioning logic within the segment generation code.</p>
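<p>One way to picture time-based segments with a row cap is the Python sketch below.  The post does not describe the actual partitioning logic, so splitting an hour's rows into fixed-size chunks is purely an assumption for illustration, as are the function name and the tiny cap used in the example.</p>

```python
from collections import defaultdict


def build_segments(rows, max_rows):
    """Bucket rows by hour, then split each bucket into partitions of at most max_rows."""
    by_hour = defaultdict(list)
    for row in rows:
        hour = row[0][:13]  # "2011-01-01T01:00:00Z" becomes "2011-01-01T01"
        by_hour[hour].append(row)

    segments = {}
    for hour in sorted(by_hour):
        hour_rows = by_hour[hour]
        for start in range(0, len(hour_rows), max_rows):
            # Key each segment by its time bucket and partition number.
            segments[(hour, start // max_rows)] = hour_rows[start:start + max_rows]
    return segments


# Hypothetical rows: (timestamp, publisher); a cap of 2 stands in for ~20M.
rows = [
    ("2011-01-01T01:00:00Z", "ultratrimfast.com"),
    ("2011-01-01T01:00:00Z", "bieberfever.com"),
    ("2011-01-01T01:00:00Z", "example.org"),
    ("2011-01-01T02:00:00Z", "bieberfever.com"),
]
segments = build_segments(rows, max_rows=2)
for key, segment in segments.items():
    print(key, len(segment))
```

<p>Each (time bucket, partition) key identifies an independent unit that could be assigned to, or moved between, nodes.</p>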
<p>We persist these segments in a storage system (currently S3) that is accessible from all nodes.  If a node goes down, <a href="http://zookeeper.apache.org/">ZooKeeper</a> coordinates the remaining live nodes to reconstitute the missing beta set.</p>
<p>Downstream clients of the API are insulated from this rebalancing: Druid’s query API seamlessly handles changes in cluster topology.</p>
<p>Queries against the Druid cluster are perfectly horizontal.  We limit the aggregation operations we support to those that are inherently parallelizable: count, mean, variance and other parametric statistics.  While less parallelizable operations, such as median, are not supported, this limitation is offset by rich support of histogram and higher-order moment stores. The co-location of processing with in-memory data on each node reduces network load and dramatically improves performance.</p>
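<p>To see why count, mean and variance parallelize so easily, consider the Python sketch below: each node keeps a small partial aggregate (count, sum, sum of squares), and partials merge by simple addition.  This is an illustration, not Druid's code; the class and function names are hypothetical, and the sum-of-squares formula is the simplest choice rather than the most numerically stable one.</p>

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Partial aggregate: enough to recover count, mean and variance."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x: float) -> None:
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "Moments") -> "Moments":
        # Merging is plain addition, so partials can be combined in any order.
        return Moments(self.count + other.count,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Population variance: E[x^2] - (E[x])^2.
        return self.total_sq / self.count - self.mean ** 2


def summarize(values):
    """Scan one shard of data and return its partial aggregate."""
    moments = Moments()
    for value in values:
        moments.add(value)
    return moments


# Two hypothetical nodes each scan their own shard, then results are merged.
node_a = summarize([1.0, 2.0, 3.0])
node_b = summarize([4.0, 5.0])
combined = node_a.merge(node_b)
print(combined.count, combined.mean, combined.variance)
```

<p>A median has no such fixed-size partial state: the median of two shards cannot be computed from the two shard medians alone, which is why it is harder to distribute.</p>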
<p>This architecture provides a number of extra benefits:</p>
<ul>
<li> Segments are read-only, so the same segment can be served by multiple servers simultaneously.  If we have a hotspot in a particular index, we can replicate that index to multiple servers and load balance across them.</li>
<li> We can provide tiered classes of service for our data, with servers occupying different points in the "query latency vs. data size" spectrum.</li>
<li> Our clusters can span data center boundaries.</li>
</ul>
<h2> 3. Real-Time Analytics:  Immutable Past, Append-Only Future </h2>
<p>Our system for real-time analytics is centered, naturally, on time.  Because past events happen once and never change, they need not be re-writable.  We need only be able to append new events.</p>
<p>For real-time analytics, we have an event stream that flows into a set of real-time indexers.  These are servers that advertise responsibility for the most recent 60 minutes of data and nothing more.  They aggregate the real-time feed and periodically push an index segment to our storage system.  The segment then gets loaded into the memory of a standard server, and is flushed from the real-time indexer.</p>
<p>Similarly, for long-range historical data that we want to make available, but not keep hot, we have deep-history servers.  These use a memory mapping strategy for addressing segments, rather than loading them all into memory.  This provides access to long-range data while maintaining the high performance that our customers expect for near-term data.</p>
<h2> Summary </h2>
<p>Druid’s power resides in providing users fast, arbitrarily deep exploration of large-scale transaction data.  Queries over billions of rows that previously took minutes or hours to run can now be investigated directly with sub-second response times.</p>
<p>We believe that the performance, scalability, and unification of real-time and historical data that Druid provides could be of broader interest.  As such, we plan to open source our code base in the coming year.</p>
<p><em> Interested in tackling distributed systems challenges like this one?  <a href="http://www.metamarkets.com/jobs/"> Come join the Metamarkets Team</a>. </em></p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2011/druid-part-deux-three-principles-for-fast-distributed-olap/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hacking Hacker News Headlines</title>
		<link>http://metamarkets.com/2011/hacking-hacker-news-headlines/</link>
		<comments>http://metamarkets.com/2011/hacking-hacker-news-headlines/#comments</comments>
		<pubDate>Thu, 05 May 2011 08:39:03 +0000</pubDate>
		<dc:creator>Joe Reisinger</dc:creator>
				<category><![CDATA[fun]]></category>
		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://metamarketsgroup.com/blog/?p=256</guid>
		<description><![CDATA[One weekend a few months ago Vad [1] and I were hanging around the new Metamarkets office reading Hacker News.  We noticed something strange: two different headlines, both linking to identical content, resulted in dramatically different popularity ranks.  Do headlines matter so much? What &#8230; <a href="http://metamarkets.com/2011/hacking-hacker-news-headlines/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>One weekend a few months ago <a href="http://vadim.ogievetsky.com/">Vad</a> [1] and I were hanging around the new Metamarkets office reading <a href="http://news.ycombinator.com/">Hacker News</a>.  We noticed something strange: two different headlines, both linking to identical content, resulted in dramatically different popularity ranks.  Do headlines matter so much? What drives observed popularity?</p>
<p>We started to investigate.</p>
<p><a href="http://hn.metamx.com"><img src="http://hn.metamx.com/mx_hacker_news_static_small.jpg" alt="pretty" /></a><br />
(Above: Rolling 10 days of article ranks. <a href="http://hn.metamx.com">Click for an interactive version.</a>)<br />
<span id="more-256"></span><br />
The right way to answer this question was pretty obvious: crunch the data.  We started scraping HN titles along with article ranks and fed the resulting data into our online feature learning stack.</p>
<p>Below is the distilled summary of the result, our "Top Ten Hacker News Headline Hacks," including feature weight, standard error and p-value vs. zero. Positive weight means the feature is predictive of high article rank.</p>
<p><strong>Hack #1:  Maximize Controversy</strong></p>
<blockquote><p><span style="color: #339966;">1.4 ± 0.5</span> [p&lt;1e-5] <strong>| essential</strong><br />
<span style="color: #339966;"> 1.3 ± 0.5</span> [p&lt;1e-5] <strong>could</strong><br />
<span style="color: #339966;"> 1.2 ± 0.4</span> [p&lt;1e-5] <strong>problem</strong><br />
<span style="color: #339966;"> 1.3 ± 0.8</span> [p&lt;1e-5] <strong>survived the</strong><br />
<span style="color: #339966;"> 1.0 ± 0.5</span> [p&lt;1e-5] <strong>controversy</strong><br />
<span style="color: #339966;"> 0.9 ± 0.3</span> [p&lt;1e-5] <strong>impossible</strong></p></blockquote>
<p><strong>Hack #2:  Question Authority</strong></p>
<blockquote><p><span style="color: #339966;">0.7 ± 0.2</span> [p&lt;1e-5]<strong> why ____ future</strong><br />
<span style="color: #339966;"> 0.4 ± 1.0</span> [p=0.2] <strong>the ____ behind</strong><br />
<span style="color: #339966;"> 0.2 ± 0.3</span> [p=0.04]<strong> why don't</strong><br />
<span style="color: #339966;"> 0.1 ± 0.3</span> [p=0.06] <strong>| lessons</strong></p></blockquote>
<p><strong>Hack #3:  Avoid False Promises</strong></p>
<blockquote><p><span style="color: #ff0000;">-1.5 ± 0.8 </span>[p&lt;1e-5] <strong>tricks</strong><br />
<span style="color: #ff0000;"> -0.7 ± 0.5</span> [p&lt;1e-5]<strong> the world |</strong><br />
<span style="color: #ff0000;"> -0.7 ± 0.2</span> [p&lt;1e-5] <strong>the greatest</strong><br />
<span style="color: #ff0000;"> -0.6 ± 0.3</span> [p&lt;1e-5] <strong>awesome</strong><br />
<span style="color: #ff0000;"> -0.6 ± 0.7</span> [p=0.003] <strong>anatomy of a</strong><br />
<span style="color: #ff0000;"> -0.5 ± 0.3</span> [p&lt;1e-5]<strong> guide to</strong></p></blockquote>
<p><strong>Hack #4:  Short is Sweet</strong></p>
<blockquote><p><span style="color: #ff0000;">-0.3 ± 0.04</span> [p&lt;1e-5]<strong> {# WORDS}</strong></p></blockquote>
<p><strong>Hack #5:  Execution not Ideas</strong></p>
<blockquote><p><span style="color: #339966;">2.6 ± 2.1 </span>[p&lt;1e-5] <strong>showing</strong><br />
<span style="color: #339966;"> 1.5 ± 0.7</span> [p&lt;1e-5] <strong>| building</strong><br />
<span style="color: #339966;"> 0.6 ± 0.3 </span>[p&lt;1e-5] <strong>makes</strong><br />
<span style="color: #339966;"> 0.5 ± 0.4</span> [p&lt;1e-5] <strong>starting a company</strong><br />
<span style="color: #339966;"> 0.3 ± 0.3</span> [p&lt;1e-3] <strong>join a startup</strong></p>
<p><span style="color: #ff0000;">-1.1 ± 0.3</span> [p&lt;1e-5] <strong>ideas</strong><br />
<span style="color: #ff0000;"> -1.1 ± 0.3</span> [p&lt;1e-5]<strong> idea?</strong></p></blockquote>
<p><strong>Hack #6:  Everybody Loves a Winner</strong></p>
<blockquote><p><span style="color: #339966;">1.7 ± 0.7</span> [p&lt;1e-5] <strong>| ____ acquires</strong><br />
<span style="color: #339966;"> 0.5 ± 0.3</span> [p&lt;1e-5] <strong>hire</strong><br />
<span style="color: #339966;"> 0.4 ± 0.7</span> [p=0.02] <strong>worth</strong></p></blockquote>
<p><strong>Hack #7:  Everybody Loves Data</strong></p>
<blockquote><p><span style="color: #339966;">1.9 ± 1.8</span> [p&lt;1e-4]<strong> data |</strong><br />
<span style="color: #339966;"> 0.6 ± 0.8</span> [p=0.004] <strong>data -</strong><br />
<span style="color: #339966;"> 0.5 ± 0.1</span> [p&lt;1e-5] <strong>visualize data in</strong></p>
<p><span style="color: #ff0000;">-1.3 ± 0.7</span> [p&lt;1e-5]<strong> algorithm</strong></p></blockquote>
<p><strong>Hack #8: Nobody Cares About You</strong></p>
<blockquote><p><span style="color: #ff0000;">-0.2 ± 0.3</span> [p=0.008]<strong> my startup</strong><br />
<span style="color: #ff0000;"> -0.9 ± 0.2</span> [p&lt;1e-5] <strong>silicon valley</strong></p></blockquote>
<p><strong>Hack #9:  Some Topics are Just Miserable</strong></p>
<blockquote><p><span style="color: #ff0000;">-0.4 ± 0.3</span> [p&lt;1e-5] <strong>angry birds</strong><br />
<span style="color: #ff0000;"> -0.2 ± 0.1</span> [p&lt;1e-5] <strong>harry potter</strong><br />
<span style="color: #ff0000;">-0.5 ± 0.4</span> [p&lt;1e-4] <strong>taxes</strong><br />
<span style="color: #ff0000;">-1.5 ± 1.0</span> [p&lt;1e-5] <strong>downtime</strong></p></blockquote>
<p><strong>Hack #10:  Social is For Losers</strong></p>
<blockquote><p><span style="color: #ff0000;">-0.6 ± 0.9</span> [p=0.007] <strong>social</strong><br />
<span style="color: #ff0000;"> -0.5 ± 0.4</span> [p&lt;1e-4] <strong>gamification</strong><br />
<span style="color: #ff0000;"> -0.3 ± 0.6 </span>[p=0.04] <strong>twitter |</strong><br />
<span style="color: #ff0000;"> -2.4 ± 1.5</span> [p&lt;1e-5] <strong>airbnb</strong></p></blockquote>
<p><strong>Standard disclaimer</strong>: the above coefficients are provided for entertainment purposes only. Feature interactions in text are a bitch. Correlation does not imply causation. Past performance does not guarantee future success.</p>
<p><strong>How We Did It</strong></p>
<p>We extracted n-gram (e.g., “Harry Potter”, “Google”, “Silicon Valley”) and skip features (e.g., “a ____ for”, “| ____ acquires”) for each title, including start- and end-of-sentence markers and optionally punctuation. For learning we used boosted stochastic gradient descent with logistic loss [2], predicting whether the article made it to the top 20 or not during its observed lifetime. Strong regularization was used to eliminate spurious features, and twenty bootstrap replicates were used to measure significance of coefficients and classification accuracy.</p>
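<p>For the curious, feature extraction along these lines might look like the Python sketch below.  It is a guess at the general shape, not our production code: it assumes whitespace tokenization, lowercasing, n-grams up to length two, skip features with exactly one blanked token, and "|" as the start- and end-of-title marker.</p>

```python
def extract_features(title: str, max_n: int = 2) -> set:
    """Return n-gram and skip features for one headline."""
    # "|" marks the start and end of the title.
    tokens = ["|"] + title.lower().split() + ["|"]
    features = set()

    # Contiguous n-grams of length 1..max_n.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))

    # Skip features: a trigram with its middle token blanked out.
    for i in range(len(tokens) - 2):
        features.add(f"{tokens[i]} ____ {tokens[i + 2]}")

    features.discard("|")  # a lone boundary marker carries no information
    return features


# Hypothetical headline used only to exercise the function.
features = extract_features("Google acquires Startup")
print(sorted(features))
```

<p>Each feature would then become one binary input to the classifier, so a title is represented by the set of features it contains.</p>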
<p>For this untuned, first-pass model, we achieved 64% classification accuracy on a held-out set over the past two months. Positive predictive value was 25.7%, negative predictive value was 73.1%, sensitivity was 18.2% and specificity was 80.9% [3]. Despite this weak predictive power, we found some interesting correlations, more of which we'll release as the model improves.</p>
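<p>As a reminder of how these diagnostics relate to one another, the Python sketch below computes all five from the four cells of a confusion matrix.  The counts in the example are made up for illustration and are not the data behind the figures above.</p>

```python
def diagnostics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification diagnostics from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Made-up counts for illustration only; these are not the post's data.
result = diagnostics(tp=20, fp=60, tn=240, fn=80)
print(result)
```

<p>Low sensitivity alongside high specificity, as in our results, describes a model that rarely predicts "top 20" and misses most of the articles that do get there.</p>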
<p>[1] Of <a href="http://www.koalastothemax.com/">Koalas to the Max</a> fame.<br />
[2] Think: <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki">wabbit style</a>.<br />
[3] <a href="http://en.wikipedia.org/wiki/Positive_predictive_value">Predictive diagnostics.</a></p>
]]></content:encoded>
			<wfw:commentRss>http://metamarkets.com/2011/hacking-hacker-news-headlines/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

