Here at Metamarkets, we help our customers quickly make sense of big data sets. To put our solution to the test, I thought it would be interesting to analyze events surrounding the Wikipedia blackout on January 18th in protest of SOPA.
Let's look at edit activity before and after the blackout:
Yesterday evening Romy Misra from visual.ly invited us to teach an introductory workshop to R for the San Francisco Data Mining meetup. Todd Holloway was kind enough to host the event at Trulia headquarters.
R can be a little daunting for beginners, so I wanted to give everyone a quick overview of its capabilities and enough material to get people started. Most importantly, the objective of this interactive session was to give everyone some time to try out some simple examples that would be useful in the future. (more...)
"Give me a lever long enough... and I shall move the world" — Archimedes
Parallelism is computing’s leverage, a force multiplier acting against the weight of big data. Cloud-hosted, horizontally scalable systems have the power to move even planetary sized data sets with speed.
This blog post discusses our efforts to lift one such data set, achieving a scan rate of 26 billions records per second, with our distributed, in-memory data store called Druid. Our main conclusions are:
Benchmarking our infrastructure against a big data set in the wild provides validation of the power achievable on a Cloud computing fabric of commodity hardware.
For those who are curious as to what our infrastructure powers, Metamarkets offers a SaaS analytics solution to gaming, social, and digital media firms. A public example is our dashboard for exploring Wikipedia edits.
I) The Data
We began our experiment with 6TB of uncompressed data, representing tens of billions of fact rows, which we aimed to host and make fully explorable through our dashboard. By way of comparison, the Wikipedia edit feed we host consists of 6GB of uncompressed data, representing ~36 million fact rows.
(more...)
Last week kicked off the first in a series of “Data Salons” we are holding here at Metamarkets. The goal, as Michael Driscoll put it, is “to bring people together and talk about cool stuff, and keep it small”. This is something we had been thinking about doing for a while and thanks to the overwhelming response from everyone involved, it was a real success.
We had a great lineup of speakers for the first topic in the series: data visualization. Following our post on the rise of interactive data visualization, we decided to bring together some of the people designing visualizations as well as the people behind the frameworks used to build them, so they could share with us some of the projects they are working on and how they approach the problems they are trying to solve.
People working in data visualization come from various different backgrounds and it is interesting to see how they embrace the engineering and design challenges involved. We see engineers becoming designers as well as designers embracing the engineering side of data visualization. At many levels it is both an art and a science, and the variety of people who attended the salon are a great example of that.
Mary Becica described how her architecture background influences how she approaches data visualization problems. She starts by putting the data into context, and letting that context inform the visual representation to give the data. Too often people start with a preconceived visual without giving extra thought to what form the data should take.
Milestones
Mike Driscoll and I founded Metamarkets in 2009 to provide large scale analytics to media companies, and I’m very gratified we have reached our objectives. Over the past 18 months, Metamarkets has built and shipped some of the most scalable, cutting-edge data analytics infrastructure in the marketplace. Our engineering and product teams have designed, built and deployed from scratch a cloud-hosted integrated analytics stack on behalf of market-leading, web-scale media businesses.
As we have pushed into the market, we have seen our product scale to accommodate the biggest, fastest-moving event sets on the internet. The core value proposition for our product has been validated through our customer set: the Metamarkets stack eliminates the need for a business to integrate multiple disparate software solutions at the data ingestion, database, analytics and visualization layers. This is something revolutionary for the data analytics industry.
The speed, scalability and usability advantages of our integrated platform are apparent to our customers. Our goal is to deliver up-to-the-minute, quantitative, operational intelligence about a business’s transaction streams, at a scale previously incomprehensible, at a cost effective price point, and for a broad customer set. In short, it’s been our goal to enable our partners to interact with their critical transaction events data when, where and how they choose; as the first thing they check when they wake up; as the last thing they check before going to sleep. Our mobile and tablet product now enable our customers to analyze their data in the middle of the night.
Without a doubt, our team’s stunning product achievements in 2011 are the accomplishment of which I’m most proud. We have been true to our original vision, and now lead the market for hosted, web-scale Business Intelligence for the global media and advertising industries. Now that Metamarkets has established its offer in an initial set of verticals, we can turn to address the broader emergent market opportunity, one that both Mike and I can honestly state is far larger than our initial conception when we launched the company a couple of years ago.
There's an unspoken truth lurking behind the scourge of Big Data and the heralding of Hadoop as its savior: while Hadoop shines as a processing platform, it is painfully slow as a query tool.
Hive was developed by the folks at Facebook in 2008, as a means of providing an easy-to-use, SQL-like query language that would compile to MapReduce code. A year later, Hive was responsible for 95% of the Hadoop jobs run on Facebook's servers. This is consistent with another observation made by Cloudera's Jeff Hammerbacher: when Hive is installed on a client's Hadoop cluster, its overall usage increases tenfold.
That data-heavy businesses can achieve visibility into the terabytes of logs that they generate is, at a primary level, a major step forward. Before the Hadoop era, this was difficult to impossible without a major engineering investment. Thus Hadoop has solved the challenge of economically processing data at scale. Hive has solved the challenge of hand-writing Hadoop queries.
But there remains a painful challenge that Hive and Hadoop does not solve for: speed.
A Powerful But Lumbering Elephant
(more...)
One of the key analytics products we provide is robust time-series forecasting over multidimensional faceted data, which requires giving users access to a number of predictions exponential in the base data dimensionality. For online publishers, such forecasting includes simple metrics like impressions, revenue and actions, and more generally we might want to model CTR or more complex metrics like the opportunity cost of showing house ads versus sponsorship ads.
Existing forecasting systems tend to exist inside of traditional OLAP storage offerings that embed a limited set of statistical functions, many of which are poorly optimized and slow to execute. The machine learning renaissance of the last decade has been tempered by powerful additions from the algorithms and distributed systems communities: compressed sensing, feature hashing and feature sharding across multiple cores have significantly lowered computational costs, while advances in regularization and structured sparsity from applied statistics have likewise increased model robustness, allowing unprecedented modeling of high-degree feature interactions.
In this post I'll outline some of the engineering challenges behind putting a large-scale machine learning system into production, focusing less on actual learning algorithm implementations (as these are fairly commoditized at this point: mahout, vowpal wabbit, scikits.learn, etc) and instead addressing our engineering architecture. In particular we'll look at time-series forecasting, separate from more generic prediction, where our end goal is to create forecasts where nothing interesting occurs.*
(more...)
The visualization below highlights something only recently possible on the web: a dynamic, interactive canvas. Titled "Disaster Strikes: A World In Sight", it visualizes a century of floods, fires, droughts, and earthquakes around the globe. (Below is a snapshot of 1996, an apparently costly year for disasters).
It's not a passively animated graphic, but one that users can actively engage with, freezing or pivoting dimensions to reveal new views of the data. It's a harbinger of a new class of documents, which digital publishers are beginning to embrace, to provide a richer information experience for readers.
In a previous blog post we introduced the distributed indexing and query processing infrastructure we call Druid. In that post, we characterized the performance and scaling challenges that motivated us to build this system in the first place. Here, we discuss three design principles underpinning its architecture.
We work with two representations of our data: alpha represents the raw, unaggregated event logs, while beta is its partially aggregated derivative. This beta is the basis against which all further queries are evaluated:
2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70 2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18 2011-01-01T02:00:00Z ultratrimfast.com google.com Male UK 1953 17 17.31 2011-01-01T02:00:00Z bieberfever.com google.com Male UK 3194 170 34.01
This is the most compact representation that preserves the finest grain of data, while enabling on-the-fly computation of all O(2^n) possible dimensional roll-ups.
The key to Druid’s speed is maintaining the beta data entirely in memory. Full scans are several orders of magnitude faster in memory than via disk. What we lose in having to compute roll-ups on the fly, we make up for with speed.
To support drill-downs on specific dimensions (such as results for only ‘bieberfever.com’), we maintain a set of inverted indices. This allows for fast calculation (using AND & OR operations) of rows matching a search query. The inverted index enables us to scan a limited subset of rows to compute final query results – and these scans are themselves distributed, as we discuss next.
(more...)
One weekend a few months ago Vad [1] and I were hanging around the new Metamarkets office reading Hacker News. We noticed something strange: two different headlines, both linking to identical content, resulted in dramatically different popularity ranks. Do headlines matter so much? What drives observed popularity?
We started to investigate.

(Above: Rolling 10 days of article ranks. Click for an interactive version.)
(more...)