Four Lessons for Building A Petabyte Data Platform

In the last year, we at Metamarkets have built a big data stack for processing, analyzing, and visualizing nearly one trillion ad pricing events from our partners around the globe.

In this post I’ll share some of the thinking behind our choices for the Big Data stack that powers our petabyte platform. It consists of three layers: (i) a processing and storage substrate based around Hadoop and HBase, (ii) an analytics engine that mixes R, Python, and Pig, and (iii) a visualization console and data API built principally in Javascript.

Here are four guiding principles:

1. Experiment Often, Fail Fast

Building software in a start-up environment is rife with uncertainty, which can be dealt with in one of two ways: (i) deep theoretical analysis, which consists of debating the merits of solutions you don’t fully understand, or (ii) bold empiricism, which consists of implementing solutions you don’t fully understand. In every case, the latter approach, actually doing something, is the one to favor.

When we were setting up our original data layer, we experimented with Vertica, InfoBright, and Greenplum for relational data stores, and chose Greenplum. When we found ourselves pushing the limits of performance on our Greenplum box (we were running the free, community edition on a multi-core box), we knew we had to “fail fast.” We re-architected our backend around a non-relational approach. We experimented with both Cassandra and HBase, and eventually chose HBase.

When we first built our front-end console, we wanted dynamic, event-driven graphics. We experimented with embedded Flash charts (AmCharts), with an R server that generated static PNGs, and finally with a Javascript framework that renders SVGs client-side (Protovis). We ultimately chose Protovis.

We’ve faced uncertainty at every level in our stack, but the answer in each case has been: embrace experimentation, learn from failure, and keep moving forward.

2. To Scale: Keep It Simple

What’s trite is often true, and few mantras in engineering are more repeated or true than ‘keep it simple.’

Simple systems are harder to design, because you have to decide what to discard, but they are easier to build. And because they are smaller, they have fewer points of failure.

When we first started designing our HBase data store, we had a choice to make about how we stored keys: we could store them as full text (e.g. “Category:Entertainment,Country:United States”), or we could store the much shorter ids (“Category:101,Country:31”), and do the key lookups later.

We opted for the latter approach and stored only the ids. This required building and maintaining an entirely separate meta-store containing the label lookups, and it meant the data in our database wasn’t easily inspectable.

Eventually, we realized the added complexity wasn’t worth it, and we now store the full text labels directly in HBase.

In my experience, encoding text data into obscure but optimized binary formats, or into numeric ids, is rarely worth the added complexity and the lost visibility.
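To make the trade-off concrete, here is a minimal sketch in Python of the two key schemes; the field names and lookup tables are illustrative, not our actual schema:

# Option 1: store readable labels directly in the row key.
def text_key(category, country):
    return "Category:%s,Country:%s" % (category, country)

# Option 2: store numeric ids, which forces you to build and maintain
# a separate meta-store (here, just dicts) to translate on every read and write.
CATEGORY_IDS = {"Entertainment": 101}
COUNTRY_IDS = {"United States": 31}

def id_key(category, country):
    return "Category:%d,Country:%d" % (CATEGORY_IDS[category], COUNTRY_IDS[country])

print(text_key("Entertainment", "United States"))  # Category:Entertainment,Country:United States
print(id_key("Entertainment", "United States"))    # Category:101,Country:31

The second form saves bytes, but every tool that touches the table now needs the lookup tables just to tell you what a row means.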

In Founders at Work, Delicious.com founder Joshua Schachter spoke about simplicity:

Livingston: ‘What is your favorite bit of advice you’d give to a technical person who wanted to start a start-up?’

Schachter: ‘Reduce. Do as little as possible to get what you have to get done. Do less of it; get it done… Doing less is so important.’

3. To Adapt: Keep it Modular

One thing that all large-scale systems share — whether organisms, organizations, or technology architectures — is that they are built of composable pieces, or modules, with standard interfaces between them.

Modularity is a means to manage complexity in systems, but it also provides flexibility.

When we began building our data layer, we anticipated that the underlying implementation would shift away from a relational system that spoke SQL. So Eric Tschetter developed what we call ‘standard query URL’, or sqURLly for short. It is a server that takes GET requests that look like this:

http://squrl.metamarketsgroup.net/query/2010-10-08_2010-10-08T20;publisher['fox'],advertiser['billy'];[["volume" (hourly + impressions)] ["revenue" (hourly + revenue)]]

and returns a JSON response like this:

[ {"timestamp": "2010-10-08T01:00:00.000Z", "publisher":"fox", "advertiser":"billy", "volume":"27", "revenue":"7.32"},
{"timestamp": "2010-10-08T02:00:00.000Z", "publisher":"fox", "advertiser":"billy", "volume":"2", "revenue":"483939.32"}, ... ]

We’ve made several changes to the underlying implementation of our data layer, from Greenplum to HBase, but the services that depend on it never noticed: this RESTful interface has remained constant.
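As an illustration of how thin that dependency is, a consumer of the API can be written against nothing but the URL format and the JSON it returns. The sketch below is hypothetical client code, using the endpoint and query shown above, and is one way such a client might look in Python:

import json
import urllib.parse
import urllib.request

BASE = "http://squrl.metamarketsgroup.net/query/"
QUERY = ("2010-10-08_2010-10-08T20;publisher['fox'],advertiser['billy'];"
         '[["volume" (hourly + impressions)] ["revenue" (hourly + revenue)]]')

def fetch(query):
    # Percent-encode the query segment, issue a plain GET, and parse the JSON body.
    url = BASE + urllib.parse.quote(query, safe="")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

for row in fetch(QUERY):
    print(row["timestamp"], row["publisher"], row["volume"], row["revenue"])

Nothing in this client knows, or cares, whether Greenplum or HBase is answering on the other side.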

We have taken this RESTful approach with pieces of our analytics architecture as well. Inspired by the successes of Ruby on Rails, Ruby’s Sinatra, and Python’s Django framework, we are working with a team to develop Raconteur, a web framework for the R programming language. Raconteur builds on the success of Jeff Horner’s rApache, an Apache 2.0 module that embeds the R interpreter inside Apache.

This has allowed us to access R’s powerful predictive analytics, such as clustering routines and ARIMA time-series models, through simple, web-based API calls.
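To give a flavor of what this looks like from the consumer’s side, the sketch below posts a time series to an R handler and reads back a forecast. The host, route, and parameter names are invented for illustration; they are not the actual Raconteur API:

import json
import urllib.request

def forecast(series, horizon=24):
    # POST the series to an R handler (e.g. one wrapping R's arima/forecast
    # functions behind rApache) and return its JSON predictions.
    body = json.dumps({"series": series, "horizon": horizon}).encode("utf-8")
    req = urllib.request.Request(
        "http://analytics.example.com/arima/forecast",  # hypothetical endpoint
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# forecast(hourly_impressions, horizon=24) would return the next day of predicted values.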

4. Avoid Doing “Undifferentiated Heavy Lifting”

Jeff Bezos tells a story of a Belgian brewery that, at the end of the 19th century, had its own power generator. Yet making power was unlikely to be the source of competitive advantage in making beer. As soon as electrical grids matured, these on-site generators became a relic of the past. Power generation was what Bezos calls “undifferentiated heavy lifting.”

Likewise in the technology realm, cloud computing services like Amazon’s have allowed start-ups like ours leasable, scalable access to storage and compute power. We’re currently running O(100) machines in EC2, including our HBase cluster, and may reach O(1000) by the end of this year.

But the principle can be applied more generally, across all of one’s technology needs. We pay for a lot of software as a service: Github, Pivotal Tracker, DropBox, Artifactory Online. We could set up any of these services ourselves, locally, but we’d rather put our precious engineering resources elsewhere.

We also borrow amply and judiciously from open source projects. For the vast majority of pieces in our platform, from operating systems (Ubuntu) and databases (HBase) to monitoring tools (Ganglia) and web frameworks (Django and now node.js), there have been adequate open-source solutions for us.

As we grow, we expect that we may need to have greater control over our server layouts and migrate off of Amazon’s cloud, or that some of the open-source tools we’re using will be insufficient for our needs. But those decisions will be made with an eye towards what differentiates us and serves our customers.
