Here at Metamarkets we have developed a web-based analytics console that supports drill-downs and roll-ups of high dimensional data sets – comprising billions of events – in real-time. This is the first of two blog posts introducing Druid, the data store that powers our console. Over the last twelve months, we tried and failed to achieve scale and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase). So instead we did something crazy: we rolled our own database. Druid is the distributed, in-memory OLAP data store that resulted.
The Challenge: Fast Roll-Ups Over Big Data
To frame our discussion, let’s begin with an illustration of what our raw impression event logs look like, containing many dimensions and two metrics (click and price).
timestamp publisher advertiser gender country .. dimensions .. click price 2011-01-01T01:01:35Z bieberfever.com google.com Male USA 0 0.65 2011-01-01T01:03:63Z bieberfever.com google.com Male USA 0 0.62 2011-01-01T01:04:51Z bieberfever.com google.com Male USA 1 0.45 ... 2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53 ...
We call this our alpha data set. We perform a first-level aggregation operation over a selected set of dimensions, equivalent to (in pseudocode):
GROUP BY timestamp, publisher, advertiser, gender, country
:: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)
to yield a compacted version:
timestamp publisher advertiser gender country impressions clicks revenue 2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70 2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18 2011-01-01T02:00:00Z ultratrimfast.com google.com Male UK 1953 17 17.31 2011-01-01T02:00:00Z bieberfever.com google.com Male UK 3194 170 34.01
This is our beta data set, filtered for five selected dimensions and compacted. In the limit, as we group by all available dimensions, the size of this aggregated beta converges to the original alpha . In practice, it is dramatically smaller (often by a factor of 100). Our beta data comprises three distinct parts:
Timestamp column: We treat timestamp separately because all of our queries center around the time axis. Timestamps are faceted by varying granularities (hourly, in the example above).
Dimension columns: Here we have four dimensions of publisher, advertiser, gender, and country. They each represent an axis of the data that we’ve chosen to slice across.
Metric columns: These are impressions, clicks and revenue. These represent values, usually numeric, which are derived from an aggregation operation – such as count, sum, and mean (we also run variance and higher moment calculations). For example, in the first row, the revenue metric of 15.70 is the sum of 1800 event-level prices.
Our goal is to rapidly compute drill-downs and roll-ups over this data set. We want to answer questions like “How many impressions from males were on bieberfever.com?” and “What is the average cost to advertise to women at ultratrimfast.com?” But we have a hard requirement to meet: we want queries over any arbitrary combination of dimensions at sub-second latencies.
Performance of such a system is dependent on the size of our beta set, and there are two ways that this becomes large: (i) when we include additional dimensions, and (ii) when we include a dimension whose cardinality is large. Using our example, for every hour’s worth of data we calculate the maximum number of rows as:
number_of_publishers * number_of_advertisers * number_of_genders * number of countries
If we have 10 publishers, 50 advertisers, 2 genders, and 120 countries, that would yield a maximum of 120,000 rows. If there had been 1,000,000 possible publishers, it would become a maximum of 12 billion rows. If we add 10 more dimensions of cardinality 10, then it becomes a maximum of 1.2 quadrillion (1.2 x 10^15) rows.
Luckily for us, these data sets are generally sparse, as dimension values are not conditionally independent (few Kazakhstanis visit beiberfever.com, for example). Thus the combinatorial explosion is far less than the theoretical worst-case. Nonetheless, as a rule, more dimensions and more cardinality dramatically inflate the size of the data set.
Failed Solution I: Dynamic Roll-Ups with a RDBMS
Our stated goals of aggregation and drill-down are well suited to a classical relational architecture. So about a year ago, we fired up a RDBMS instance (actually, the Greenplum Community Edition, running on an m1.large EC2 box), and began loading our data into it. It worked and we were able to build the initial version of our console on this system. However, we had two problems:
- We stored the data in a star schema, which meant that there was operational overhead maintaining dimension and fact tables.
- Whenever we needed to do a full table scan, for things like global counts, the queries ran slow. For example, naive benchmarks showed scanning 33 million rows took 3 seconds.
We initially just decided to eat the operational overhead of (1) because that’s how these systems work and we benefited from having the database to do our storage and computation. But, (2) was painful. We started materializing all dimensional roll-ups of a certain depth, and began routing queries to these pre-aggregated tables. We also implemented a caching layer in front of our queries.
This approach generally worked and is, I believe, a fairly common strategy in the space. Except, when things weren’t in the cache and a query couldn’t be mapped to a pre-aggregated table, we were back to full scans and slow performance. We tried indexing our way out of it, but given that we are allowing arbitrary combinations of dimensions, we couldn’t really take advantage of composite indexes. Additionally, index merge strategies are not always implemented, or only implemented for bitmap indexes, depending on the flavor of RDBMS.
We also benchmarked plain Postgres, MySQL, and InfoBright, but did not observe dramatically better performance. Seeing no path ahead for our relational database, we turned to one of those new-fangled, massively scalable NOSQL solutions.
Failed Solution II: Pre-compute the World in NoSQL
We used a data storage schema very similar to Twitter’s Rainbird.
In short, we took all of our data and pre-computed aggregates for every combination of dimensions. At query time we need only locate the specific pre-computed aggregate and and return it: an O(1) key-value lookup. This made things fast and worked wonderfully when we had a six dimension beta data set. But when we added five more dimensions – giving us 11 dimensions total – the time to pre-compute all aggregates became unmanageably large (such that we never waited more than 24 hours required to see it finish).
So we decided to limit the depth that we aggregated to. By only pre-computing aggregates of five dimensions or less, we were able to limit some of the exponential expansion of the data. The data became manageable again, meaning it only took about 4 hours on 15 machines to compute the expansion of a 500k beta rows into the full multi-billion entry output data set.
Then we added three more dimensions, bringing us up to 14. This turned into 9 hours on 25 machines. We realized that we were doing it wrong.
Lesson learned: massively scalable counter systems like rainbird are intended for high cardinality data sets with pre-defined hierarchical drill-downs. But they break down when supporting arbitrary drill downs across all dimensions.
Introducing Druid: A Distributed, In-Memory OLAP Store
Stepping back from our two failures, let’s examine why these systems failed to scale for our needs:
- Relational Database Architectures
- Full table scans were slow, regardless of the storage engine used
- Maintaining proper dimension tables, indexes and aggregate tables was painful
- Parallelization of queries was not always supported or non-trivial
- Massive NOSQL With Pre-Computation
- Supporting high dimensional OLAP requires pre-computing an exponentially large amount of data
Looking at the problems with these solutions, it looks like the first, RDBMS-style architecture has a simpler issue to tackle: namely, how to scan tables fast? When we were looking at our 500k row data set, someone remarked, “Dude, I can store that in memory”. That was the answer.
Keeping everything in memory provides fast scans, but it does introduce a new problem: machine memory is limited. The corollary thus was: distribute the data over multiple machines. Thus, our requirements were:
- Ability to load up, store, and query data sets in memory
- Parallelized architecture that allows us to add more machines in order to relieve memory pressure
And then we threw in a couple more that seemed like good ideas:
- Parallelized queries to speed up full scan processing
- No dimensional tables to manage
These are the requirements we used to implement Druid. The system makes a number of simplifying assumptions that fit our use case (namely that all analytics are time-based) and integrates access to real-time and historical data for a configurable amount of time into the past.
The next installment will go into the architecture of Druid, how queries work and how the system can scale out to handle query hotspots and high cardinality data sets. For now, we leave you with a benchmark:
- Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds.
Looking for more Druid information? Learn more about our core technology.
Interested in tackling engineering challenges like this one? Come join the Metamarkets Team.
21 Comments on “Introducing Druid: Real-Time Analytics at a Billion Rows Per Second”
sean said on April 30, 2011:
"When we were looking at our 500k row data set, someone remarked, 'Dude, I could store that in memory'. That was the answer."
You could have used memcached for this, which is already fast, distrivbuted, etc. Or Redis. Or Riak. Or any other good key=>value store.
jtregunna said on April 30, 2011:
Very good writeup. Is this something you're planning on releasing to the public at some point?
Ben said on April 30, 2011:
How does your solution compare to what people at VoltDB are doing ?
Jay said on April 30, 2011:
So, where is it?
Tim said on April 30, 2011:
Did you guys look into Solr with its high performance search and faceting functionality? from what I understand about the problem you were trying to solve here (smaller data set, multi dimensional faceting/filtering, flexible data schema) - Solr's could have been a quite decent option. It's orders of magnitude faster than relational databases, scales quite well (to a reason), and is very simple to manage. I find it a good substitute for high performance data analytics without all the hassle and cost of other solutions. And that it does tf-idf for those pesky keyword searches is just an added bonus
Also, Solr's results caching is in my opinion unmatched...
Etrain said on May 1, 2011:
Sean, you've clearly never worked with serious analytics. The data transfer required to do this kind of arbitrary aggregation in a bullshit insert/lookup-only kv store is laughably huge.
Really looking forward to what these guys have come up with. Would really like to see a disk-backed version of this.
Alice said on May 1, 2011:
When you say 1B rows, is that the number of inserts per second? Or merely the size of the corpus being scanned per-query.
Usually one talks about insert rate; how many actual new data rows are you inserting per second? What is the current query throughput? Am I right in understanding that you have a scatter-gather architecture that does not scale query cost linearly?
kidplumber said on May 1, 2011:
"You could have used memcached for this, which is already fast, distrivbuted, etc. Or Redis. Or Riak. or any other good key=>value store."
that would only address the storage aspect of the problem. The way I'm reading it, the problem is not in data storage, but in query performance.
The first solution looks like they ran into cross joins across 14+ tables over 500k rowsets.
Second sounds like they actually did try the traditional key=>value solution. Process the dataset and store the result in a large distributed k->v system for fast storage/retrieval by the front side. Seems like the problem was that processing speed would not scale effectively against a workload which grows exponentially.
Essentially, druid is a reversion to the rdbms architecture, except it sounds like it's designed specifically to handle the query load that failboats greenplum with ms timings obviating the need for extensive pre-computation or caching.
I'd be curious tho as to how druid deals with operational hiccups like dropping nodes, soft-failures on the backend, network splits, latency spikes, and the like.
Christian G. Warden said on May 1, 2011:
Did you look at Palo? It's not distributed, but as an open source MOLAP engine, it might have been a good base from which start.
Rajat Goel said on May 1, 2011:
Is this system only meant for roll-up queries or do you also plan to support queries like "which publisher created the most revenue?".
Adam said on May 2, 2011:
Out of curiosity, what are some of the features that you've built or are planning to build in Druid that makes you consider it an "OLAP like" implementation? Aggregation is obviously an important part, but are you building a formula/expression language (similar to MDX) that allows for evaluation with relative scope at scale (i.e. rolling averages, traversing parent/child, % of parent, etc)?
If so, would you mind discussing some of the tactics on how to evaluate these expressions in a distributed way while maintaining the high level of performance you're aiming for?
Benjamin Darfler said on May 3, 2011:
With the crazy amount of hardward you ended up throwing at the problem you really should have evaluated other NoSQL options like MongoDB or commercial offerings such as Vertica. Maybe the follow up post will better explain how Druid works and why these options wouldn't but I have a sneaking suspicion that you did it wrong.
Sumit Pal said on May 5, 2011:
This is quite true and something similar to what we faced as well. We use an Open Source OLAP Engine. We however have 2 things that allowed us to use an OLAP Engine. We were limited to just 5 Dimensions and our Data Sizes are about 300 Million rows. However 1 of the Dimensions had a cardinality of 1 million - which was probelematic for the Olap Engine to handle.
However, what the author says here makes perfect sense and I can see how you guys must have run into issues with large data, larger number of dimensions and also increasing cardinality of some dimensions.
We developed a logic of doing Cube Switching - created Aggregate Lattices - to aggregate the data and then our Cube Switcher switched the cubes dynamically based on which one would give us the best response time.
Aggretate lattices become more complicated if there are larger number of dimensions and each of those dimensions also have Hierarchies or more than 1 Hierarchies.
Brian f said on May 13, 2011:
What exactly does "•Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds." mean? Do the queries scan all 1 billion rows, or are your queries using aggregates? Many people have deployed OLAP solutions with several billion rows, but most queries do not scan the entire data set - hence preaggregation. A query pulling yearly sales numbers might only pull 12 numbers, the sales for each month in the year.
Here is a better question, can you do a distinct count against your 1 billion rows in 950 milliseconds?
Pingback: Building a Big Analytics Infrastructure « AK Tech Blog
[...] problems. Some notable exceptions are Facebook, Twitter’s Rainbird and MetaMarket’s Druid. In this post we provide an overview of how we built Aggregate Knowledge’s “big [...]
Pingback: Market Data – Big Data Thoughts « Tales from a Trading Desk
[...] technology offerings at the time to be in appropriate for Big Data, and when their own way with Druid, leveraging a data storage schema very similar to Twitter’s Rainbird. Metamarkets design [...]
Noah Yetter said on October 24, 2012:
"Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds."
What do you mean by "scan"? Scan out of memory I would believe, but that's not how any reasonable reader would interpret "scan". You cannot peel a billion rows off of EBS in 1 second, unless perhaps they were each 1 byte long. Instance storage... maybe, but that's not safe. Perhaps you are distributing the storage and therefore the I/O as well?
Basically your claims are too fantastic and too vague to carry any weight. I/O, not processing power, is the great bottleneck of data warehousing, and you haven't even mentioned it. Where's the beef?
@angrynoah
Pingback: Druid Open Sourced | … and points beyond
[...] Druid, a scalable, in-memory analytical database has been open sourced. I first wrote about them in [...]
Pingback: Metamarkets open sources distributed database Druid « Another Word For It
[...] to real-time analysis. Metamarkets, who provide “Data Science-as-a-Service” business analytics, last year revealed details of in-house distributed database Druid – and have this week released it as [...]
Eric Tschetter said on October 29, 2012:
"What do you mean by "scan"? Scan out of memory I would believe, but that's not how any reasonable reader would interpret "scan". You cannot peel a billion rows off of EBS in 1 second, unless perhaps they were each 1 byte long. Instance storage... maybe, but that's not safe. Perhaps you are distributing the storage and therefore the I/O as well?"
"Scan" means scan across the data. In this case, the data set is all in memory ahead of time. At the time of writing of this blog post, the system actually only operated by first loading data into the JVM heap and then ran queries against that.
You are correct in saying that you aren't going to get the IO throughput to scan 1 billion things in 1 second off of a single EBS volume. When things do go to disk we are distributing that IO across the machines in the cluster, so it can and will go faster than just a single volume, but if your workloads are always hitting disk all the time, you will not see the same performance. There's no way to get around physics, but depending on the value of the data set, it can be possible to set it up such that your working set is primarily in memory, which just means that your processing will go that much faster.
Pingback: Philly ETE – Druid Column Oriented Database — Gary Sieling
[...] http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/ [...]
Leave a Comment