In April 2011, we introduced Druid, our distributed, real-time data store. Today I am extremely proud to announce that we are releasing the Druid data store to the community as an open source project. To mark this special occasion, I wanted to recap why we built Druid, and why we believe there is broader utility for Druid beyond Metamarkets’ analytical SaaS offering.
When we started to build Metamarkets’ analytics solution, we tried several commercially available data stores. Our requirements were driven by our online advertising customers who have data volumes often upwards of hundreds of billions of events per month, and need highly interactive queries on the latest data as well as an ability to arbitrarily filter across any dimension – with data sets that contain 30 dimensions or more. For example, a typical query might be “find me how many advertisements were seen by female executives, aged 35 to 44, from the US, UK, and Canada, reading sports blogs on weekends.”
First, we went the traditional database route. Companies have successfully used data warehouses to manage the cost and overhead of querying historical data, and the architecture aligned with our goals of data aggregation and drill down. For our data volumes (reaching billions of records), we found that the scan rates were not fast enough to support our interactive dashboard, and caching could not be used to reliably speed up queries due to the arbitrary drill-downs we need to support. In addition, because RDBMS data updates are inherently batch, updates made via inserts lead to locking of rows for queries.
Next, we investigated a NoSQL architecture. To support our use case of allowing users to drill down on arbitrary dimensions, we pre-computed dimensional aggregations and wrote them into a NoSQL key-value store. While this approach provided fast query times, pre-aggregations required hours of processing time for just millions of records (on a ~10-node Hadoop cluster). More problematically, as the number of dimensions increased, the aggregation and processing time increased exponentially, exceeding 24 hours. Beyond its cost, this time created an unacceptably high latency between when events occurred and when they were available for querying – negating any possibility of supporting our customers’ desire for real-time visibility into their data.
We thus decided to build Druid, to better meet the needs of analytics workloads requiring fast, real-time access to data at scale.
Druid’s key features are:
- Distributed architecture. Swappable read-only data segments using an MVCC swapping protocol. Per-segment replication relieves load on hot segments. Supports both in-memory and memory-mapped versions.
- Real-time ingestion. Real-time ingestion coupled with broker servers to query across real-time and historical data. Automated migration of real-time to historical as it ages.
- Column-oriented for speed. Data is laid out in columns so that scans are limited to specific data being searched. Compression decreases overall data footprint.
- Fast filtering. Bitmap indices with CONCISE compression.
- Operational simplicity. Fault tolerant due to replication. Supports rolling deployments and restarts. Allows simple scale up and scale down – just add or remove nodes.
From a query perspective, Druid supports arbitrary Boolean filters as well as Group By, time series roll-ups, aggregation functions and regular expression searches.
In terms of performance, Druid’s scan speed is 33M rows per second per core, and can ingest up to 10K incoming records per second per node. We have horizontally scaled Druid to support scan speeds of 26B records per second.
Now that more people have hands-on experience with Hadoop, there is a broadening realization that while it is ideal for batch processing of large data volumes, tools for real-time data queries are lacking. Hence there is growing interest in tools like Google’s Dremel and PowerDrill, as evidenced by the new Apache Drill project. We believe that Druid addresses a gap in the existing big data ecosystem for a real-time analytical data store, and we are excited to make it available to the open source community.
Metamarkets has engaged with multiple large internet properties like Netflix, providing early access to the code for evaluation purposes. Netflix is assessing Druid for operational monitoring of real-time metrics across their streaming media business.
Sudhir Tonse, Manager, Cloud Platform Infrastructure says, “Netflix manages billions of streaming events each day, so we need a highly scalable data store for operational reporting. We are so far impressed with the speed and scalability of Druid, and are continuing to evaluate it for providing critical real-time transparency into our operational metrics.”
Metamarkets anticipates that open sourcing Druid will also help other organizations solve their real-time data analysis and processing needs. We are excited to see how the open source community benefits from using Druid in their own applications, and hopeful that Druid improves through their feedback and usage.
Druid is available for download on GitHub at https://github.com/metamx/druid, and more information can be found on the Druid project website.
9 Comments on “Introducing Druid: The Real-Time Analytics Data Store”
Pingback: Metamarkets Open Sources Druid, A Real-Time Analytics Data Store | My Daily Feeds
[...] Hacker News http://metamarkets.com/2012/metamarkets-open-sources-druid/ wandelingen This entry was posted in Uncategorized by admin. Bookmark the [...]
Pingback: Four short links: 25 October 2012 - O'Reilly Radar
[...] druid (github) — open source (GPLv2) a distributed, column-oriented analytical datastore. It was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. See also the announcement of its open-sourcing. [...]
Pingback: Druid Goes Open Source « Tales from a Trading Desk
[...] Goes Open Source Interesting to see that MetaMarkets has pushed Druid into the Open Source world. The history of Druid is interesting, specifically the path [...]
Pingback: Web Scale Analytics Reading List « CodeDependents
[...] http://metamarkets.com/2012/metamarkets-open-sources-druid/ Share this:ShareEmailPrintDiggFacebookStumbleUponRedditTwitterLike this:LikeBe the first to like this. [...]
Pingback: 10 Takeaways From Hadoop World « Mehtaphysical
[...] announced the open source Impala project while Metamarkets open sourced its own related work on Druid. Vendors like Platfora separately had in-memory OLAP as a core part of their [...]
Pingback: Druid Open Sourced | … and points beyond
[...] Druid, a scalable, in-memory analytical database has been open sourced. I first wrote about them in January. Related [...]
diet said on January 8, 2013:
I enjoyed reading this. Thanks for posting this. I'll definitely return again to read more and inform my friends about your writing.
Adam said on January 12, 2013:
Has anyone successfully connected to Druid using existing tools like Pentaho?
My issue is that Druid does not include a front end reporting suite (I.e. dashboarding, analytics etc).
My thinking therefore, was whether it is possible to use the excellent offering of the Pentaho tool set with the Druid In Memory backend.
Has anyone done this? If so, is it possible to share your experience? The Druid documentation is quit lacking.
thanks in advance.
Eric Tschetter said on January 14, 2013:
Adam,
As far as I am aware, nobody has connected up any of the Pentaho tools to Druid. Most of the people who have integrated so far has either (1) used Hadoop or (2) used Kafka.
It's possible that there are people lurking on the list who have tried this, but I'm not sure.
druid-development at googlegroups.com
Leave a Comment