Real Real-Time. For Real.

Danny Yuan, Cloud System Architect at Netflix, and I recently co-presented at the Strata Conference in Santa Clara. The presentation discussed how Netflix engineers leverage Druid, Metamarkets’ open-source, distributed, real-time analytical data store, to ingest 150,000 events per second (billions per day), equating to about 500MB/s of data at peak (terabytes per hour), while still maintaining real-time, exploratory querying capabilities. Before and after the presentation, we had some interesting chats with conference attendees. One common theme from those discussions was curiosity around the definition of “real-time” in the real world and how Netflix could possibly achieve it at those volumes. This post summarizes what we learned from those conversations and responds to some of those questions.

What is Real-time?

Real-time has become a heavily overloaded term, so it is important to define it properly. I will limit the discussion to the term’s usage in the data space, as it takes on different meanings in other arenas. In the data space, it is now commonly used to refer to one of two kinds of latency: query latency and data ingestion latency.

Query latency is how long it takes for a query to return its results. It assumes a static data set and refers to the speed at which you can ask questions of that data set. Right now, the vast majority of “real-time” systems co-opt the word real-time to mean “fast query latency.” I do not agree with this definition of “real-time” and prefer “interactive queries,” but it is the most prevalent use of the term and thus worth noting.

Data ingestion latency is the amount of time it takes for an event to be reflected in your query results. An example would be the time from when someone visits your website to when you can run a query that tells you about that person’s activity on your site. When that latency is close to a few seconds, you feel like you are seeing what is going on right now, or that you are seeing things in “real-time.” This is what I believe most people assume when they hear about “real-time data.” However, low data ingestion latency is the lesser-used definition because the infrastructure to support it at scale (tens of billions of events/terabytes of data per day) is largely lacking, while the infrastructure to support fast query latencies is easier to create and readily available.
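To make the definition concrete, here is a minimal sketch of how you might measure data ingestion latency for your own pipeline: emit a probe event, then poll your query layer until that event shows up. The sendEvent and eventIsQueryable helpers are hypothetical stand-ins for whatever event bus and query API you actually run; nothing here is specific to any one system.

```java
import java.time.Duration;
import java.time.Instant;

public class IngestionLatencyProbe {
    public static void main(String[] args) throws InterruptedException {
        Instant emitted = Instant.now();
        String probeId = "probe-" + emitted.toEpochMilli();
        sendEvent(probeId); // hypothetical: write the probe event into your pipeline

        // Poll until a query reflects the probe event.
        while (!eventIsQueryable(probeId)) {
            Thread.sleep(250);
        }
        Duration latency = Duration.between(emitted, Instant.now());
        System.out.printf("Data ingestion latency: %.1fs%n", latency.toMillis() / 1000.0);
    }

    // Hypothetical stubs; wire these to your own event bus and query API.
    static void sendEvent(String id) { /* ... */ }
    static boolean eventIsQueryable(String id) { return true; /* ... */ }
}
```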

What’s Considered Real-Time?

Okay, now that we have a definition of real-time, and that definition depends on latency, the remaining question is which latencies are good enough to earn the “real-time” moniker. The truth is that it’s up to interpretation. The key point is that the people who see the output of the queries feel like they are looking at what is going on “right now.” I don’t have any scientifically driven method for finding this boundary, but I do have experience from interacting with customers at Metamarkets.

Conclusions first, descriptions second. To be considered real-time, query latency must be below 5 seconds and data ingestion latency must be below 15 seconds.

Of course, the faster both of these occur, the better, but these are the thresholds that seem to be meaningful. There’s also an interesting phenomenon: the worst place to be is actually just a little bit slower than these thresholds. Being slightly slower turns out to be much more frustrating than being significantly slower because of the expectations it sets.

Specifically, with query latencies under 5 seconds, you generally get a response back fast enough that you feel like you are interacting with your data. Waiting 5 seconds for a query can feel like a while, but it’s fast enough to be “acceptable,” and you can iterate on your query multiple times per minute. But increase that to just 15 seconds per query and the story changes. You just issued a query, you know the response is going to come back in a few seconds, but those few seconds are not enough time for you to actually switch tasks. So you are stuck sitting there, twiddling your thumbs, waiting for your response. If it’s going to take a minute or longer, you can at least read some e-mail, get a cup of coffee, or do something else to fill the time.

Data ingestion latency has a similar shape. If it takes under 15 seconds for events to be reflected, people are willing to believe that they are seeing what is happening right now. If it takes over 15 minutes, they believe they are seeing what happened in the past. Whether it’s 30 minutes, 2 hours, or 6 hours, it’s all still the past, not “right now.” But between the 15-second and the 15-minute mark, you feel like you should be seeing what is going on right now, yet you aren’t willing to let yourself believe it. There’s a constant feeling of “can’t you make it faster?” You improve things and go from 15 minutes to 10 minutes, which is good, but still not good enough, so you iterate again. You get it down to 5 minutes, but it’s still not good enough. Can you do it faster? This continues until you get under 15 seconds, and bridging that final gap often requires thinking about the problem differently than you did to achieve even 5 minutes.

If you are faced with the prospect of trying to deliver real-time data and are currently at the 15+ minute mark, our recommendation is to aim squarely for the 15-second mark and not bother with iterations in the middle. They won’t provide as much value as you hope and often won’t help you hit the sub-15-second mark.

Why Druid?

Of course, in my infinite bias, I’m going to tell you how Druid is able to handle data ingestion latencies in the sub-15 second range. If I didn’t, this blog post would be quite pointless. If you are interested in how Druid handles the query latency side of the endeavor, please watch the video from my October talk at Strata NY. Here, I will continue with the data ingestion side of the story.

How does Druid do it?

If you want to deeply understand Druid, a great place to start is its whitepaper, but here is a brief overview of how the real-time ingestion piece achieves its goals. Druid handles real-time data ingestion with a separate node type: the descriptively named “real-time” node. Real-time nodes encapsulate the functionality to both ingest and query data streams, so data indexed via these nodes is immediately available for querying. Typically, for data durability purposes, a message bus such as Kafka sits between the event creation point and the real-time node.
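For illustration, here is a minimal sketch of the “event creation point” side of that architecture: a producer writing each event onto the Kafka topic that fronts the real-time nodes. It uses the current Kafka Java client (which postdates the setup described in this post); the broker address, topic name, and event payload are all assumptions for the example.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventEmitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each page view, log line, etc. becomes one message on the topic
            // that the real-time nodes consume from; the topic name is made up.
            String event = "{\"timestamp\":\"2014-02-14T00:00:00Z\",\"page\":\"/home\"}";
            producer.send(new ProducerRecord<>("events", event));
        }
    }
}
```

The message bus decouples producers from consumers: events are safely buffered even if a real-time node is down or falls behind.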

The purpose of the message bus is to act as a buffer for incoming events. The message bus maintains offsets indicating the point in the event stream that a real-time node has read up to, and the real-time nodes update these offsets periodically.

Real-time nodes pull data in from the message bus and buffer it in in-memory indexes. To limit how much is lost if a node dies, the nodes persist their indexes to disk either periodically or after some maximum size threshold is reached. After each persist, a real-time node updates the message bus with everything it has consumed so far (in Kafka, this is done by “committing the offset”). If a real-time node fails and recovers, it can simply reload any indexes that were persisted to disk and continue reading the message bus from the last committed offset.
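The following sketch shows the shape of that consume/persist/commit loop, including the offset bookkeeping described above. It is not Druid’s actual implementation; it uses the current Kafka Java consumer client, a plain list as a stand-in for a real index, and an assumed persist threshold and persistToDisk helper. The key idea it demonstrates is committing the offset only after a successful persist, so a recovering node re-reads exactly the events it had not yet persisted.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RealtimeIngestLoop {
    static final int MAX_BUFFERED_ROWS = 100_000; // assumed persist threshold

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "realtime-node");
        props.put("enable.auto.commit", "false"); // commit manually, only after a persist
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // assumed topic name
            List<String> inMemoryIndex = new ArrayList<>(); // stand-in for a real index

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    inMemoryIndex.add(record.value()); // rows become queryable immediately
                }
                if (inMemoryIndex.size() >= MAX_BUFFERED_ROWS) {
                    persistToDisk(inMemoryIndex); // write an immutable index to disk
                    consumer.commitSync();        // advance the offset only after the persist
                    inMemoryIndex.clear();
                }
            }
        }
    }

    static void persistToDisk(List<String> rows) { /* hypothetical: write rows as an index file */ }
}
```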

Real-time nodes expose a consolidated view of their current in-memory index and all of the indexes persisted to disk. This provides a consistent view of the data on the query side while still allowing data to be appended incrementally. On a periodic basis, the nodes schedule a background task that takes all of the persisted indexes for a data source, merges them together to build a segment, and uploads the segment to deep storage. The node then signals the historical compute nodes to begin serving the segment. Once the compute nodes have loaded the data and started serving requests against it, the real-time node no longer needs to maintain its older data; it cleans up the older segment and begins work on its new segment(s). This intricate and ongoing sequence of ingest, persist, merge, and handoff is completely fluid: the people querying the system are unaware of what is going on behind the scenes, and they simply have a system that works.
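To make the merge-and-handoff step concrete, here is a sketch of that background task under assumed helper names (none of these are real Druid APIs). The ordering is what matters: merge, upload to deep storage, wait for a historical node to take over serving, and only then drop the local copies, so the data is always queryable somewhere.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SegmentHandoffTask {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Periodically merge persisted indexes into a segment and hand it off.
        scheduler.scheduleAtFixedRate(SegmentHandoffTask::mergeAndHandOff, 1, 1, TimeUnit.HOURS);
    }

    static void mergeAndHandOff() {
        List<Path> persisted = listPersistedIndexes();  // all indexes persisted for a data source
        Path segment = mergeIndexes(persisted);         // merge into one immutable segment
        uploadToDeepStorage(segment);                   // e.g. S3 or HDFS
        waitUntilHistoricalNodeServes(segment);         // handoff: a historical node loads it
        deleteLocal(persisted);                         // safe to drop the older local data now
    }

    // Hypothetical stubs standing in for the real merge/upload/announce machinery.
    static List<Path> listPersistedIndexes() { return List.of(); }
    static Path mergeIndexes(List<Path> indexes) { return Path.of("segment"); }
    static void uploadToDeepStorage(Path segment) { /* ... */ }
    static void waitUntilHistoricalNodeServes(Path segment) { /* ... */ }
    static void deleteLocal(List<Path> indexes) { /* ... */ }
}
```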

TL;DR, yet you somehow made it to the end of the post:

A deep understanding of the problem, specifically the end users’ expectations and how those expectations shape their interactions, is key to designing a technological solution. When dealing with transparency and analytical needs for large quantities of data, the big questions around user experience that must be answered are how soon data needs to be available and how quickly queries need to return.

Hopefully this post helped clarify the considerations around these two key components and how infrastructure can be developed to handle them.

Lastly, the shameless plug for Druid: you should use Druid.

Druid is open source; you can download it and run it on your own infrastructure for your own problems. If you are interested in learning more about Druid or trying it out, the code is available on GitHub, and our wiki with documentation is available here. Finally, to complete the link soup at the bottom of the post, here is our introductory presentation at Strata, and here is our most recent Strata talk with Danny about real-time in Santa Clara.
