Druid and Spark Together – Mixing Analytics Workflows

Filed in Druid, Technology

One way of looking at the human thought process is that we have different brain workflows for different analytics and data processing needs. At Metamarkets we have the same thing for our data processing machines. This post will explore some of our experience with bin-packing query nodes featuring Druid with batch processing featuring Spark, using Apache Mesos as the resource coordinator.

Thinking Fast

The above shows a typical “work” pattern for a Druid historical node throughout a typical weekday (times shown are EDT). It shows the quantity of CPU seconds consumed by the JVM to answer queries. There are […]

Moving Real-Time Data Flow Across Cloud Providers

Filed in Algorithms, Data Science, Druid, Technology

Eventually, in the course of data growth, a company needs to make a major migration of data or processes from one physical location to another. This post is the story of how we moved a real-time data flow across cloud providers using Kafka, Samza, and some creative engineering.

History

Our technology stack for data processing is something we’ve spoken about before. We run a Lambda architecture with the real-time system comprising Kafka and Samza, which terminates in Druid real-time indexing tasks. The batch system consists of Spark, which reads from and writes to S3. Druid historical nodes use S3 as […]

Going Multi-Cloud with AWS and GCP: Lessons Learned at Scale

Filed in Algorithms, Druid, Industry, Technology

Metamarkets handles a lot of data. The torrent of data that clients send to us surpasses a petabyte a week. At this scale, the ability to fail over gracefully, to detect and eliminate brownouts, and to efficiently operate huge quantities of byte-banging machines is necessary. We started and grew Metamarkets in AWS’s us-east region, and the majority of our footprint was in a single availability zone (AZ). As we grew, we started to see the side effects of being restricted to one AZ, then the side effects of being restricted to one region. It’s kind of like inflating a balloon in […]

Druid Query Optimization with FIFO: Lessons from Our 5000-Core Cluster

Filed in Druid, Technology

Druid’s Horizontal Scale

A major strength of using Druid as a data store and aggregation engine is its ability to scale horizontally. Whenever more data is in the system, or whenever faster compute times are desired, it is simply a matter of throwing more hardware at the problem, and Druid auto-detects and auto-balances its workloads. At Metamarkets we are currently ingesting over 3M events/second (replicated) into our Druid cluster and have several hundred historical nodes serving this data across multiple tiers. Part of the power of this horizontal scale is how Druid breaks up data into shards. Each […]

Effect of Frequency Governor on Java Benchmarking

Filed in Technology

A very common tool in a programmer’s arsenal is a MacBook Pro (MBP). However, a major OS X drawback for a developer is the lack of easy, fine-grained control over kernel behaviors similar to that found on machines running Linux or raw BSD. In this post, we will explore the effect of the frequency governor on MBPs running a modern Intel chip. The wall-time query execution speed in Druid will be used as a simple Java benchmark. One of the most common tasks during the course of evaluating code is to look at key bottlenecks in execution time. […]