Druid and Spark Together – Mixing Analytics Workflows

Filed in Druid, Technology

One way of looking at the human thought process is that we have different brain workflows for different analytics and data processing needs. At Metamarkets we have the same thing for our data processing machines. This post explores some of our experience with bin-packing query nodes running Druid alongside batch processing running Spark, using Apache Mesos as the resource coordinator. Thinking Fast: the chart above shows a typical "work" pattern for a Druid historical node throughout a typical weekday (times shown are EDT), measured as the quantity of CPU seconds consumed by the JVM to answer queries. There are […]


Managing a Large-scale Spark Cluster with Mesos

Filed in Best Practices, Druid, Technology

At Metamarkets, we ingest more than 100 billion events per day, which we process in both real-time and batch modes. We store received events in our Kafka cluster, and the stored events are consumed by Samza for real-time stream processing and by Spark for batch processing. Some clients send us data only in batch, but we also run batch processing for clients who send data in real time, in order to correct any inaccuracies in the produced data, including deduplicating events and joining events that fell outside the real-time join window. Batch processing is a two-step operation where […]
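The post itself is truncated here, but the deduplication step it mentions can be sketched conceptually: keep one record per event, preferring the latest delivery. A minimal pure-Python illustration of that rule follows; the event shape (id, timestamp, payload) and the keep-latest policy are assumptions for illustration, and the actual pipeline runs this kind of logic at scale in Spark rather than in plain Python.

```python
# Conceptual sketch of batch deduplication: for each event id, keep the
# record with the latest timestamp. In a real Spark job this would be a
# groupBy/reduce over billions of events; plain Python shows only the logic.
# The (id, timestamp, payload) event shape is an assumption for illustration.

def dedupe(events):
    """Keep one record per event id, preferring the latest timestamp."""
    latest = {}
    for event_id, ts, payload in events:
        if event_id not in latest or ts > latest[event_id][0]:
            latest[event_id] = (ts, payload)
    return [(eid, ts, payload) for eid, (ts, payload) in latest.items()]

events = [
    ("a", 1, "first"),
    ("a", 2, "retry"),   # duplicate delivery of event "a"
    ("b", 1, "only"),
]
print(sorted(dedupe(events)))  # → [('a', 2, 'retry'), ('b', 1, 'only')]
```

In Spark this same policy maps naturally onto a keyed reduction, which is why it fits the batch leg of the pipeline rather than the real-time one.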
