Moving Real-Time Data Flow Across Cloud Providers

Filed in Algorithms, Data Science, Druid, Technology

Eventually, in the course of data growth, a company needs to make a major migration of data or processes from one physical location to another. This post is the story of how we moved a real-time data flow across cloud providers using Kafka, Samza, and some creative engineering.

History

Our technology stack for data processing is something we’ve spoken about before. We run a Lambda architecture with the real-time system comprising Kafka and Samza, which terminates in Druid real-time indexing tasks. The batch system consists of Spark, which reads from and writes to S3. Druid historical nodes use S3 as […]


Going Multi-Cloud with AWS and GCP: Lessons Learned at Scale

Filed in Algorithms, Druid, Industry, Technology

Metamarkets handles a lot of data. The torrent of data that clients send to us surpasses a petabyte a week. At this scale, the ability to fail over gracefully, to detect and eliminate brownouts, and to efficiently operate huge quantities of byte-banging machines is necessary. We started and grew Metamarkets in AWS’s us-east region, and the majority of our footprint was in a single availability zone (AZ). As we grew, we started to see the side effects of being restricted to one AZ, then the side effects of being restricted to one region. It’s kind of like inflating a balloon in […]


Autoscaling Samza with Kafka, Druid and AWS

Filed in Druid, Machine Learning, Technology

At Metamarkets, we receive more than 100 billion events per day, totaling more than 100 terabytes. These events are processed in real-time streams, allowing our clients to visualize and dissect them on our interactive dashboards. This data firehose must be managed in a way that is reliable without sacrificing cost efficiency. This post will demonstrate how we have implemented scaling modeling in a turbulent environment to right-size part of our real-time data streams. Our technical stack is based on Kafka, Samza, Spark and Druid and runs on Amazon Web Services. Incoming events go first to Kafka, […]
