Building a Data Pipeline That Handles Billions of Events in Real-Time

Filed in Corporate, Technology

At Metamarkets, our goal is to help our clients make sense of large amounts of data in real time. Our platform ingests tens of billions of new events every day, and currently comprises trillions of aggregated events. Our real-time analytics platform has two separate yet equally important goals: interactivity (real-time queries) and data freshness (real-time ingestion). We’ve written before about how Druid, our open-source datastore, is able to offer fast, interactive queries. In this post, we’re going to focus on the challenges around achieving data freshness. We’ll talk about the batch-oriented pipelines we started with, and how we approached building real-time […]

Open Source Leaders Sound Off on The Rise of the Real-Time Data Stack

Filed in Druid, Technology

In February we were honored to speak at the O’Reilly Strata conference about building a robust, flexible, and completely open source data analytics stack. If you couldn’t make it, you can watch the video here. Preparing for our talk got us thinking about all the brilliant folks working on similar problems, so we organized a panel that same night to continue the conversation. The discussion featured key contributors to several open source technologies: Andy Feng (Storm), Eric Tschetter (Druid), Jun Rao (Kafka), and Matei Zaharia (Spark). It was moderated by VentureBeat Staff Writer Jordan Novet and hosted by Zack Bogue […]

ETL: The Most Important Acronym You've Never Heard Of

Filed in Technology

This article originally appeared on AdExchanger on Thursday, February 27, 2014. Data is the fuel and the exhaust of programmatic advertising. It informs every transaction, and every transaction generates more of it. As impression volumes rise into the trillions across all manner of devices, the focus of many ad tech engineering teams isn’t on ethereal machine learning algorithms, but something far less glamorous. The process is called ETL: the critical, painstaking work of cleansing and consolidating disparate datasets. As the worlds of marketing and enterprise software collide, ETL could be the most important acronym you’ve never heard of. ETL […]

How We Scaled HyperLogLog: Three Real-World Optimizations

Filed in Corporate, Druid, Technology

At Metamarkets, we specialize in converting mountains of programmatic ad data into real-time, explorable views. Because these datasets are so large and complex, we’re always looking for ways to maximize the speed and efficiency of how we deliver them to our clients. In this post, we’re going to continue our discussion of some of the techniques we use to calculate critical metrics such as unique users and device IDs with maximum performance and accuracy. Approximation algorithms are rapidly gaining traction as the preferred way to determine the number of unique elements in high-cardinality sets. In the space of cardinality […]

Hadley Wickham & Joe Cheng of RStudio Return to BARUG

Filed in R, Technology

Last month, we were thrilled to host Dr. Hadley Wickham and Joe Cheng, creators of ggplot and Shiny, respectively, at the Metamarkets office for another session of the Bay Area useR Group Meetup (BARUG). Hadley introduced the crowd to dplyr, his new package that simplifies working with data frames in R, using a new verb syntax that allows users to express their data operations clearly and succinctly. The package comes with several data backends, so R users can work transparently with data frames in R or SQL databases. Hadley's slides that accompany the talk are available here. Joe talked about some of the […]
