Autoscaling Samza with Kafka, Druid and AWS

Filed in Druid, Machine Learning, Technology

At Metamarkets, we receive more than 100 billion events per day, totaling more than 100 terabytes. These events are processed in real-time streams, allowing our clients to visualize and dissect them on our interactive dashboards. This data firehose must be managed reliably without sacrificing cost efficiency. This post demonstrates how we implemented scaling modeling in a turbulent environment to right-size part of our real-time data streams. Our technical stack is based on Kafka, Samza, Spark and Druid and runs on Amazon Web Services. Incoming events go first to Kafka, […]


Algorithmic Trendspotting & the Meaning of “Interesting”

Filed in Machine Learning, Technology

One challenging analytics problem that we work on at Metamarkets is anomaly detection: given a sea of trends, how do you surface the most important ones? In the era of big data, filters are necessary to prevent drowning users in a torrent of information. Surfacing only the starkest deviations from the expected brings focus to the unexpected and interesting. What does it mean for a trend to be “interesting”? Is it one that is uncharacteristic “vertically” (according to its past history) or “horizontally” (relative to its peers)? A vertical anomaly might encompass a sudden unexpected spike in revenue for a given […]


“Designing Futures Where Nothing Will Occur”: The Art of Forecasting

Filed in Machine Learning, Technology

One of the key analytics products we provide is robust time-series forecasting over multidimensional faceted data, which requires giving users access to a number of predictions exponential in the base data dimensionality. For online publishers, such forecasting includes simple metrics like impressions, revenue and actions, and more generally we might want to model CTR or more complex metrics like the opportunity cost of showing house ads versus sponsorship ads. Existing forecasting systems tend to exist inside of traditional OLAP storage offerings that embed a limited set of statistical functions, many of which are poorly optimized and slow to execute. The […]


Hacking Hacker News Headlines

Filed in Fun, Machine Learning, Technology

One weekend a few months ago Vad [1] and I were hanging around the new Metamarkets office reading Hacker News. We noticed something strange: two different headlines, both linking to identical content, resulted in dramatically different popularity ranks. Do headlines matter so much? What drives observed popularity? We started to investigate. (Figure: rolling 10 days of article ranks.)


Why Generic Machine Learning Fails

Filed in Machine Learning, Musings, Technology

Inside the Belly of a Beast: I’ve been lucky enough to have spent a significant chunk of my time over the past five years hacking the machine learning stack at Google. It is every bit as awesome, magical and fearsome as you can imagine. It is, very literally, the beating heart of one of the world’s largest economies; it’s immensely complex, it can’t be turned off and classification error is measured in money. Fundamentally, it’s the model for practical machine learning, and there is a lot of insight to be gleaned from recent work there. In terms of availability and […]
