One of the key analytics products we provide is robust time-series forecasting over multidimensional faceted data, which requires giving users access to a number of predictions exponential in the base data dimensionality. For online publishers, such forecasting includes simple metrics like impressions, revenue and actions, and more generally we might want to model CTR or more complex metrics like the opportunity cost of showing house ads versus sponsorship ads.
Existing forecasting systems tend to exist inside of traditional OLAP storage offerings that embed a limited set of statistical functions, many of which are poorly optimized and slow to execute. The machine learning renaissance of the last decade has been tempered by powerful additions from the algorithms and distributed systems communities: compressed sensing, feature hashing and feature sharding across multiple cores have significantly lowered computational costs, while advances in regularization and structured sparsity from applied statistics have likewise increased model robustness, allowing unprecedented modeling of high-degree feature interactions.
In this post I’ll outline some of the engineering challenges behind putting a large-scale machine learning system into production, focusing less on actual learning algorithm implementations (as these are fairly commoditized at this point: mahout, vowpal wabbit, scikits.learn, etc) and instead addressing our engineering architecture. In particular we’ll look at time-series forecasting, separate from more generic prediction, where our end goal is to create forecasts where nothing interesting occurs.*
For forecasting, we combine predictions from two separate subsystems: a top-down “structural” model and a bottom-up “cross-correlation” model.
Top-down models capture long-term trends and other temporal correlations:
For example, some of our clients tend to see more traffic on certain sites during the Academy Awards, but less on Easter Sunday (unless they’re living in Japan or Hong Kong, etc). Through several variants of this model, we can also address transient data anomalies:
and structural changes in baseline and trend, such as new site layouts that affect aggregate traffic flows:
How these effects are integrated into forecasts depends on the semantics of the underlying metric: for online advertising impressions we may want to simply predict “baseline” guaranteed traffic; when forecasting price, we may be more interested in the structure of transient changes.
Bottom-up large-scale models capture fine-grained atemporal feature interactions. This approach trades model flexibility for brute force data mining power, capturing high-order correlations between surface data, such as city x site effects and audience x time-of-day effects. An example of what this model captures is that visitors with New York City and London IP addresses are more likely to consume financial news content on weekday mornings.
For learning, we make use of stochastic gradient descent for parameter updates, and implicit high-order feature generation to capture cross-correlations combined with aggressive regularization and low bit-order feature hashing to combat overfitting and keep updates efficient. These choices have become the de-facto state of the art for computational efficiency and prediction quality, powering systems such as Google’s Sybil and Vowpal Wabbit (cf. learning on cores, clusters and clouds workshop at NIPS).
We also built an expressive feature specification framework directly into our configuration language; it is well-established that feature engineering often has a much larger impact on predictive quality than choice of learning algorithm (cf. the BellKor Netflix Solution, KDD Cup 2010 discussions, the Stanford IR Book).
In addition to generating state of the art forecasts, we identified three main engineering desiderata:
Our underlying model framework must (1) be able to cope with multiple terabytes of data per client, and (2) gracefully degrade, automatically prioritizing the most important aspects of prediction over less important ones (maintain high precision / manipulate recall), (3) parallelizable across a commodity hadoop cluster.
Fresh data arrives daily / hourly / or even realtime, and as such we cannot waste computational resources constantly retraining the predictive models from scratch. However, online incremental learning suffers from convergence effects and drift in the underlying data. Therefore we took a hybrid approach: we couple an incremental model, updated as data arrives with an introspective changepoint detection system for forcing full model rebuilds. This approach also gives us more scaling flexibility: incremental updates run in O(hours) while batch retrains run in O(days).
Generic machine learning efforts ultimately fail; so we acknowledge the need for continual model development and backtesting. Hence we desire a system with simple, modular configuration for tuning model parameters and error metrics and clean component architecture. Our current component set includes multiple data ingestion pipelines (+ automated data versioning and incremental cacheing along the way) and several different high-level forecasting models that take into account structural changepoints, conditional heteroscedasticity, and long-term trends.
We also support additional covariance models as overlays, for example audience tags overlaid with site usage data, via online matrix factorization (think: collaborative filtering / Netflix prize). These models capture structure that is orthogonal to the underlying forecast, but nonetheless important from a user perspective. For instance, during the Women’s World Cup / Olympics, gender-based consumption of sports content exhibits strong shifts.
Despite the effort put into “future-proofing” the forecasting system, there will inevitably be features that force us to rethink our design. Currently we’re working to expose the underlying prediction stack more cleanly, in order to reduce friction for surfacing atemporal analytics. Further afield, we intend to address
- Model integration / combining outputs of multiple models hierarchically. E.g., integrated supply and demand forecasting.
- Automated prior estimation for new data facets.
- Automated prediction validation scoring framework (independent of prediction source).
[*] “Ennui” — with apologies to Sylvia Plath.