Why Generic Machine Learning Fails

Inside the Belly of a Beast

I’ve been lucky enough to have spent a significant chunk of my time over the past five years hacking the machine learning stack at Google. It is every bit as awesome, magical and fearsome as you can imagine. It is, very literally, the beating heart of one of the world’s largest economies; it’s immensely complex, it can’t be turned off and classification error is measured in money. Fundamentally, it’s the model for practical machine learning, and there is a lot of insight to be gleaned from recent work there.

In terms of availability and scale, Metamarkets is tackling similar problems: publishers’ data flows in constantly and time-sensitive predictions need to be made in order to more efficiently manage supply volatility. Solving this problem requires a lot of moving parts: tools from machine learning, econometrics and statistics.

Bigger Data or Better Algorithms?

Peter Norvig gets a lot of flak for the "more data trumps better algorithms" thesis, mainly because, like all pithy statements, it's something of an overgeneralization, and there is plenty of fascinating machine learning work going on using "not-so-big data." To wit, some friends of mine, Jacob and Brendan, recently got really impressive results tracking regional variations in language on Twitter. This work involved a (necessarily) fancy model applied to nothing near the limit of available data. Could the model have been simpler? Yes. Did it have to be to get awesome results? No. The complexity of the model came from the standard component library used to construct it, and scaling was less important than capturing interesting structure.

Returns from increasing data size come from two sources: (1) the importance of tails and (2) the cost of model innovation. When tails are important, or when model innovation is difficult relative to the cost of data capture, then more data is the answer. The dialect work arguably could have ingested much more data, but it didn't need to in order to draw interesting conclusions. On the other hand, Google, and for that matter the 80M TinyImages dataset, are interesting precisely because of their scale and coverage.


My own response to Peter’s thesis would be to plot learning problems on a scalability-complexity curve, with the notion of an “efficient frontier” picking out the most expressive models that remain tractable at a given scale. Data can be aggregated, sliced, and projected in various ways, and completely different models with different scalability properties can be used to capture important structure independently.


Predicting the Past versus Predicting the Future

A major difference between industrial and academic applications of machine learning is that the latter focuses almost solely on predicting the past. The past, of course, can be really interesting (check out reconstructing Pompeian households or predicting scholarly output by universities), but data gets stale. I love InfoChimps; data brokering is an awesome, disruptive idea, and we’ll absolutely be first in line for some future data feeds. But in practice, historical interest rates from 1970 to 2007 just aren’t useful when your market is in flux in 2011.

Our publishing partners at Metamarkets need to predict the future: they need to know what is changing right now. New data arrives constantly, serial correlations abound, and predictions have hard decision deadlines. This necessitates a different view of learning than the standard batch train / test paradigm: online, incremental learning as new data or features become available.
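
To make the contrast concrete, here is a minimal sketch of what incremental learning looks like for a linear forecaster: every new observation updates the model immediately, so a prediction is always available when the decision deadline hits. The feature names, learning rate, and toy data stream are illustrative assumptions, not a description of our production pipeline.

```python
# Minimal sketch of online (incremental) learning for a linear forecaster.
# Feature names, learning rate, and the toy stream are illustrative
# assumptions, not our production system.

def predict(weights, features):
    """Dot product of the current weights with the feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def update(weights, features, target, learning_rate=0.01):
    """One stochastic-gradient step on squared error, applied as each
    observation arrives -- no batch retraining required."""
    error = predict(weights, features) - target
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value

weights = {}
# `stream` stands in for whatever feed delivers (features, observed_supply)
# pairs in arrival order.
stream = [
    ({"bias": 1.0, "hour_of_day": 14 / 24.0, "yesterday_supply": 0.80}, 0.75),
    ({"bias": 1.0, "hour_of_day": 15 / 24.0, "yesterday_supply": 0.75}, 0.71),
]
for features, observed in stream:
    forecast = predict(weights, features)  # usable the moment it's needed
    update(weights, features, observed)    # then learn from the outcome
```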

One Size Fits One Problem

I get pitched regularly by startups doing “generic machine learning,” which is, in all honesty, a pretty ridiculous idea. Machine learning is not undifferentiated heavy lifting; it’s not commoditizable like EC2, and it’s closer to design than to coding. The Netflix prize is a good example: the last 10% reduction in RMSE wasn’t due to more powerful generic algorithms, but rather to some very clever thinking about the structure of the problem, like BellKor’s observation that “people who rate a whole slew of movies at one time tend to be rating movies they saw a long time ago.” Flexible classification frameworks only helped insofar as they were capable of handling the additional features.
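
As a toy illustration of what that kind of insight looks like in practice (my own sketch, not BellKor’s code, and the field names are assumptions about a generic ratings log): turning “how many movies did this user rate today?” into a feature is a one-liner once you’ve noticed that it matters, and no generic framework would have proposed it for you.

```python
# Toy sketch of the "ratings per user per day" feature suggested by the
# BellKor observation. Field names are assumptions about a generic
# (user, movie, rating, date) log, not the actual Netflix schema.
from collections import Counter

ratings = [
    {"user": "u1", "movie": "m1", "rating": 4, "date": "2005-11-02"},
    {"user": "u1", "movie": "m2", "rating": 3, "date": "2005-11-02"},
    {"user": "u1", "movie": "m3", "rating": 5, "date": "2006-01-15"},
]

# Count how many ratings each user entered on each day...
batch_size = Counter((r["user"], r["date"]) for r in ratings)

# ...and attach it as a feature: large same-day batches tend to be movies
# the user saw a long time ago.
for r in ratings:
    r["ratings_that_day"] = batch_size[(r["user"], r["date"])]
```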

How Machine Learning Fails: Operate in a Vacuum

Prediction in real markets is a series of tradeoffs, and it is most useful with a human in the loop. If my algorithm for forecasting web traffic reduces error by 30% on sites with >2M uniques but increases error by 10% across a set of interesting demographics, or, even worse, lowers error on long-range predictions (where we are less confident anyway) but raises it on short-range ones (where we should be most confident), is that OK? A ton of work goes into tweaking loss functions and modeling broad-brush relevant priors, and every model is a work in progress.
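
One way those tradeoffs show up in practice is as a segment- and horizon-weighted loss, so the penalty reflects whose forecast the error hurts and how confident the prediction was supposed to be. The weights below are invented for illustration; in reality they come out of conversations with the humans in the loop.

```python
# Sketch of a segment- and horizon-weighted squared loss. The weights are
# illustrative assumptions, not numbers from our models.

def weighted_loss(error, segment, horizon_days):
    weight = 1.0
    if segment == "key_demographic":
        weight *= 3.0   # errors here cost more than average
    if horizon_days <= 1:
        weight *= 2.0   # short-range forecasts should be the most trustworthy
    return weight * error ** 2

# Two forecasts with the same raw error are penalized very differently:
print(weighted_loss(0.10, "key_demographic", horizon_days=1))  # ~0.06
print(weighted_loss(0.10, "large_site", horizon_days=30))      # ~0.01
```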

Predictions will never be 100% perfect. At Metamarkets we address this by fully disclosing our track record and error rates. We prefer to work with our partners, letting them overlay their own knowledge of exogenous events on top of our predictions, rather than masquerading as an all-knowing black box.

[Figures: absolute lift and prediction-error histogram]

How Machine Learning Succeeds: Involve the Decision-Makers

We process terabytes of streaming data daily, track cross-elasticity of supply, and predict anomalies due to mispricing. The underlying learning processes scale effectively, with online boosted gradient descent as our supply-prediction workhorse. Loosely coupled with supply prediction, we employ (1) time-series models for capturing evolving cross-elasticity and temporal correlation, and (2) factor-analysis models for finding coherent market and package segments, hitting several sweet spots on the scalability-complexity curve. Finally, we expose several layers of our machine learning stack to our partners, allowing them to understand why we make the predictions (and mistakes) that we do.
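
To give a flavor of the factor-analysis piece, here is a sketch only: it uses a plain truncated SVD as a stand-in for the models described above, and the supply matrix, labels, and rank are invented for illustration.

```python
# Sketch of finding coherent segments by low-rank factorization, using
# numpy's SVD as a stand-in for the factor-analysis models described above.
# The supply matrix and the choice of rank are invented for illustration.
import numpy as np

# Rows: publishers; columns: ad packages; values: normalized supply volume.
supply = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.2, 0.8, 1.0],
])

k = 2  # number of latent segments to keep
U, s, Vt = np.linalg.svd(supply, full_matrices=False)
publisher_factors = U[:, :k] * s[:k]  # each publisher's loading on the segments
package_factors = Vt[:k, :]           # each package's loading on the segments

# Publishers whose loadings point the same way fall into the same market segment.
print(np.round(publisher_factors, 2))
```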

All this, and a jaw-droppingly gorgeous, fully-interactive frontend. But that’s a story for another blog post.