One challenging analytics problem that we work on at Metamarkets is anomaly detection: given a sea of trends, how do you surface the most important ones? In the era of big data, filters are necessary to prevent drowning users in a torrent of information. Surfacing only the starkest deviations from the expected brings focus to the unexpected and interesting.
What does it mean for a trend to be “interesting”? Is it one that is uncharacteristic “vertically” (according to its past history) or “horizontally” (relative to its peers)? A vertical anomalies might encompass a sudden unexpected spike in revenue for a given retailer in an industry, for example. This would also be classified as a horizontal anomaly except in the case of an industry-wide boom. In this post, we consider vertical anomalies in time series data with an algorithm that extends naturally to horizontal anomalies across peer sets (the latter we will address more deeply in an upcoming post).
Matrix Decompositions and Robust PCA
Matrix decomposition techniques are ubiquitous in data analysis and form the core of many data products. Netflix has recently overhauled their recommendation engine with singular value decomposition based techniques that have scaled to more than 5 billion movie ratings. Robust PCA, or RPCA (loosely including its generalization to the noisy setting: stable principal component pursuit), can be seen as a more resistant version of the traditional SVD, allowing it to extract and ignore highly corrupted entries: given a data matrix X, RPCA yields the decomposition X = L + S + E, where L is a matrix of low-rank, S a sparse matrix of grossly corrupted entries, and E a dense error matrix of small order. In the traditional SVD, S is the matrix of zeroes. Geometrically, we expect the bulk of the data to lie close the low-dimensional linear subspace L, so this can be thought of as a subspace outlier detection problem.
RPCA has been applied to the following domains:
- Video surveillance footage: L (stationary background) + S (moving foreground)
- Face recognition: L (face) + S (shadows, specularities, differences in brightness)
- Latent semantic indexing: L (words common to documents) + S (key distinguishing words)
- Collaborative filtering: L (structure for “good” data) + S (an allowance for particularly noisy or tampered with data)
Demonstration of video surveillance application:
Here we see a time-series representing the number of hourly edits of wikipedia pages in the Czech language. There are clear daily periodicities and some large anomalies in the vertical sense: some hours exhibit erratic behavior that doesn’t conform to historical daily trends. Those are exactly the hours that we would like to surface — significant departures from the status quo.
On Alternatives and Robustness
In order to look for the abnormal, we first must have a model of the normal. The natural choices for these kinds of data are methods that can capture the inherent periodic structure such as Fourier or wavelet techniques. However, many model-fitting techniques rely on minimizing sums of squares (e.g., the SVD) and are particularly sensitive to large outliers. Thus, we would like to restrict our fitting techniques to a class of robust estimators. Some examples include minimizing a different cost criterion such as a Huber loss or sum-absolute-value. Alternatively, one could maintain a set of discrete or continuous “robustness weights” according to each observation’s deviation from consensus. One such technique is seasonal trend loess, which attempts to decompose a time series into three components: a LOESS-smoothed trend that captures the slowly-varying long-term behavior, a seasonal component that only models periodicity, and a residual error term. In the example shown, there’s not a particularly compelling reason to choose RPCA over the other methods. The strengths will later become apparent when looking for horizontal anomalies and by being able to control the sparsity of the outlier set.
Comparisons with Non-Robust Methods
In order to apply RPCA, we need to construct our matrix. One strategy is to recognize the strong daily periodicity and exploit it by feeding in each day as a row or column of X. This can be formalized and automated to some extent by filling out the matrix according to periods which maximize some variant of the autocorrelation function. The fitting procedure will then seek to express each observed day as a multiple of a prototypical day (in the rank 1 setting) or as a linear combination of a few.
The blue curve represents the low-rank approximation after unrolling the L matrix. Visually, the blue curve does a pretty nice job of ignoring the outliers, which have collected into the matrix S (represented by red circles). Judging by the singular values, L is numerically rank 3 but morally rank 1. Because the traditional SVD attempts to fit the outliers into its low-rank model, the rank is unnecessarily inflated. By automatically discarding the discrepant points, the robust model is very nearly rank 1 instead of 5 or 10. The principal right singular vector is correspondingly affected: it has two spikes near hour 20 when compared with the smoother RPCA RSV, which looks more like an average daily trend. A true rank 1 fit would say that on March 17th, we saw the standard lull in the morning until 5AM UTC, but due to the monthly growth there were 65% as many edits as on the 26th.
In our next post, we will explore issues of horizontal detection, scale, and optimization strategies.