This is a guest post by Matt Kraning, a Ph.D. candidate in the EE Department at Stanford University, and a colleague of Nelson Ray of Metamarkets.  Warning: there be math ahead.

In two previous posts (vertical and horizontal), we saw how the Robust PCA algorithm could be used to spot meaningful anomalies in Wikipedia and Twitter feeds.  This allowed us to find interesting correlations between a priori unconnected events.  In this post, we dive deeper into the mechanics of the algorithm itself and show how to scale it to find anomalies in massive datasets and tune its parameters for good performance.

From mathematical ideal to feasible computation

As a quick reminder, given a data matrix $X$ , we seek to decompose it as the sum of three simpler matrices: a low-rank matrix $L$  representing gross trends, a sparse matrix $S$  representing anomalies, and a matrix of small entries $E$  representing model error and noise. Using the language of mathematical optimization, we can write out a desired problem description for Robust PCA as

$$$\begin{array}{ll}\mbox{minimize} & \lambda_1{\bf rank}(L) + \lambda_2{\bf card}(S) + \frac{1}{2}||E||_F^2 \quad \mbox{subject to} & L+S+E = X,\end{array}$$$

where $L$ , $S$ , and $E$  are matrix variables that are the same size as our data matrix $X$ , and $\lambda_1$ and $\lambda_2$ are positive constants that serve to trade off between the competing terms of the objective to be minimized (we will show below how we choose the values for $\lambda_1$  and  $\lambda_2$ ).  We can interpret each of the three terms that we are looking to jointly minimize.

The first term, ${\bf rank}(L)$ , returns the rank of the matrix $L$ .  We want to encourage this term to be low-rank so that only the simplest, gross patterns are present.  The second term, ${\bf card}(S)$ , is equal to the number of non-zero elements of $S$ : this penalty encourages the term to be sparse, consisting of a few big outliers.  The final term $||E||_F^2$ is the (squared) Frobenius norm of the matrix $E$ , the sum of its squared entries.  It strongly penalizes large entries in $E$ , helping to ensure that our errors are not too large.  Thus, if we could solve this problem, we would get exactly what we want: a low-rank matrix $L$ , a sparse matrix $S$ , and an error matrix $E$ with small entries, all of which sum up to our data matrix $X$ .

Unfortunately, minimizing either the rank or cardinality of a matrix is NP-Hard, which means that we must use an approximation of both functions in order to make our problem computationally tractable.  The approach that Robust PCA takes is to replace these computationally troublesome functions by their convex relaxations.  Broadly speaking, if minimizing a function is NP-Hard, then minimizing its convex relaxation gives the “best” computationally tractable approximation to the original minimization problem. In the specific case of Robust PCA, the nuclear norm, $||L||_*$ , is the convex relaxation of ${\bf rank}(L)$ , and the $\ell_1$ -norm, $||S||_1$ , is the convex relaxation of ${\bf card}(S)$ .  The nuclear norm of a matrix is equal to the sum of its singular values (as opposed to simply the number of non-zero singular values, which is given by the rank function), and the $\ell_1$ -norm of a matrix is equal to the sum of the absolute values of its entries (rather than simply the number of non-zero entries, which is given by the cardinality function).

By making the requisite function substitutions, we get the RPCA problem

$$$\begin{array}{ll}\mbox{minimize} & \lambda_1 ||L||_* + \lambda_2 ||S||_1 + \frac{1}{2}||E||_F^2 \quad \mbox{subject to} & L+S+E = X, \end{array}$$$

which belongs to a class of problems known as convex optimization problems.  Practically, this means that there are many good algorithms to solve it efficiently, reliably, and quickly.  In the rest of this tutorial, we focus only on the tractable RPCA problem and not the original for which it is the proxy.

Data scale and algorithms: One size does not fit all

Algorithm design requires balancing many competing goals, three of the most prominent being time, space, and accuracy.  As the scale of our datasets changes, we must constantly re-examine our priorities in order to ensure that we get a result that is relatively fast, sufficiently accurate, and space-economical.

Small scale

If our data matrix $X$ is small enough, we can use off-the-shelf algorithms, such as interior point methods, to directly solve the Robust PCA problem very quickly and very accurately.  There are many good packages that support these routines, and they require little tuning to implement. However, these methods require space that scales quadratically with the size of our data, which makes them infeasible for problems with more than $10,000$ or so total data points.

Medium to large scale

The size of data we consider next is significantly larger than what can be handled using interior point methods, but small enough that it fits into the memory of a single computer. In this case, we can solve our problem using proximal gradient methods. These methods are iterative, and at each iteration require singular value thresholding (gives a modified low-rank version of its operand so as to expose gross structure) our current estimate of $L+E$ and soft thresholding (gives a shrunken-toward-zero sparser version of the operand) every element of our current estimate of $S+E$ (we will return to these operations in the next section when we discuss how we pick values for  $\lambda_1$ and $\lambda_2$ ). Computing each of these quantities allows us to iteratively improve our estimates of each matrix, eventually converging to the optimal values for all three matrices.  The first iteration gives us some sparse $S_1$ , some low-rank $L_1$ , and some dense error   $E_1$ .  With subsequent iterations, the respective matrices still retain their sparse, low-rank, or dense structure, but they also get closer and closer to their true, underlying values.  In addition, we can apply recent methods, such as Nesterov acceleration, that have been developed in the last few years which significantly speed up our algorithm.  Putting all this together, we can achieve good accuracy (well beyond the noise inherent in our model) in a time period of a few minutes using space that scales linearly with our dataset.

Huge scale

When datasets do not fit into the memory of a single computer, we need to modify our algorithm to process the data stored on each machine in parallel, and then merge the results in a computationally tractable way. New methods, such as divide and conquer matrix factorization, allow us to do just this for Robust PCA.  Although the theoretical correctness of these methods relies on certain randomness assumptions that do not necessarily hold for real data, in practice they nonetheless achieve reasonable accuracy.

Another approach is to simply use distributed versions of all three of the operations we used in the single machine case. Computing the singular values of $L+E$ is the most computationally intensive part of this process–the soft thresholding operation is easy to perform in parallel using a single MapReduce job. However, since we are only concerned with low-rank models, we only need to compute the top few singular values of $L+E$ , which we can do at scale by using power methods, which only require a multiplication by our current estimate of $L+E$ .  This is very similar to how Google computes the PageRank of web pages, and this matrix multiplication is easily performed by a MapReduce job (in fact, the necessity of this kind of computation at Google was one of the main driving forces behind the creation of MapReduce).

To motivate the choice of regularization parameters, we consider the updates to a block coordinate descent algorithm for solving the problem.  Let $L_i$ and  $L^{\star}$ be the values for our estimates of $L$ at iteration $i$ and the optimal value of $L$ for the Robust PCA problem, respectively, with similar notation for $S$  and $E$ .  Fixing $S_i$ , the update to $L_{i+1}$ is given by singular value thresholding (low-rank-encouraging) of $X-S_i$ by $\lambda_1$ , which we denote by $L_{i+1} = \mathcal{D}_{\lambda_1}(X-S_i)$ . Similarly, the update to  $S_{i+1}$ is given by element-wise soft thresholding (sparsity-encouraging) of  $X-L_{i+1}$ by the parameter $\lambda_2$ , which we denote by $S_{i+1} = \mathcal{S}_{\lambda_2}(X-L_{i+1})$ .

Suppose that $X^{\star} = L^{\star}+S^{\star}+E^{\star}$ and that $E^{\star}$ is filled with independent Normal noise with mean $0$ and known variance $\sigma^2$ . A prescription from Stable PCP is to set $\lambda_2 = \sqrt{2}\sigma$ . If $L_{i+1} = L^{\star}$ , then we have $S_{i+1}=\mathcal{S}_{\lambda_2}(S^{\star} + E^{\star})$ . The idea is to threshold just enough to get rid of most of the contributions of $E^{\star}$ while biasing our estimate of $S^{\star}$ as little as possible. We know that 84% of the mass of a Normal random variable lies within $\sqrt{2}\sigma$ of its mean, so we can expect $S_{i+1}$ to be roughly  $(0.16 + {\bf card}(S^{\star})/(m \times n))$ -sparse, since the noise in $E^{\star}$ is uncorrelated with both the magnitudes and locations of the anomalies in  $S^{\star}$ (this property is known as incoherence). The $\sqrt{2}$ can be adjusted to more closely match the desired sparsity level and tuned to the empirical noise distribution (perhaps Laplacian rather than Normal).

Similarly, if in iteration $j$ we have $S_j = S^{\star}$ , then $L_{j+1}=\mathcal{D}_{\lambda_1}(L^{\star} + E^{\star})$ . Let $||\cdot||_2$ denote the spectral norm, or the maximum singular value, of a matrix.  Our strategy in this case is to singular value threshold enough to get rid of the noise without overly biasing our estimate of $L^{\star}$ .  The i.i.d. model for $E^{\star}$ means that it is drawn from a real Gaussian matrix ensemble.  Using results from random matrix theory, it is known that $||E^{\star}||_2$ is concentrated in the range $[2\sigma \sqrt{n} - O(n^{-1/6}), 2\sigma \sqrt{n}+O(n^{-1/6})]$ for $n \times n$ matrices $E^\star$ , with a similar expression for non-square matrices. By singular value thresholding  $L^{\star}+E^{\star}$ by this amount, we expect to recover a good, low-rank approximation to $L^{\star}$ , since the noise subspaces of $E^{\star}$ should be uncorrelated with both the singular values and singular vectors of $L^{\star}$ , and thus be well filtered by our singular value thresholding.

It only remains to estimate the noise level of $E^{\star}$ , $\sigma^2$ . One method is to let $\hat{\sigma}^2 = \sigma^2(E_i)$ . That is, we can use the empirical variance of our estimate of $E^{\star}$ in each iteration.  Our estimate changes in each iteration (albeit slowly near the end of our algorithm), and this strategy works quite well in practice.

Correcting for the approximation

If you look closely at the plots from the last post, you will see that the green error curve seems to have spikes at the same locations where our anomaly detection matrix $S$ was non-zero. Another way of putting this is that there appear to be “residual anomalies” in that our detector is biased and generally underestimates the full extent of each anomaly, leaving some remainder of it as an error, and another part biasing our low-rank model. This is no coincidence, but in fact an artifact of the approximation problem Robust PCA is actually solving.  As described above, we would like to minimize the rank of $L$ and the cardinality of $S$ , but for computationally tractability have to settle for minimizing these functions’ convex relaxations instead.

However, all is not lost, as we can effectively eliminate this bias. The reason why we systematically underestimate anomaly magnitudes is that large anomalies are penalized much more under the $\ell_1$ -norm than under cardinality penalization, even though we would like all anomalies to be penalized equally, independent of their size. One way we can fix this problem is through a two-phase procedure known as debiasing. In the first phase, we solve the Robust PCA problem in order to get the sparsity pattern $S$ , i.e., to find the anomaly locations. In the second phase, we solve the problem

$$$\begin{array}{ll}\mbox{minimize} & \lambda_1||L||_* + ||E||_F^2 \quad \mbox{subject to} & L+S+E = X,\mathcal{P}_{\Omega}(S) = 0,\end{array}$$$

where the expression $\mathcal{P}_{\Omega}(S) = 0$ restricts all of the entries of $S$ that were not found to be anomalous in the first stage to be zero. Since we are no longer penalizing the size the anomalies in $S$ in the second phase, all anomalies are treated equally, independent of their size. As seen in the figure below, when we perform this two phase process, we stop systematically underestimating the magnitude of anomalies and get more accurate estimates of their true size.

An example of debiasing.  Without running the second phase of the algorithm, the Phase 1 solutions are generally shrunk away from the true underlying signal value, while the debiased, Phase 2 solutions are much closer to the orignal signal.

Robust PCA is a very useful data analysis technique that can be used to discover new and interesting connections in many different datasets.  The ideas presented here span many recent and advanced research papers, and we hope you enjoyed our whirlwind tour through them.