At Metamarkets, our product not only allows exploration of data at scale but also enables discovery: algorithmically spotting interesting trends and putting them in front of decision-makers.
Using data from the public Twitter feed (representing up to 1% of tweets), we set out to discover the most surprising goings-on in the Twittersphere from April 9 to April 23. We do this with the algorithm we introduced in our last post: Robust PCA. To apply it, we first need some idea of what structure in the data we can exploit, since that structure is the baseline against which we call something surprising. In the “vertical” case, we supposed that the day-to-day behavior of a time series was relatively consistent (rank 1), with significant departures indicating points of interest. Our model also allowed for a small number of prototypical signatures: weekends vs. weekdays (rank 2), or perhaps even one for every day of the week (rank 7).
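To make the rank language concrete, here is a minimal Python sketch of the “vertical” layout. It assumes an hourly series is folded into a matrix with one row per day, so that a perfectly repeating daily shape gives rank 1 and a weekday/weekend split gives rank 2. The helper names `to_day_matrix` and `effective_rank` are hypothetical and exist only for this illustration.

```python
import numpy as np

def to_day_matrix(hourly, hours_per_day=24):
    """Fold an hourly series into a (days x hours) matrix, dropping any partial last day."""
    hourly = np.asarray(hourly, dtype=float)
    n_days = hourly.size // hours_per_day
    return hourly[: n_days * hours_per_day].reshape(n_days, hours_per_day)

def effective_rank(D, rel_tol=1e-8):
    """Count singular values that are non-negligible relative to the largest one."""
    s = np.linalg.svd(D, compute_uv=False)
    return int((s > rel_tol * s[0]).sum())

# Synthetic example: two weeks with one weekday shape and one weekend shape.
hours = np.arange(24)
weekday = 1 + np.sin(2 * np.pi * hours / 24)
weekend = 1 + np.cos(2 * np.pi * hours / 24)
two_weeks = np.concatenate(([weekday] * 5 + [weekend] * 2) * 2)

D = to_day_matrix(two_weeks)
print(D.shape, effective_rank(D))
```

Real traffic is never exactly low rank, which is why the factorization has to be robust: the bulk of the signal goes into the low-rank part and the departures are captured separately.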
Cross-Sectional Comparisons: Finding Needles in Twitter’s Haystacks
In the “horizontal” setting, we turn our sights outward and compare anomalies across many different types of time series, the idea being that different dimensions could share similar anomaly patterns. Taking retail stores as an example, we’ll do our grocery shopping relatively consistently throughout the year at Safeway, Trader Joe’s, and Whole Foods, and our holiday shopping at Best Buy and Toys R Us with the expected year-end increases. Apple might see a similar pattern for most of the year, but when a new iPhone is released, we dutifully line up along with the rest of the world around that beautiful structure of glass and steel. That is the proverbial needle that we’d like to pull out of the haystack. For three-hundred-some-odd days of the year, the Apple store is well-approximated by your typical electronics retailer in terms of temporal buying patterns (if not profit margins). However, that is certainly not the case for those two or three annual blockbuster product releases. We use the common trends across industries to discount the expected seasonal variations in order to focus on the truly unique occurrences.
For the Twitter data, there are often big disparities between dimensions. Hashtags are typically associated with transient or irregular phenomena, as opposed to, say, the massive regularity of tweets emanating from a big country. Because of this greater degree of within-dimension similarity, we treat dimensions separately:
- Pick a set of dimensions to concentrate on: a tweet’s first hashtag (e.g. #itscrazyhow), URL domain (instagr.am), user mention name (@justinbieber), retweet name (@UberFacts), reply-to name (@ladygaga), user location (Brasil), and user time zone (Pacific Time).
- For each dimension, store the top 150 (by tweet volume) two-week hourly time series in the rows of a matrix and apply a robust matrix factorization technique. We consider only one dimension at a time here, but we could easily incorporate combinations of dimensions as well.
- Collect the results for all dimensions, and sort the results in descending order of some anomaly metric.
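The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not necessarily the solver behind the results in this post: it assumes Robust PCA is solved as principal component pursuit with a basic ADMM loop, uses the common default weight λ = 1/√max(m, n), and ranks each row by the L1 norm of its sparse component as a stand-in for “some anomaly metric”. The function names `robust_pca` and `rank_anomalies` are hypothetical.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding (proximal operator of the L1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split M into low-rank L plus sparse S via principal component pursuit.

    Assumes M is a non-zero 2-D array with one time series per row.
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))          # common default weight on the sparse term
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())    # common heuristic for the penalty parameter
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                         # dual variable for the constraint M = L + S
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S                            # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S

def rank_anomalies(S, names):
    """Sort series in descending order of the L1 norm of their sparse component."""
    scores = np.abs(S).sum(axis=1)
    order = np.argsort(-scores)
    return [(names[i], float(scores[i])) for i in order]
```

In this sketch, each dimension gets its own matrix (150 rows by 336 hourly columns for two weeks) and its own call to `robust_pca`; the ranked lists from all dimensions are then merged and re-sorted by score.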
Black: time series; Blue: low-rank fit; Red: sparse anomaly; Green: residual error
Our analysis shows that hashtags and usernames dominate the list of anomalies as opposed to locations and time zones. Indeed, the low-rank models for hashtags and usernames fit nearly flat lines, indicating that there is very little similarity among the items in each of these groups; borrowing strength along the hashtag dimension doesn’t work because each hashtag signature is essentially unique. By comparison, the low-rank model for URL domains exhibits a regular, daily periodicity, indicating a pattern common to many of these domains: more posts during the day and fewer at night. #ff (Follow Friday) shows up as the most anomalous because, unlike it, the other 149 hashtag time series are not confined to Friday activity. It would not be classified as a “vertical” anomaly, because its own history is predictable.
With so many anomalies, we still need to make sense of them. What could have caused those massive upsurges in the otherwise-regular is.gd (a URL shortener) and twitcam.livestream.com (a Twitter live video streaming service) domains? Sorting by the magnitude of the anomaly yields a cursory and overly restricted view, because anomalies are often correlated within and between dimensions. Algorithms can complement one another here, and a clustering procedure is a natural next step.
The Hashtag Heard ‘Round the World: #HalaMadrid
By extracting the anomalies of a given time series into their own time series, we can find correlated patterns of anomalies across different dimensions. Here we present the 15 time series whose anomaly patterns are most correlated with the anomaly pattern of the hashtag #halamadrid:
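As a rough illustration of this lookup, the following Python sketch ranks series by the Pearson correlation between their anomaly patterns and that of a target series. It assumes the sparse components from every dimension have been stacked into one matrix `S` with a matching list of labels; the function name `most_correlated` and the choice of Pearson correlation are assumptions made for this example, not a statement of the exact measure used for the figure.

```python
import numpy as np

def most_correlated(S, names, target, k=15):
    """Return the k series whose anomaly patterns correlate most with `target`.

    S     : (n_series x n_hours) array of sparse anomaly components
    names : list of n_series labels, e.g. hashtags, domains, locations
    """
    S = np.asarray(S, dtype=float)
    centered = S - S.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    # Rows with no anomalies have zero norm; leave them as zero vectors (correlation 0).
    safe = np.where(norms > 0, norms, 1.0)
    unit = centered / safe[:, None]
    t = names.index(target)
    corr = unit @ unit[t]                        # Pearson correlation with the target row
    order = [i for i in np.argsort(-corr) if i != t][:k]
    return [(names[i], float(corr[i])) for i in order]
```

Correlating the anomaly components rather than the raw series keeps the shared daily rhythm, which nearly every location and time zone exhibits, from swamping the comparison.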
Salmon-colored bands: sections deemed anomalous for #halamadrid
In this representation, the narrative almost writes itself. The soccer team Real Madrid played in several La Liga matches, including a particularly thrilling victory over Barcelona on April 21, owing to a winning goal by star player Cristiano Ronaldo. The games captured the attention of much of the world from Madrid to Kuwait (الكويت) to Guatemala. Looking closely at a location in a time zone different from the one in which the games were played (e.g. Guatemala), we see clear daily spikes in tweets. On Real Madrid game days, we see a secondary spike a few hours earlier, suggesting that a sizeable percentage of Guatemalan Twitter users are fans of the team. (Note: the narrative above was not algorithmically generated, though it is not hard to imagine creating a system to do so.)
Britain’s Next Bieber: Liam Payne
In a similar vein, let’s examine the preternaturally popular Liam Payne:
British heartthrob/boy band sensation Liam Payne made a splash during his April 18 appearance on twitcam.livestream.com (archived here) in the middle of One Direction’s grand Aussie/Kiwi tour. Even Narnia made an appearance on the list. Looking back at our first image, bandmate Niall Horan was not far behind in popularity.
Thanks go out to Matt Kraning for his useful insight and contributions to this post. In an upcoming post, we will peel back the curtain and give a more detailed tutorial about these algorithms.