The Metamarkets solution allows for arbitrary exploration of massive data sets. Powered by Druid, our in-house distributed data store and processor, users can filter time series and top list queries based on Boolean expressions of dimension values. Given that some of our dataset dimensions contain millions of unique values, the subset of things that may match a particular filter expression may be quite large. To design for these challenges, we needed a fast and accurate (not a fast and approximate) solution, and we once again found ourselves buried under a stack of papers, looking for an answer.
May 4th, 2012 • Fangjin Yang
Filed in Algorithms, Druid
The nascent era of big data brings new challenges, which in turn require new tools and algorithms. At Metamarkets, one such challenge focuses on cardinality estimation: efficiently determining the number of distinct elements within a dimension of a large-scale data set. Cardinality estimations have a wide range of applications from monitoring network traffic to data mining. If leveraged correctly, these algorithms can also be used to provide insights into user engagement and growth, via metrics such as “daily active users.” The HyperLogLog Algorithm: Every Bit is Great It is well known that the cardinality of a large data set can […]