Maximum Performance with Minimum Storage: Data Compression in Druid
September 21st, 2012 Fangjin Yang
The Metamarkets solution allows for arbitrary exploration of massive data sets. Powered by Druid, our in-house distributed data store and processor, it lets users filter time series and top-list queries using Boolean expressions of dimension values. Given that some of our data set dimensions contain millions of unique values, the set of rows matching a particular filter expression may be quite large. To design for these challenges, we needed a fast and accurate (not a fast and approximate) solution, and we once again found ourselves buried under a stack of papers, looking for an answer.
From Justin Bieber to Ones and Zeros
To better understand how Druid stores dimension values, consider the following data set:
Timestamp | Publisher | Advertiser | Gender | Country | Impressions | Clicks | Revenue |
2011-01-01T01:00:00Z | bieberfever.com | google.com | Male | USA | 1800 | 25 | 15.70 |
2011-01-01T01:00:00Z | bieberfever.com | google.com | Male | USA | 2912 | 42 | 29.18 |
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Male | USA | 1953 | 17 | 17.31 |
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Male | USA | 3194 | 170 | 34.01 |
Consider the publisher dimension (column) in the table above. For each unique publisher, we can form some representation indicating in which table rows a particular publisher is seen. We can store this information in a binary array where the array indices represent our rows. If a particular publisher is seen in a certain row, that array index is marked as ‘1’. For example:
bieberfever.com -> rows [1, 2] -> [1][1][0][0]
ultratrimfast.com -> rows [3, 4] -> [0][0][1][1]
In the example above, bieberfever.com is seen in rows 1 and 2. This mapping of dimension values to row indices forms an inverted index, and it is in fact how we store dimension information in Druid. If we want to know which rows contain bieberfever.com OR ultratrimfast.com, we can OR together the bieberfever.com and ultratrimfast.com arrays:
[1][1][0][0] OR [0][0][1][1] = [1][1][1][1]
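To make this concrete, here is a minimal sketch of the same inverted index built with java.util.BitSet standing in for Druid's actual index structures (the class and variable names are hypothetical):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // The publisher column from the example table, one entry per row.
        String[] publisherColumn = {
            "bieberfever.com", "bieberfever.com", "ultratrimfast.com", "ultratrimfast.com"
        };

        // Build one bitmap per distinct publisher: bit i is set if row i has that value.
        Map<String, BitSet> invertedIndex = new HashMap<>();
        for (int row = 0; row < publisherColumn.length; row++) {
            invertedIndex.computeIfAbsent(publisherColumn[row], k -> new BitSet()).set(row);
        }

        // publisher = bieberfever.com OR publisher = ultratrimfast.com
        BitSet matches = (BitSet) invertedIndex.get("bieberfever.com").clone();
        matches.or(invertedIndex.get("ultratrimfast.com"));
        System.out.println(matches);   // {0, 1, 2, 3}: every row matches
    }
}
```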
This idea forms the basis of how to perform Boolean operations on large bitmap sets. A challenge remains, however: if each array consists of millions or billions of entries, and we have to OR together millions of such arrays, performance could become a major issue. Thankfully for us, most bitmap indices are either very sparse or very dense, which is something that can be leveraged for compression.
Bit arrays, or bitmaps, are frequently employed in areas such as data warehousing and data mining to significantly reduce storage costs. Bitmap compression is a well-studied area of research, and such algorithms often utilize run-length encoding. Well-known algorithms include Byte-aligned Bitmap Code, Word-Aligned Hybrid (WAH) code, and Partitioned Word-Aligned Hybrid (PWAH) compression.
A Concise Solution
Most word-aligned run-length encoding algorithms represent long sequences of ones and zeros in a single word: the word contains the length of the run and a flag indicating whether it is a one fill or a zero fill. Sequences that contain a mixture of 0 and 1 bits are stored in 32-bit words known as literals. An example of word-aligned hybrid compression is shown below:
Given a bitstream: [10110…1][000…010][010…011]
There are three separate 32-bit words in the bitstream:
1) [1]0110…1 – 31 “dirty” bits (a literal)
2) [00]0…010 – 31 x 2 zeros (a sequence of zeros)
3) [01]0…011 – 31 x 3 ones (a sequence of ones)
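To make the decoding rule concrete, here is a minimal sketch. The bit layout (the top bit flags a literal; for fills, the next bit gives the fill value and the low 30 bits count 31-bit blocks) is an illustrative assumption chosen to match the example above, not WAH's or Druid's exact format, and the class and method names are hypothetical:

```java
public class WordAlignedDecodeSketch {
    // Describe a single 32-bit word in the simplified layout sketched above.
    static void describe(int word) {
        if ((word & 0x80000000) != 0) {
            // Literal: the low 31 bits are "dirty" bits copied straight from the bitmap.
            System.out.printf("literal: 31 bits = %s%n",
                    Integer.toBinaryString(word & 0x7FFFFFFF));
        } else {
            int fillBit = (word >>> 30) & 1;   // 0 -> run of zeros, 1 -> run of ones
            int blocks  = word & 0x3FFFFFFF;   // run length, counted in 31-bit blocks
            System.out.printf("fill of %ds: %d x 31 bits%n", fillBit, blocks);
        }
    }

    public static void main(String[] args) {
        describe(0x80000005);        // a literal whose low bits are ...101
        describe(2);                 // 31 x 2 zeros
        describe((1 << 30) | 3);     // 31 x 3 ones
    }
}
```

The payoff is that a single 32-bit word can stand in for an arbitrarily long run of identical bits (up to the capacity of the count field), which is why very sparse or very dense bitmaps compress so well.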
Concise bitmap compression introduces the concept of a mixed fill, where a fill and a literal can be represented in a single word. The author of the original Concise paper claims that Concise outperforms WAH by reducing the size of the compressed bitmaps by up to 50%. For mixed fill words, the first 2 bits indicate the type of fill (zeros or ones), and the next 5 bits can be used to indicate the position of a bit that flips from 0 to 1 (or vice versa) within the fill. An example of the Concise representation of the integer set {3, 5, 31–93, 1024, 1028, 1040187422} is shown below:
1) [1]0…101000
2) [01][00000]0…01
3) [00][00001]0…11101
4) [1]0…100010
5) [00][00000]1…1011101
6) [1]10…0
Efficiency at Scale
Although Concise compression can greatly reduce the size of resulting bitmaps, we still have the problem of performing efficient Boolean operations on top of a large number of Concise sets. Luckily, Concise sets share a very important property with other bitmap compression schemes: they can be operated on in their compressed form. The Boolean operations we care about are AND, OR, and NOT. The NOT operation is the most straightforward to implement. Literals are directly complemented and fills of zeros and ones are inverted. ANDing and ORing sets prove to be more challenging.
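As a small illustration, the NOT case can be sketched against the same simplified word layout used in the decoding example above (again an illustrative assumption, not Concise's exact encoding; the class name is hypothetical):

```java
public class CompressedNotSketch {
    // NOT a compressed set without decompressing it: literals have their 31
    // payload bits complemented, and fills have their fill bit flipped so runs
    // of zeros become runs of ones (and vice versa) while keeping their length.
    static int[] not(int[] words) {
        int[] out = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            int w = words[i];
            out[i] = (w & 0x80000000) != 0
                    ? 0x80000000 | (~w & 0x7FFFFFFF)  // literal: complement the payload bits
                    : w ^ 0x40000000;                 // fill: zero fill <-> one fill, same length
        }
        return out;
    }
}
```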
Consider ORing two sets where one set is a long sequence of ones and the other contains a shorter sequence of ones, a sequence of zeros, and some literals. If the sequence of ones in the first set is sufficiently long to encompass the second set, we don’t need to care about the second set at all (yay for Boolean logic!). Hence, when ORing sets, sequences of ones always have priority over sequences of zeros and literals. Similarly, a sequence of zeros can be ignored; it contributes nothing to the overall Boolean logic. Extending this idea to n sets, we can examine the starting word of every set and determine whether a one fill exists. If so, we find the longest one fill and advance all the sets forward past it. At this new stopping point, we repeat the search for the longest one fill. If no such fill exists, we gather all literals at that position, OR them together, and continue advancing forward. The same idea applies when ANDing sets, except that sequences of zeros now have the highest priority. This pseudo-algorithm, combined with some additional logic to handle mixed fills, proved sufficient to meet our performance requirements.
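The merge loop described above can be sketched as follows, again over the simplified word layout from the decoding example rather than Druid's or Concise's actual classes; all names here are hypothetical and mixed fills are ignored for brevity:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch of the n-way OR strategy: MSB = 1 marks a 31-bit literal;
 * otherwise bit 30 is the fill bit and the low 30 bits count 31-bit blocks.
 */
public class CompressedOrSketch {
    static final int LITERAL_FLAG = 0x80000000;
    static final int ONE_FILL_BIT = 0x40000000;

    /** Cursor over one compressed set, positioned at a 31-bit block boundary. */
    static final class Cursor {
        final int[] words;
        int wordIdx = 0;
        int blocksLeft = 0;   // blocks remaining in the current word
        Cursor(int[] words) { this.words = words; reload(); }
        boolean done() { return wordIdx >= words.length; }
        boolean isLiteral() { return !done() && (words[wordIdx] & LITERAL_FLAG) != 0; }
        boolean isOneFill() { return !done() && !isLiteral() && (words[wordIdx] & ONE_FILL_BIT) != 0; }
        int literalBits() { return words[wordIdx] & 0x7FFFFFFF; }
        void reload() { blocksLeft = (done() || isLiteral()) ? 1 : (words[wordIdx] & 0x3FFFFFFF); }
        /** Advance this cursor by the given number of 31-bit blocks. */
        void advance(int blocks) {
            while (blocks > 0 && !done()) {
                int step = Math.min(blocks, blocksLeft);
                blocksLeft -= step;
                blocks -= step;
                if (blocksLeft == 0) { wordIdx++; reload(); }
            }
        }
    }

    /** OR n compressed sets without decompressing them into full bitmaps. */
    static List<Integer> or(List<int[]> sets) {
        List<Cursor> cursors = new ArrayList<>();
        for (int[] s : sets) cursors.add(new Cursor(s));
        List<Integer> out = new ArrayList<>();
        while (cursors.stream().anyMatch(c -> !c.done())) {
            // 1) If any set has a one fill at this position, find the longest one and
            //    advance every set past it: those blocks are all ones no matter what
            //    the other sets contain.
            int longestOneFill = 0;
            for (Cursor c : cursors) {
                if (c.isOneFill()) longestOneFill = Math.max(longestOneFill, c.blocksLeft);
            }
            if (longestOneFill > 0) {
                out.add(ONE_FILL_BIT | longestOneFill);
                for (Cursor c : cursors) c.advance(longestOneFill);
                continue;
            }
            // 2) Otherwise OR the literals at this position; zero fills contribute nothing.
            //    (A production implementation would also coalesce adjacent output words.)
            int merged = 0;
            for (Cursor c : cursors) {
                if (c.isLiteral()) merged |= c.literalBits();
            }
            out.add(LITERAL_FLAG | merged);
            for (Cursor c : cursors) c.advance(1);
        }
        return out;
    }
}
```

ANDing works the same way with the roles reversed: zero fills take the highest priority, and the literals at each position are ANDed together.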
Results
The following results were generated on a cc2.8xlarge system with a single thread, 2G heap, 512m young gen, and a forced GC between each run. The data set is a single day’s worth of data collected from the Twitter garden hose data stream. The data set contains 2,272,295 rows. The table below demonstrates a size comparison between Concise compressed sets and regular integer arrays for different dimensions.
Dimension | Cardinality | Concise compressed size (bytes) | Integer array size (bytes) | Concise size as a % of integer array size |
Has_mention | 2 | 586,400 | 9,089,180 | 6.451627 |
Has_links | 2 | 580,872 | 9,089,180 | 6.390808 |
Has_geo | 2 | 144,004 | 9,089,180 | 1.584345 |
Is_retweet | 2 | 584,592 | 9,089,180 | 6.431735 |
Is_viral | 2 | 358,380 | 9,089,180 | 3.942930 |
User_lang | 21 | 1,414,000 | 9,089,180 | 15.556959 |
User_time_zone | 142 | 3,876,244 | 9,089,180 | 42.646795 |
URL_domain | 31,165 | 1,562,428 | 9,089,180 | 17.189978 |
First_hashtag | 100,728 | 1,837,144 | 9,089,180 | 20.212428 |
Rt_name | 182,704 | 2,235,288 | 9,089,180 | 24.592846 |
Reply_to_name | 620,421 | 5,673,504 | 9,089,180 | 62.420416 |
User_location | 637,774 | 9,511,844 | 9,089,180 | 104.650188 |
User_mention_name | 923,842 | 9,086,416 | 9,089,180 | 99.969590 |
User_name | 1,784,369 | 16,000,028 | 9,089,180 | 176.033790 |
Total Concise compressed size = 53,451,144 bytes
Total integer array size = 127,248,520 bytes
Overall, the Concise compressed sets occupy about 42.005317% of the space of the integer arrays.
We also re-sorted the rows of the data set to maximize compression and see how the results would be affected.
Dimension | Cardinality | Concise compressed size (bytes) | Integer array size (bytes) | Concise size as a % of integer array size |
Has_mention | 2 | 744 | 9,089,180 | 0.008186 |
Has_links | 2 | 1,504 | 9,089,180 | 0.016547 |
Has_geo | 2 | 2,840 | 9,089,180 | 0.031246 |
Is_retweet | 2 | 1,616 | 9,089,180 | 0.017779 |
Is_viral | 2 | 1,488 | 9,089,180 | 0.016371 |
User_lang | 21 | 38,416 | 9,089,180 | 0.422656 |
User_time_zone | 142 | 319,644 | 9,089,180 | 3.516753 |
URL_domain | 31,165 | 700,752 | 9,089,180 | 7.709738 |
First_hashtag | 100,728 | 1,505,292 | 9,089,180 | 16.561362 |
Rt_name | 182,704 | 1,874,180 | 9,089,180 | 20.619902 |
Reply_to_name | 620,421 | 5,404,108 | 9,089,180 | 59.456497 |
User_location | 637,774 | 9,091,016 | 9,089,180 | 100.075340 |
User_mention_name | 923,842 | 8,686,384 | 9,089,180 | 95.568401 |
User_name | 1,784,369 | 16,204,900 | 9,089,180 | 178.287810 |
Total Concise compressed size = 43,832,884 bytes
Total integer array size = 127,248,520 bytes
Interestingly, although sorting improved compression dramatically for the low-cardinality dimensions, the overall gain was modest: the total Concise size is now 34.448031% of the total integer array size.
To understand the performance implications of using Concise sets versus integer arrays, we chose several dimensions from our data set with varying cardinalities and generated Concise sets for every dimension value of every selected dimension. The histograms below indicate the size distribution of the generated Concise sets for a given dimension. Each test run randomly picked a given number of Concise sets and performed Boolean operations on them. Integer array representations of these Concise sets were then created and the same Boolean operations were run on the integer arrays. There were 100 runs per test case, and the average time required to perform a Boolean operation is shown for each dimension in the first of the two tables below. The second table shows the performance results when the sets used in the Boolean operation always include the largest set (by size) of the dimension.
Dimension: User_time_zone
Cardinality: 142
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 31 | 20 | 1 | 0 |
25 | 66 | 53 | 2 | 0 |
50 | 159 | 153 | 4 | 0 |
100 | 339 | 322 | 7 | 0 |
Always including the largest Concise set of the dimension:
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 44 | 77 | 1 | 0 |
25 | 92 | 141 | 2 | 0 |
50 | 184 | 223 | 4 | 0 |
100 | 398 | 419 | 8 | 0 |
Dimension: URL_domain
Cardinality: 31,165
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 0 | 0 | 0 | 0 |
25 | 0 | 2 | 0 | 0 |
50 | 0 | 0 | 0 | 0 |
100 | 0 | 0 | 0 | 0 |
1,000 | 8 | 24 | 0 | 1 |
5,000 | 54 | 132 | 3 | 57 |
10,000 | 111 | 286 | 8 | 284 |
25,000 | 348 | 779 | 22 | 1,925 |
Always including the largest Concise set of the dimension:
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 14 | 172 | 0 | 0 |
25 | 17 | 242 | 0 | 0 |
50 | 19 | 298 | 0 | 0 |
100 | 22 | 356 | 0 | 0 |
1,000 | 35 | 569 | 1 | 1 |
5,000 | 89 | 865 | 4 | 59 |
10,000 | 158 | 1,050 | 9 | 289 |
25,000 | 382 | 1,618 | 21 | 1,949 |
Dimension: RT_name
Cardinality: 182,704
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 0 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 |
50 | 0 | 0 | 0 | 0 |
100 | 0 | 0 | 0 | 0 |
1,000 | 1 | 0 | 0 | 1 |
5,000 | 11 | 31 | 3 | 57 |
10,000 | 25 | 68 | 7 | 284 |
25,000 | 98 | 118 | 20 | 1,925 |
50,000 | 224 | 292 | – | – |
100,000 | 521 | 727 | – | – |
Note: for AND operations on 50,000+ items, our implementation of the array-based approach produced stack overflow errors. Instead of changing the implementation to one that didn’t, we simply stopped the comparison at that point.
Always including the largest Concise set of the dimension:
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 14 | 168 | 0 | 0 |
25 | 16 | 236 | 0 | 0 |
50 | 18 | 289 | 0 | 0 |
100 | 20 | 348 | 0 | 0 |
1,000 | 29 | 551 | 1 | 1 |
5,000 | 44 | 712 | 4 | 59 |
10,000 | 69 | 817 | 8 | 289 |
25,000 | 161 | 986 | 20 | 1,949 |
50,000 | 303 | 1,182 | – | – |
Dimension: User_location
Cardinality: 637,774
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 0 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 |
50 | 0 | 0 | 0 | 0 |
100 | 0 | 0 | 0 | 0 |
1,000 | 2 | 0 | 0 | 1 |
5,000 | 15 | 7 | 3 | 57 |
10,000 | 34 | 16 | 8 | 284 |
25,000 | 138 | 54 | 21 | 1,927 |
50,000 | 298 | 128 | – | – |
100,000 | 650 | 271 | – | – |
250,000 | 1,695 | 881 | – | – |
500,000 | 3,433 | 2,311 | – | – |
Always including the largest Concise set of the dimension:
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 14 | 47 | 0 | 0 |
25 | 16 | 67 | 0 | 0 |
50 | 18 | 80 | 0 | 0 |
100 | 20 | 97 | 0 | 0 |
1,000 | 29 | 153 | 1 | 1 |
5,000 | 48 | 206 | 4 | 59 |
10,000 | 81 | 233 | 9 | 290 |
25,000 | 190 | 294 | 21 | 1,958 |
50,000 | 359 | 378 | – | – |
Dimension: User_name
Cardinality: 1,784,369
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 0 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 |
50 | 0 | 0 | 0 | 0 |
100 | 0 | 0 | 0 | 0 |
1,000 | 1 | 0 | 0 | 1 |
5,000 | 7 | 2 | 3 | 57 |
10,000 | 17 | 6 | 7 | 283 |
25,000 | 74 | 19 | 21 | 1,928 |
50,000 | 177 | 45 | – | – |
100,000 | 440 | 108 | – | – |
250,000 | 1,225 | 379 | – | – |
500,000 | 2,504 | 978 | – | – |
1,000,000 | 5,076 | 2,460 | – | – |
1,250,000 | 6,331 | 3,265 | – | – |
1,500,000 | 7,622 | 4,036 | – | – |
1,750,000 | 8,911 | 4,982 | – | – |
Always including the largest Concise set of the dimension:
Number of filter elements | OR operation with Concise sets (ms) | OR operation with integer arrays (ms) | AND operation with Concise sets (ms) | AND operation with integer arrays (ms) |
10 | 0 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 |
50 | 0 | 0 | 0 | 0 |
100 | 0 | 0 | 0 | 0 |
1,000 | 1 | 0 | 0 | 1 |
5,000 | 8 | 2 | 3 | 59 |
10,000 | 22 | 6 | 7 | 289 |
25,000 | 77 | 19 | 21 | 1,954 |
50,000 | 196 | 45 | – | – |