# Maximum Performance with Minimum Storage: Data Compression in Druid

The Metamarkets solution allows for arbitrary exploration of massive data sets. Powered by Druid, our in-house distributed data store and processor, users can filter time series and top list queries based on Boolean expressions of dimension values. Given that some of our dataset dimensions contain millions of unique values, the subset of things that may match a particular filter expression may be quite large. To design for these challenges, we needed a fast and accurate (not a fast and approximate) solution, and we once again found ourselves buried under a stack of papers, looking for an answer.

From Justin Bieber to Ones and Zeros

To better understand how Druid stores dimension values, consider the following data set:

 Timestamp Publisher Advertiser Gender Country Impressions Clicks Revenue 2011-01-01T01:00:00Z bieberfever.com google.com Male USA 1800 25 15.70 2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18 2011-01-01T02:00:00Z ultratrimfast.com google.com Male USA 1953 17 17.31 2011-01-01T02:00:00Z ultratrimfast.com google.com Male USA 3194 170 34.01

Consider the publisher dimension (column) in the table above. For each unique publisher, we can form some representation indicating in which table rows a particular publisher is seen. We can store this information in a binary array where the array indices represent our rows. If a particular publisher is seen in a certain row, that array index is marked as ‘1’. For example:

Bieberfever.com -> [1, 2] -> [1][1][0][0]

Ultratrimfast.com -> [3, 4] -> [0][0][1][1]

In the example above bieberfever.com is seen in rows 1 and 2. This mapping of dimension values to row indices forms an inverted index and is in fact how we store dimension information in Druid. If we want to know which rows contain bieberfever.com OR ultratrimfast.com, we can OR together the bieberfever.com and ultratrimfast.com arrays.

[0][1][0][1] OR [1][0][1][0] = [1][1][1][1]

This idea forms the basis of how to perform Boolean operations on large bitmap sets. A challenge still remains in that if each array consisted of millions or billions of entries and if we had to OR together millions of such arrays, performance can potentially become a major issue. Thankfully for us, most bitmap indices are either very sparse or very dense, which is something that can be leveraged for compression.

Bit arrays, or bitmaps, are frequently employed in areas such as data warehousing and data mining to significantly reduce storage costs. Bitmap compression algorithms are a well-defined area of research and often utilize run-length encoding. Well known algorithms include Byte-aligned Bitmap Code, Word-Aligned Hybrid (WAH) code, and Partitioned Word-Aligned Hybrid (PWAH) compression.

A Concise Solution

Most word-aligned run-length encoding algorithms represent long sequences of ones and zeros in a single word. The word contains the length of the sequence and some information about whether it is a one fill or a zero fill. Sequences that contain a mixture of 0 and 1 bits are stored in 32 bit blocks known as literals. An example of word-aligned hybrid compression is shown below:

Given a bitstream: [10110…1][000…010][010…011]

There are three separate 32 bit sequences in the bitstream.

1) [1]0110…1 – 31 “dirty” bits (a literal)

2) [00]0…010 – 31 x 2 zeros (a sequence of zeros)

3) [01]0…011 – 31 x 3 ones (a sequences of ones)

Concise bitmap compression introduces the concept of a mixed fill, where fills and literals can be represented in a single word. The author of the original Concise paper claims that Concise outperforms WAH by reducing the size of the compressed bitmaps by up to 50%. For mixed fill sequences, the first 2 bits indicate the type of fill (0 or 1). The next 5 bits can be used to indicate the position where bits flip from 0 to 1 or vice versa. An example of the Concise representation for the integer set {3, 5, 31-93, 1,024, 1,028, 1,040,187,422} is shown below:

1) [1]0…101000

2) [01][00000]0…01

3) [00][00001]0…11101

4) [1]0…100010

5)[00][00000]1…1011101

6)[1]10…0

Efficiency at Scale

Although Concise compression can greatly reduce the size of resulting bitmaps, we still have the problem of performing efficient Boolean operations on top of a large number of Concise sets. Luckily, Concise sets share a very important property with other bitmap compression schemes: they can be operated on in their compressed form.  The Boolean operations we care about are AND, OR, and NOT. The NOT operation is the most straightforward to implement. Literals are directly complemented and fills of zeros and ones are inverted. ANDing and ORing sets prove to be more challenging.

Consider ORing two sets where one set is a long sequence of ones and the other set contains a shorter sequence of ones, a sequence of zeros, and some literals. If the sequence of ones in the first set is sufficiently long enough to encompass the second set, we don’t need to care about the second set at all (yay for Boolean logic!). Hence, when ORing sets, sequences of ones always have priority over sequences of zeros and literals. Similarly, a sequence of zeros can be ignored; the sequence contributes nothing to the overall Boolean logic. Extending this idea further with n sets, we can examine the first starting word of every set and determine if a one fill exists. If so, we can find the longest one fill and advance all the sets forward past this one fill. At this new stopping point, we repeat the search for the longest one fill. If no such fill exists, we search for all literals at that position, OR the literals together, and continue advancing forward. The same idea applies for ANDing sets, except now sequences of zeros have the highest priority. This pseudo-algorithm, combined with some additional logic to address mixed fills, was found to be sufficient to address our performance requirements.

Results

The following results were generated on a cc2.8xlarge system with a single thread, 2G heap, 512m young gen, and a forced GC between each run. The data set is a single day’s worth of data collected from the Twitter garden hose data stream. The data set contains 2, 272, 295 rows. The table below demonstrates a size comparison between Concise compressed sets and regular integer arrays for different dimensions.

 Dimension Cardinality Concise compressed size (bytes) Integer array size (bytes) Concise size as a % of integer array size Has_mention 2 586,400 9,089,180 6.451627 Has_links 2 580,872 9,089,180 6.390808 Has_geo 2 144,004 9,089,180 1.584345 Is_retweet 2 584,592 9,089,180 6.431735 Is_viral 2 358,380 9,089,180 3.942930 User_lang 21 1,414,000 9,089,180 15.556959 User_time_zone 142 3,876,244 9,089,180 42.646795 URL_domain 31,165 1,562,428 9,089,180 17.189978 First_hashtag 100,728 1,837,144 9,089,180 20.212428 Rt_name 182,704 2,235,288 9,089,180 24.592846 Reply_to_name 620,421 5,673,504 9,089,180 62.420416 User_location 637,774 9,511,844 9,089,180 104.650188 User_mention_name 923,842 9,086,416 9,089,180 99.969590 User_name 1,784,369 16,000,028 9,089,180 176.033790

Total concise compressed size = 53, 451, 144 bytes

Total integer array size = 127, 248, 520 bytes

Overall, Concise compressed sets are about 42.005317% less than integer arrays.

We also resorted the rows of the data set to maximize compression to see how the results would be affected.

 Dimension Cardinality Concise compressed size (bytes) Integer array size (bytes) Concise size as a % of integer array size Has_mention 2 744 9,089,180 0.008186 Has_links 2 1,504 9,089,180 0.016547 Has_geo 2 2,840 9,089,180 0.031246 Is_retweet 2 1,616 9,089,180 0.017779 Is_viral 2 1,488 9,089,180 0.016371 User_lang 21 38,416 9,089,180 0.422656 User_time_zone 142 319,644 9,089,180 3.516753 URL_domain 31,165 700,752 9,089,180 7.709738 First_hashtag 100,728 1,505,292 9,089,180 16.561362 Rt_name 182,704 1,874,180 9,089,180 20.619902 Reply_to_name 620,421 5,404,108 9,089,180 59.456497 User_location 637,774 9,091,016 9,089,180 100.075340 User_mention_name 923,842 8,686,384 9,089,180 95.568401 User_name 1,784,369 16,204,900 9,089,180 178.287810

Total concise compressed size = 43,832,884 bytes

Total integer array size = 127, 248, 520 bytes

What is interesting to note is that after sorting, global compression only increased minimally. The total Concise set size to total integer array size is 34.448031%.

To understand the performance implications of using Concise sets versus integer arrays, we choose several dimensions from our data set with varying cardinalities and generated Concise sets for every dimension value of every selected dimension. The histograms below indicate the size distribution of the generated Concise sets for a given dimension. Each test run randomly picked a given number of Concise sets and performed Boolean operations with them.  Integer array representations of these Concise sets were then created and the same Boolean operations were run on the integer arrays. There were 100 runs per test case and the average run time required to perform a Boolean operation is shown for each dimension in the first of two tables below. The second table shows the performance results when the sets used in the Boolean operation alway include the largest (size) set of the dimension.

Dimension: User_time_zone

Cardinality: 142

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 31 20 1 0 25 66 53 2 0 50 159 153 4 0 100 339 322 7 0

Always including the largest Concise set of the dimension:

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 44 77 1 0 25 92 141 2 0 50 184 223 4 0 100 398 419 8 0

Dimension: URL_domain

Cardinality: 31,165

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 0 0 0 0 25 0 2 0 0 50 0 0 0 0 100 0 0 0 0 1,000 8 24 0 1 5,000 54 132 3 57 10,000 111 286 8 284 25,000 348 779 22 1,925

Always including the largest Concise set of the dimension:

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 14 172 0 0 25 17 242 0 0 50 19 298 0 0 100 22 356 0 0 1,000 35 569 1 1 5,000 89 865 4 59 10,000 158 1,050 9 289 25,000 382 1,618 21 1,949

Dimension: RT_name

Cardinality: 182,704

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 0 0 0 0 25 0 0 0 0 50 0 0 0 0 100 0 0 0 0 1,000 1 0 0 1 5,000 11 31 3 57 10,000 25 68 7 284 25,000 98 118 20 1,925 50,000 224 292 – – 100,000 521 727 – –

Note: for AND operations on 50,000+ items, our implementation of the array based approach produced StackOverflow exceptions.  Instead of changing the implementation to something that didn’t, we just decided not to do comparisons beyond that point.

Always including the largest Concise set of the dimension:

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 14 168 0 0 25 16 236 0 0 50 18 289 0 0 100 20 348 0 0 1,000 29 551 1 1 5,000 44 712 4 59 10,000 69 817 8 289 25,000 161 986 20 1,949 50,000 303 1,182 – –

Dimension: User_location

Cardinality: 637,774

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 0 0 0 0 25 0 0 0 0 50 0 0 0 0 100 0 0 0 0 1,000 2 0 0 1 5,000 15 7 3 57 10,000 34 16 8 284 25,000 138 54 21 1,927 50,000 298 128 – – 100,000 650 271 – – 250,000 1,695 881 – – 500,000 3,433 2,311 – –

Always including the largest Concise set of the dimension:

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 14 47 0 0 25 16 67 0 0 50 18 80 0 0 100 20 97 0 0 1,000 29 153 1 1 5,000 48 206 4 59 10,000 81 233 9 290 25,000 190 294 21 1,958 50,000 359 378 – –

Dimension: User_name

Cardinality: 1,784,369

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 0 0 0 0 25 0 0 0 0 50 0 0 0 0 100 0 0 0 0 1,000 1 0 0 1 5,000 7 2 3 57 10,000 17 6 7 283 25,000 74 19 21 1,928 50,000 177 45 – – 100,000 440 108 – – 250,000 1,225 379 – – 500,000 2,504 978 – – 1,000,000 5,076 2,460 – – 1,250,000 6,331 3,265 – – 1,500,000 7,622 4,036 – – 1,750,000 8,911 4,982 – –

Always including the largest Concise set of the dimension:

 Number of filter elements OR operation with Concise set (ms) OR operation with integer arrays (ms) AND operations with Concise set (ms) AND operation with integer arrays (ms) 10 0 0 0 0 25 0 0 0 0 50 0 0 0 0 100 0 0 0 0 1,000 1 0 0 1 5,000 8 2 3 59 10,000 22 6 7 289 25,000 77 19 21 1,954 50,000 196 45 – –