Maximum Performance with Minimum Storage: Data Compression in Druid

The Metamarkets solution allows for arbitrary exploration of massive data sets. Powered by Druid, our in-house distributed data store and processor, it lets users filter time series and top list queries using Boolean expressions of dimension values. Given that some of our dataset dimensions contain millions of unique values, the subset of rows that may match a particular filter expression can be quite large. To design for these challenges, we needed a fast and accurate (not a fast and approximate) solution, and we once again found ourselves buried under a stack of papers, looking for an answer.

From Justin Bieber to Ones and Zeros

To better understand how Druid stores dimension values, consider the following data set:

Timestamp             Publisher          Advertiser  Gender  Country  Impressions  Clicks  Revenue
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA             1800      25    15.70
2011-01-01T01:00:00Z  bieberfever.com    google.com  Male    USA             2912      42    29.18
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    USA             1953      17    17.31
2011-01-01T02:00:00Z  ultratrimfast.com  google.com  Male    USA             3194     170    34.01

Consider the publisher dimension (column) in the table above. For each unique publisher, we can form some representation indicating in which table rows a particular publisher is seen. We can store this information in a binary array where the array indices represent our rows. If a particular publisher is seen in a certain row, that array index is marked as ‘1’. For example:

Bieberfever.com -> [1, 2] -> [1][1][0][0]

Ultratrimfast.com -> [3, 4] -> [0][0][1][1]

In the example above bieberfever.com is seen in rows 1 and 2. This mapping of dimension values to row indices forms an inverted index and is in fact how we store dimension information in Druid. If we want to know which rows contain bieberfever.com OR ultratrimfast.com, we can OR together the bieberfever.com and ultratrimfast.com arrays.

[1][1][0][0] OR [0][0][1][1] = [1][1][1][1]
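To make this concrete, here is a minimal sketch of the same inverted index using java.util.BitSet (for illustration only; Druid's actual storage format differs):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// A minimal sketch: build an inverted index over the publisher column,
// then answer an OR filter with a single bitmap operation.
public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] publisherColumn = {
            "bieberfever.com", "bieberfever.com",
            "ultratrimfast.com", "ultratrimfast.com"
        };

        // value -> bitmap of the rows in which that value appears
        Map<String, BitSet> index = new HashMap<>();
        for (int row = 0; row < publisherColumn.length; row++) {
            index.computeIfAbsent(publisherColumn[row], v -> new BitSet())
                 .set(row);
        }

        // "publisher = bieberfever.com OR ultratrimfast.com"
        BitSet matches = (BitSet) index.get("bieberfever.com").clone();
        matches.or(index.get("ultratrimfast.com"));
        System.out.println(matches); // {0, 1, 2, 3}: all four rows match
    }
}
```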

This idea forms the basis of how to perform Boolean operations on large bitmap sets. A challenge remains: if each array consists of millions or billions of entries, and we have to OR together millions of such arrays, performance can become a major issue. Thankfully for us, most bitmap indices are either very sparse or very dense, which is something that can be leveraged for compression.

Bit arrays, or bitmaps, are frequently employed in areas such as data warehousing and data mining to significantly reduce storage costs. Bitmap compression algorithms are a well-defined area of research and often utilize run-length encoding. Well known algorithms include Byte-aligned Bitmap Code, Word-Aligned Hybrid (WAH) code, and Partitioned Word-Aligned Hybrid (PWAH) compression.

A Concise Solution

Most word-aligned run-length encoding algorithms represent long sequences of ones and zeros in a single word. The word contains the length of the sequence and some information about whether it is a one fill or a zero fill. Sequences that contain a mixture of 0 and 1 bits are stored in 32-bit words known as literals. An example of word-aligned hybrid compression is shown below:

Given a bitstream: [10110…1][000…010][010…011]

There are three separate 32-bit words in the bitstream:

1) [1]0110…1 – 31 “dirty” bits (a literal)

2) [00]0…010 – 31 x 2 zeros (a fill of zeros)

3) [01]0…011 – 31 x 3 ones (a fill of ones)
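As an illustration, the sketch below decodes single words laid out as in this example: a leading 1 marks a literal of 31 dirty bits, while a leading 0 marks a fill whose second bit gives the fill value and whose remaining 30 bits count 31-bit blocks. (Bit-layout conventions differ across WAH variants; this simply mirrors the example above.)

```java
// A sketch of decoding one 32-bit word under the layout shown above.
public class WordDecodeSketch {
    static void describe(int word) {
        if ((word >>> 31) == 1) {
            // literal: the low 31 bits are stored verbatim
            System.out.println("literal: "
                + Integer.toBinaryString(word & 0x7FFFFFFF));
        } else {
            int fillBit = (word >>> 30) & 1;  // 0-fill or 1-fill
            int blocks  = word & 0x3FFFFFFF;  // run length in 31-bit blocks
            System.out.println("fill of " + fillBit + "s: "
                + blocks + " blocks = " + (blocks * 31L) + " bits");
        }
    }

    public static void main(String[] args) {
        describe(0x80000001); // [1]0…1    -> a literal
        describe(0x00000002); // [00]0…010 -> 31 x 2 zeros
        describe(0x40000003); // [01]0…011 -> 31 x 3 ones
    }
}
```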

Concise bitmap compression introduces the concept of a mixed fill, where fills and literals can be represented in a single word. The author of the original Concise paper claims that Concise outperforms WAH by reducing the size of the compressed bitmaps by up to 50%. For mixed-fill sequences, the first 2 bits indicate the type of fill (0 or 1), and the next 5 bits indicate the position of a bit that flips from 0 to 1 (or vice versa). An example of the Concise representation of the integer set {3, 5, 31-93, 1024, 1028, 1040187422} is shown below:

1) [1]0…101000

2) [01][00000]0…01

3) [00][00001]0…11101

4) [1]0…100010

5) [00][00000]1…1011101

6) [1]10…0
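For readers who want to experiment, the author of the Concise paper publishes a reference Java implementation in the extendedset library. The sketch below builds the example set with it; the class and method names are from that library and may differ across versions:

```java
import it.uniroma3.mat.extendedset.intset.ConciseSet;

// Building the integer set from the example above with the reference
// Concise implementation. A sketch; API details may vary by version.
public class ConciseExample {
    public static void main(String[] args) {
        ConciseSet set = new ConciseSet();
        set.add(3);
        set.add(5);
        for (int i = 31; i <= 93; i++) {
            set.add(i);
        }
        set.add(1024);
        set.add(1028);
        set.add(1040187422);

        // the 68 integers above are represented by the six words shown
        System.out.println(set.size()); // 68
    }
}
```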

Efficiency at Scale

Although Concise compression can greatly reduce the size of resulting bitmaps, we still have the problem of performing efficient Boolean operations on top of a large number of Concise sets. Luckily, Concise sets share a very important property with other bitmap compression schemes: they can be operated on in their compressed form.  The Boolean operations we care about are AND, OR, and NOT. The NOT operation is the most straightforward to implement. Literals are directly complemented and fills of zeros and ones are inverted. ANDing and ORing sets prove to be more challenging.
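As a toy illustration, a word-level NOT under the layout from the earlier example might look like this (a sketch, not Concise's actual code):

```java
// Toy word-level NOT (a sketch, not Concise's actual code): literals keep
// their marker bit and complement their 31 payload bits; fill words keep
// their length and just flip the fill-type bit.
public class NotSketch {
    static int not(int word) {
        if ((word >>> 31) == 1) {
            return word ^ 0x7FFFFFFF; // literal: complement the payload
        }
        return word ^ 0x40000000;     // fill: 0-fill <-> 1-fill
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(not(0x80000001))); // fffffffe
        System.out.println(Integer.toHexString(not(0x00000002))); // 40000002
    }
}
```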

Consider ORing two sets where one set is a long sequence of ones and the other set contains a shorter sequence of ones, a sequence of zeros, and some literals. If the sequence of ones in the first set is sufficiently long to encompass the second set, we don’t need to care about the second set at all (yay for Boolean logic!). Hence, when ORing sets, sequences of ones always have priority over sequences of zeros and literals. Similarly, a sequence of zeros can be ignored; the sequence contributes nothing to the overall Boolean logic. Extending this idea further with n sets, we can examine the first starting word of every set and determine if a one fill exists. If so, we can find the longest one fill and advance all the sets forward past this one fill. At this new stopping point, we repeat the search for the longest one fill. If no such fill exists, we search for all literals at that position, OR the literals together, and continue advancing forward. The same idea applies for ANDing sets, except now sequences of zeros have the highest priority. This pseudo-algorithm, combined with some additional logic to address mixed fills, was found to be sufficient to address our performance requirements.
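Below is a toy sketch of that n-way OR over a simplified run-length form (not Concise's real word layout, and ignoring mixed fills): one fills win, zero fills contribute nothing, and literals are ORed block by block.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the n-way OR described above. Each set is a list of runs,
// each run spanning some number of 31-bit blocks that are all ones, all
// zeros, or a single literal block.
public class NWayOrSketch {

    enum Kind { ONES, ZEROS, LITERAL }

    // literal runs always span exactly one block
    record Run(Kind kind, int blocks, int literal) {}

    // Cursor over one compressed set, tracking a position in blocks.
    static class Cursor {
        final List<Run> runs;
        int runIdx = 0;  // current run
        int offset = 0;  // blocks consumed within the current run

        Cursor(List<Run> runs) { this.runs = runs; }

        boolean done()  { return runIdx >= runs.size(); }
        Run run()       { return runs.get(runIdx); }
        int remaining() { return run().blocks() - offset; }

        void skip(int blocks) { // advance the cursor by `blocks` blocks
            while (blocks > 0 && !done()) {
                int step = Math.min(blocks, remaining());
                offset += step;
                blocks -= step;
                if (offset == run().blocks()) { runIdx++; offset = 0; }
            }
        }
    }

    // OR n compressed sets, emitting one uncompressed 31-bit block per word.
    static List<Integer> or(List<List<Run>> sets) {
        List<Cursor> cursors = new ArrayList<>();
        for (List<Run> s : sets) cursors.add(new Cursor(s));
        List<Integer> result = new ArrayList<>();

        while (cursors.stream().anyMatch(c -> !c.done())) {
            // One fills have priority: find the longest one fill here.
            int longestOnes = 0;
            for (Cursor c : cursors)
                if (!c.done() && c.run().kind() == Kind.ONES)
                    longestOnes = Math.max(longestOnes, c.remaining());

            if (longestOnes > 0) {
                // Everything under this one fill is ones; skip all sets past it.
                for (int i = 0; i < longestOnes; i++) result.add(0x7FFFFFFF);
                for (Cursor c : cursors) c.skip(longestOnes);
            } else {
                // No one fill: OR the literals; zero fills contribute nothing.
                // (A real implementation would also skip long zero fills in bulk.)
                int word = 0;
                for (Cursor c : cursors)
                    if (!c.done() && c.run().kind() == Kind.LITERAL)
                        word |= c.run().literal();
                result.add(word);
                for (Cursor c : cursors) c.skip(1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Run> a = List.of(new Run(Kind.ONES, 3, 0));        // 93 ones
        List<Run> b = List.of(new Run(Kind.LITERAL, 1, 0b101),  // one literal,
                              new Run(Kind.ZEROS, 2, 0));       // then zeros
        for (int word : or(List.of(a, b)))
            System.out.println(Integer.toBinaryString(word));   // three all-ones blocks
    }
}
```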

Results

The following results were generated on a cc2.8xlarge instance using a single thread, a 2 GB heap, a 512 MB young generation, and a forced GC between each run. The data set is a single day’s worth of data collected from the Twitter garden hose stream and contains 2,272,295 rows. The table below compares the sizes of Concise compressed sets and regular integer arrays across dimensions.

Dimension          Cardinality   Concise compressed size (bytes)   Integer array size (bytes)   Concise size as % of integer array size
Has_mention                  2                           586,400                    9,089,180                                  6.451627
Has_links                    2                           580,872                    9,089,180                                  6.390808
Has_geo                      2                           144,004                    9,089,180                                  1.584345
Is_retweet                   2                           584,592                    9,089,180                                  6.431735
Is_viral                     2                           358,380                    9,089,180                                  3.942930
User_lang                   21                         1,414,000                    9,089,180                                 15.556959
User_time_zone             142                         3,876,244                    9,089,180                                 42.646795
URL_domain              31,165                         1,562,428                    9,089,180                                 17.189978
First_hashtag          100,728                         1,837,144                    9,089,180                                 20.212428
Rt_name                182,704                         2,235,288                    9,089,180                                 24.592846
Reply_to_name          620,421                         5,673,504                    9,089,180                                 62.420416
User_location          637,774                         9,511,844                    9,089,180                                104.650188
User_mention_name      923,842                         9,086,416                    9,089,180                                 99.969590
User_name            1,784,369                        16,000,028                    9,089,180                                176.033790

 

Total Concise compressed size = 53,451,144 bytes

Total integer array size = 127,248,520 bytes

Overall, the Concise compressed sets occupy about 42% of the space of the corresponding integer arrays (53,451,144 / 127,248,520 ≈ 42.005317%).

We also re-sorted the rows of the data set to maximize compression and measured how the results were affected.

Dimension          Cardinality   Concise compressed size (bytes)   Integer array size (bytes)   Concise size as % of integer array size
Has_mention                  2                               744                    9,089,180                                  0.008186
Has_links                    2                             1,504                    9,089,180                                  0.016547
Has_geo                      2                             2,840                    9,089,180                                  0.031246
Is_retweet                   2                             1,616                    9,089,180                                  0.017779
Is_viral                     2                             1,488                    9,089,180                                  0.016371
User_lang                   21                            38,416                    9,089,180                                  0.422656
User_time_zone             142                           319,644                    9,089,180                                  3.516753
URL_domain              31,165                           700,752                    9,089,180                                  7.709738
First_hashtag          100,728                         1,505,292                    9,089,180                                 16.561362
Rt_name                182,704                         1,874,180                    9,089,180                                 20.619902
Reply_to_name          620,421                         5,404,108                    9,089,180                                 59.456497
User_location          637,774                         9,091,016                    9,089,180                                100.075340
User_mention_name      923,842                         8,686,384                    9,089,180                                 95.568401
User_name            1,784,369                        16,204,900                    9,089,180                                178.287810

 

Total Concise compressed size = 43,832,884 bytes

Total integer array size = 127,248,520 bytes

What is interesting to note is that sorting improves the overall compression only modestly: the total Concise set size drops to 34.448031% of the total integer array size, from about 42% before sorting.

To understand the performance implications of using Concise sets versus integer arrays, we chose several dimensions from our data set with varying cardinalities and generated Concise sets for every value of each selected dimension. The histograms below indicate the size distribution of the generated Concise sets for a given dimension. Each test run randomly picked a given number of Concise sets and performed Boolean operations with them. Integer array representations of these Concise sets were then created and the same Boolean operations were run on the integer arrays. There were 100 runs per test case, and the average run time required to perform a Boolean operation is shown for each dimension in the first of two tables below. The second table shows the performance results when the sets used in the Boolean operation always include the largest set (by size) of the dimension.
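For reference, the shape of the measurement loop was roughly as follows (a hypothetical harness for illustration, not our actual benchmark code; union() is from the extendedset library and operates directly on the compressed form):

```java
import it.uniroma3.mat.extendedset.intset.ConciseSet;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Rough sketch of one timing run: build some random Concise sets, then
// time an n-way OR performed entirely in compressed form via union().
public class OrBenchmarkSketch {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int numSets = 100, numRows = 2_272_295;

        List<ConciseSet> sets = new ArrayList<>();
        for (int i = 0; i < numSets; i++) {
            ConciseSet s = new ConciseSet();
            for (int j = 0; j < 10_000; j++)
                s.add(rng.nextInt(numRows)); // random row offsets
            sets.add(s);
        }

        long start = System.nanoTime();
        ConciseSet acc = new ConciseSet();
        for (ConciseSet s : sets)
            acc = acc.union(s);              // stays compressed throughout
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("OR of " + numSets + " sets took " + elapsedMs
            + " ms, result cardinality " + acc.size());
    }
}
```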

Dimension: User_time_zone

Cardinality: 142

[Histogram: size distribution of the generated Concise sets for the user_time_zone dimension]

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                31                      20                  1                        0
             25                66                      53                  2                        0
             50               159                     153                  4                        0
            100               339                     322                  7                        0

Always including the largest Concise set of the dimension:

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                44                      77                  1                        0
             25                92                     141                  2                        0
             50               184                     223                  4                        0
            100               398                     419                  8                        0

 

Dimension: URL_domain

Cardinality: 31,165

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                 0                       0                  0                        0
             25                 0                       2                  0                        0
             50                 0                       0                  0                        0
            100                 0                       0                  0                        0
          1,000                 8                      24                  0                        1
          5,000                54                     132                  3                       57
         10,000               111                     286                  8                      284
         25,000               348                     779                 22                    1,925

 

Always including the largest Concise set of the dimension:

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                14                     172                  0                        0
             25                17                     242                  0                        0
             50                19                     298                  0                        0
            100                22                     356                  0                        0
          1,000                35                     569                  1                        1
          5,000                89                     865                  4                       59
         10,000               158                   1,050                  9                      289
         25,000               382                   1,618                 21                    1,949

 

Dimension: RT_name

Cardinality: 182,704

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                 0                       0                  0                        0
             25                 0                       0                  0                        0
             50                 0                       0                  0                        0
            100                 0                       0                  0                        0
          1,000                 1                       0                  0                        1
          5,000                11                      31                  3                       57
         10,000                25                      68                  7                      284
         25,000                98                     118                 20                    1,925
         50,000               224                     292
        100,000               521                     727

Note: for AND operations on 50,000+ sets, our implementation of the array-based approach threw StackOverflow exceptions. Rather than change the implementation to one that didn’t, we simply stopped the comparisons at that point.

Always including the largest Concise set of the dimension:

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                14                     168                  0                        0
             25                16                     236                  0                        0
             50                18                     289                  0                        0
            100                20                     348                  0                        0
          1,000                29                     551                  1                        1
          5,000                44                     712                  4                       59
         10,000                69                     817                  8                      289
         25,000               161                     986                 20                    1,949
         50,000               303                   1,182

 

Dimension: User_location

Cardinality: 637,774

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                 0                       0                  0                        0
             25                 0                       0                  0                        0
             50                 0                       0                  0                        0
            100                 0                       0                  0                        0
          1,000                 2                       0                  0                        1
          5,000                15                       7                  3                       57
         10,000                34                      16                  8                      284
         25,000               138                      54                 21                    1,927
         50,000               298                     128
        100,000               650                     271
        250,000             1,695                     881
        500,000             3,433                   2,311

 

Always including the largest Concise set of the dimension:

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                14                      47                  0                        0
             25                16                      67                  0                        0
             50                18                      80                  0                        0
            100                20                      97                  0                        0
          1,000                29                     153                  1                        1
          5,000                48                     206                  4                       59
         10,000                81                     233                  9                      290
         25,000               190                     294                 21                    1,958
         50,000               359                     378

 

Dimension: User_name

Cardinality: 1,784,369

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                 0                       0                  0                        0
             25                 0                       0                  0                        0
             50                 0                       0                  0                        0
            100                 0                       0                  0                        0
          1,000                 1                       0                  0                        1
          5,000                 7                       2                  3                       57
         10,000                17                       6                  7                      283
         25,000                74                      19                 21                    1,928
         50,000               177                      45
        100,000               440                     108
        250,000             1,225                     379
        500,000             2,504                     978
      1,000,000             5,076                   2,460
      1,250,000             6,331                   3,265
      1,500,000             7,622                   4,036
      1,750,000             8,911                   4,982

 

Always including the largest Concise set of the dimension:

Filter elements   Concise OR (ms)   Integer array OR (ms)   Concise AND (ms)   Integer array AND (ms)
             10                 0                       0                  0                        0
             25                 0                       0                  0                        0
             50                 0                       0                  0                        0
            100                 0                       0                  0                        0
          1,000                 1                       0                  0                        1
          5,000                 8                       2                  3                       59
         10,000                22                       6                  7                      289
         25,000                77                      19                 21                    1,954
         50,000               196                      45

 
