Maximum Performance with Minimum Storage: Data Compression in Druid

September 21st, 2012 Fangjin Yang

The Metamarkets solution allows for arbitrary exploration of massive data sets. Powered by Druid, our in-house distributed data store and processor, users can filter time series and top list queries based on Boolean expressions of dimension values. Given that some of our dataset dimensions contain millions of unique values, the subset of things that may match a particular filter expression may be quite large. To design for these challenges, we needed a fast and accurate (not a fast and approximate) solution, and we once again found ourselves buried under a stack of papers, looking for an answer.

From Justin Bieber to Ones and Zeros

To better understand how Druid stores dimension values, consider the following data set:

Timestamp

Publisher

Advertiser

Gender

Country

Impressions

Clicks

Revenue

2011-01-01T01:00:00Z

bieberfever.com

google.com

Male

USA

15.70

2011-01-01T01:00:00Z

bieberfever.com

google.com

Male

USA

29.18

2011-01-01T02:00:00Z

ultratrimfast.com

google.com

Male

USA

17.31

2011-01-01T02:00:00Z

ultratrimfast.com

google.com

Male

USA

34.01

Consider the publisher dimension (column) in the table above. For each unique publisher, we can form some representation indicating in which table rows a particular publisher is seen. We can store this information in a binary array where the array indices represent our rows. If a particular publisher is seen in a certain row, that array index is marked as ‘1’. For example:

Bieberfever.com -> [1, 2] -> [1][1][0][0]

Ultratrimfast.com -> [3, 4] -> [0][0][1][1]

In the example above bieberfever.com is seen in rows 1 and 2. This mapping of dimension values to row indices forms an inverted index and is in fact how we store dimension information in Druid. If we want to know which rows contain bieberfever.com OR ultratrimfast.com, we can OR together the bieberfever.com and ultratrimfast.com arrays.

[0][1][0][1] OR [1][0][1][0] = [1][1][1][1]

This idea forms the basis of how to perform Boolean operations on large bitmap sets. A challenge still remains in that if each array consisted of millions or billions of entries and if we had to OR together millions of such arrays, performance can potentially become a major issue. Thankfully for us, most bitmap indices are either very sparse or very dense, which is something that can be leveraged for compression.

Bit arrays, or bitmaps, are frequently employed in areas such as data warehousing and data mining to significantly reduce storage costs. Bitmap compression algorithms are a well-defined area of research and often utilize run-length encoding. Well known algorithms include Byte-aligned Bitmap Code, Word-Aligned Hybrid (WAH) code, and Partitioned Word-Aligned Hybrid (PWAH) compression.

A Concise Solution

Most word-aligned run-length encoding algorithms represent long sequences of ones and zeros in a single word. The word contains the length of the sequence and some information about whether it is a one fill or a zero fill. Sequences that contain a mixture of 0 and 1 bits are stored in 32 bit blocks known as literals. An example of word-aligned hybrid compression is shown below:

Given a bitstream: [10110…1][000…010][010…011]

There are three separate 32 bit sequences in the bitstream.

1) [1]0110…1 – 31 “dirty” bits (a literal)

2) [00]0…010 – 31 x 2 zeros (a sequence of zeros)

3) [01]0…011 – 31 x 3 ones (a sequences of ones)

Concise bitmap compression introduces the concept of a mixed fill, where fills and literals can be represented in a single word. The author of the original Concise paper claims that Concise outperforms WAH by reducing the size of the compressed bitmaps by up to 50%. For mixed fill sequences, the first 2 bits indicate the type of fill (0 or 1). The next 5 bits can be used to indicate the position where bits flip from 0 to 1 or vice versa. An example of the Concise representation for the integer set {3, 5, 31-93, 1,024, 1,028, 1,040,187,422} is shown below:

1) [1]0…101000

2) [01][00000]0…01

3) [00][00001]0…11101

4) [1]0…100010

5)[00][00000]1…1011101

6)[1]10…0

Efficiency at Scale

Although Concise compression can greatly reduce the size of resulting bitmaps, we still have the problem of performing efficient Boolean operations on top of a large number of Concise sets. Luckily, Concise sets share a very important property with other bitmap compression schemes: they can be operated on in their compressed form. The Boolean operations we care about are AND, OR, and NOT. The NOT operation is the most straightforward to implement. Literals are directly complemented and fills of zeros and ones are inverted. ANDing and ORing sets prove to be more challenging.

Consider ORing two sets where one set is a long sequence of ones and the other set contains a shorter sequence of ones, a sequence of zeros, and some literals. If the sequence of ones in the first set is sufficiently long enough to encompass the second set, we don’t need to care about the second set at all (yay for Boolean logic!). Hence, when ORing sets, sequences of ones always have priority over sequences of zeros and literals. Similarly, a sequence of zeros can be ignored; the sequence contributes nothing to the overall Boolean logic. Extending this idea further with n sets, we can examine the first starting word of every set and determine if a one fill exists. If so, we can find the longest one fill and advance all the sets forward past this one fill. At this new stopping point, we repeat the search for the longest one fill. If no such fill exists, we search for all literals at that position, OR the literals together, and continue advancing forward. The same idea applies for ANDing sets, except now sequences of zeros have the highest priority. This pseudo-algorithm, combined with some additional logic to address mixed fills, was found to be sufficient to address our performance requirements.

Results

The following results were generated on a cc2.8xlarge system with a single thread, 2G heap, 512m young gen, and a forced GC between each run. The data set is a single day’s worth of data collected from the Twitter garden hose data stream. The data set contains 2, 272, 295 rows. The table below demonstrates a size comparison between Concise compressed sets and regular integer arrays for different dimensions.

Dimension	Cardinality	Concise compressed size (bytes)	Integer array size (bytes)	Concise size as a % of integer array size
Has_mention	2	586,400	9,089,180	6.451627
Has_links	2	580,872	9,089,180	6.390808
Has_geo	2	144,004	9,089,180	1.584345
Is_retweet	2	584,592	9,089,180	6.431735
Is_viral	2	358,380	9,089,180	3.942930
User_lang	21	1,414,000	9,089,180	15.556959
User_time_zone	142	3,876,244	9,089,180	42.646795
URL_domain	31,165	1,562,428	9,089,180	17.189978
First_hashtag	100,728	1,837,144	9,089,180	20.212428
Rt_name	182,704	2,235,288	9,089,180	24.592846
Reply_to_name	620,421	5,673,504	9,089,180	62.420416
User_location	637,774	9,511,844	9,089,180	104.650188
User_mention_name	923,842	9,086,416	9,089,180	99.969590
User_name	1,784,369	16,000,028	9,089,180	176.033790

Total concise compressed size = 53, 451, 144 bytes

Total integer array size = 127, 248, 520 bytes

Overall, Concise compressed sets are about 42.005317% less than integer arrays.

We also resorted the rows of the data set to maximize compression to see how the results would be affected.

Dimension	Cardinality	Concise compressed size (bytes)	Integer array size (bytes)	Concise size as a % of integer array size
Has_mention	2	744	9,089,180	0.008186
Has_links	2	1,504	9,089,180	0.016547
Has_geo	2	2,840	9,089,180	0.031246
Is_retweet	2	1,616	9,089,180	0.017779
Is_viral	2	1,488	9,089,180	0.016371
User_lang	21	38,416	9,089,180	0.422656
User_time_zone	142	319,644	9,089,180	3.516753
URL_domain	31,165	700,752	9,089,180	7.709738
First_hashtag	100,728	1,505,292	9,089,180	16.561362
Rt_name	182,704	1,874,180	9,089,180	20.619902
Reply_to_name	620,421	5,404,108	9,089,180	59.456497
User_location	637,774	9,091,016	9,089,180	100.075340
User_mention_name	923,842	8,686,384	9,089,180	95.568401
User_name	1,784,369	16,204,900	9,089,180	178.287810

Total concise compressed size = 43,832,884 bytes

Total integer array size = 127, 248, 520 bytes

What is interesting to note is that after sorting, global compression only increased minimally. The total Concise set size to total integer array size is 34.448031%.

To understand the performance implications of using Concise sets versus integer arrays, we choose several dimensions from our data set with varying cardinalities and generated Concise sets for every dimension value of every selected dimension. The histograms below indicate the size distribution of the generated Concise sets for a given dimension. Each test run randomly picked a given number of Concise sets and performed Boolean operations with them. Integer array representations of these Concise sets were then created and the same Boolean operations were run on the integer arrays. There were 100 runs per test case and the average run time required to perform a Boolean operation is shown for each dimension in the first of two tables below. The second table shows the performance results when the sets used in the Boolean operation alway include the largest (size) set of the dimension.

Dimension: User_time_zone

Cardinality: 142

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	31	20	1	0
25	66	53	2	0
50	159	153	4	0
100	339	322	7	0

Always including the largest Concise set of the dimension:

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	44	77	1	0
25	92	141	2	0
50	184	223	4	0
100	398	419	8	0

Dimension: URL_domain

Cardinality: 31,165

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	0	0	0	0
25	0	2	0	0
50	0	0	0	0
100	0	0	0	0
1,000	8	24	0	1
5,000	54	132	3	57
10,000	111	286	8	284
25,000	348	779	22	1,925

Always including the largest Concise set of the dimension:

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	14	172	0	0
25	17	242	0	0
50	19	298	0	0
100	22	356	0	0
1,000	35	569	1	1
5,000	89	865	4	59
10,000	158	1,050	9	289
25,000	382	1,618	21	1,949

Dimension: RT_name

Cardinality: 182,704

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	0	0	0	0
25	0	0	0	0
50	0	0	0	0
100	0	0	0	0
1,000	1	0	0	1
5,000	11	31	3	57
10,000	25	68	7	284
25,000	98	118	20	1,925
50,000	224	292	–	–
100,000	521	727	–	–

Note: for AND operations on 50,000+ items, our implementation of the array based approach produced StackOverflow exceptions. Instead of changing the implementation to something that didn’t, we just decided not to do comparisons beyond that point.

Always including the largest Concise set of the dimension:

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	14	168	0	0
25	16	236	0	0
50	18	289	0	0
100	20	348	0	0
1,000	29	551	1	1
5,000	44	712	4	59
10,000	69	817	8	289
25,000	161	986	20	1,949
50,000	303	1,182	–	–

Dimension: User_location

Cardinality: 637,774

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	0	0	0	0
25	0	0	0	0
50	0	0	0	0
100	0	0	0	0
1,000	2	0	0	1
5,000	15	7	3	57
10,000	34	16	8	284
25,000	138	54	21	1,927
50,000	298	128	–	–
100,000	650	271	–	–
250,000	1,695	881	–	–
500,000	3,433	2,311	–	–

Always including the largest Concise set of the dimension:

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	14	47	0	0
25	16	67	0	0
50	18	80	0	0
100	20	97	0	0
1,000	29	153	1	1
5,000	48	206	4	59
10,000	81	233	9	290
25,000	190	294	21	1,958
50,000	359	378	–	–

Dimension: User_name

Cardinality: 1,784,369

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	0	0	0	0
25	0	0	0	0
50	0	0	0	0
100	0	0	0	0
1,000	1	0	0	1
5,000	7	2	3	57
10,000	17	6	7	283
25,000	74	19	21	1,928
50,000	177	45	–	–
100,000	440	108	–	–
250,000	1,225	379	–	–
500,000	2,504	978	–	–
1,000,000	5,076	2,460	–	–
1,250,000	6,331	3,265	–	–
1,500,000	7,622	4,036	–	–
1,750,000	8,911	4,982	–	–

Always including the largest Concise set of the dimension:

Number of filter elements	OR operation with Concise set (ms)	OR operation with integer arrays (ms)	AND operations with Concise set (ms)	AND operation with integer arrays (ms)
10	0	0	0	0
25	0	0	0	0
50	0	0	0	0
100	0	0	0	0
1,000	1	0	0	1
5,000	8	2	3	59
10,000	22	6	7	289
25,000	77	19	21	1,954
50,000	196	45	–	–

Important Note!

Maximum Performance with Minimum Storage: Data Compression in Druid

September 21st, 2012 Fangjin Yang

Looking for more Druid information? Learn more about our core technology.

Filed in Algorithms, Druid, Technology Tagged Druid