Data is the fuel and the exhaust of programmatic advertising. It informs every transaction, and every transaction generates more of it. As impression volumes rise into the trillions across all manner of devices, the focus of many ad tech engineering teams isn't on ethereal machine learning algorithms, but on something far less glamorous.

The process is called ETL — the critical, painstaking work of cleansing and consolidating disparate datasets. As the worlds of marketing and enterprise software collide, ETL could be the most important acronym you’ve never heard of.

ETL stands for extract, transform and load — and it’s a truism among data scientists that it takes up about 80% of our time, leaving just 20% for analysis. Having built big data platforms in pharma, banking and now in digital media, I believe this ratio is near universal.

Underinvestment in and misunderstanding of ETL is single-handedly responsible for a huge amount of organizational pain and inefficiency. It’s why data is so often delayed, why so many executives are unhappy with the quality of reporting and why more than 50% of corporate business intelligence initiatives fail.

ETL is hard because data is messy. There is no such thing as clean data, and even the most common attributes have a dizzying array of acceptable formats: "Sat Jan 22 10:37:13 PST," "2014-01-22T18:37:13.0+0000" and "1323599850" all denote the same time. Add to this a growing variety of data, such as geocoordinates, buyer names, seller URLs, device IDs, campaign strings, country codes and currencies. Each new source adds a layer of bricks to our collective tower of Babel.
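A small Python sketch of what this normalization work looks like in practice; the handled formats and the helper name are illustrative, and real pipelines see many more variants:

```python
from datetime import datetime, timezone

def normalize_timestamp(raw):
    """Coerce a few common timestamp encodings to a UTC datetime.

    The handled formats are illustrative; real pipelines see many more.
    """
    raw = raw.strip()
    if raw.isdigit():                       # epoch seconds, e.g. "1323599850"
        return datetime.fromtimestamp(int(raw), tz=timezone.utc)
    for fmt in ("%Y-%m-%dT%H:%M:%S.%f%z",   # e.g. "2014-01-22T18:37:13.0+0000"
                "%Y-%m-%d %H:%M:%S"):
        try:
            parsed = datetime.strptime(raw, fmt)
            return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError("unrecognized timestamp format: %r" % raw)
```

Every new source means another branch in functions like this one, which is exactly why the 80% figure rings true.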

It's no wonder that an agency CIO recently confessed to me that he spends tens of millions of dollars a year on the reliable, repeatable transformation of data. I have spent much of my career wrestling ETL's demons; here are five ways to keep them at bay:

Journalists know that when it comes to getting the facts, it's best to go directly to the primary source and it's best to break news first. The same is true for ETL. The closer you are to the data source, the fewer the transformations and steps involved, and the lower the likelihood that something will break. The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands. Also, the closer you are to the source, the faster you can optimize your approach, which in this space can pay huge dividends.

Just like food, data is best when it's minimally processed. To handle huge quantities of data, one common approach in ETL pipelines is to downsample indiscriminately. Many programmatic buyers will examine, for example, a 1% feed of bid requests coming off of a particular marketplace.

In an era when bandwidth is cheap and computing resources are vast, sampling data is a throwback to the punchcard era — and worse, it waters down insights. Audience metrics like frequency and reach can become impossible to recover once a data stream has been put through the shredder. Sampling is why audience segments can resemble sausage — no one knows what’s inside.

In the early days of the railroads, as many as a dozen distinct track gauges, ranging in width between the inside rails from 2 feet to nearly 10, had proliferated across North America, Europe, Africa and Asia. Owing to the difficulties of non-interoperable trains and carriages, and of continuous transport across regions, a standard width was eventually adopted at the suggestion of a British civil engineer named George Stephenson. Today, approximately 60% of the world's lines use this gauge.

Our programmatic vertical has its own George Stephensons: CTOs and chief scientists like Jim Butler and Neal Richter, whom you can find late at night debating specifications for OpenRTB protocols on developer lists. Just as with the railroads two centuries before, embracing and enforcing standards will catalyze faster growth in programmatic advertising through increased interoperability.

Too many organizations, upon recognizing that they've got data challenges, decide to undertake a grand data-unification project. Noble in their intentions and cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform. The implicit assumption is that "once we have all the data, we can answer any question we'd like." This approach is doomed to fail because there is always more data than one realizes, and the choices around what data to collect and how to structure it can only be made by putting business questions first.

ETL is hard, and building pipelines laborious, so avoid building bridges to places that no business inquiry will ever visit.

While for some organizational processes there's no avoiding working with the nuts and bolts of data, for others it may be possible to get out of the data handling business entirely. Take, for example, the handling of email or digital documents: For years, IT departments suffered through the management and occasional migration of these assets. Today, however, cloud offerings, such as those from Google and Box, make this someone else's problem, freeing up our businesses to specialize in what we do best.

*Follow Mike Driscoll (**@medriscoll**), Metamarkets* *(**@metamarkets**) and AdExchanger (**@adexchanger**) on Twitter.*

Approximation algorithms are rapidly gaining traction as the preferred way to determine the unique number of elements in high cardinality sets. In the space of cardinality estimation algorithms, HyperLogLog has quickly emerged as the de facto standard. Widely discussed by technology companies and popular blogs, HyperLogLog trades accuracy in data and query results for massive reductions in data storage and vastly improved system performance.

In our previous investigation of HyperLogLog, we briefly discussed our motivations for using approximate algorithms and how we leveraged HyperLogLog in Druid, Metamarkets' open source, distributed data store. Since implementing and deploying HyperLogLog last year, we've made several optimizations to further improve performance and reduce storage cost. This post shares some of those optimizations. It assumes you are already familiar with how HyperLogLog works; if you are not, there are plenty of resources online.

In our initial implementation of HLL, we allocated 8 bits of memory for each register. Recall that each value stored in a register indicates the position of the first '1' bit of a hashed input. Given that 2^255 ~== 10^76, a single 8-bit register could approximate (though not well) a cardinality close to the number of atoms in the entire observable universe. Martin Traverso et al. of Facebook's Presto realized that this was a bit wasteful and proposed an optimization, exploiting the fact that the registers increment in near lockstep.
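As a toy illustration of what a register stores, here is a Python sketch (the sizes and names are ours, for illustration, not the production implementation): the top bits of a hash pick a register, and the register keeps the maximum observed position of the first '1' bit among the remaining hash bits:

```python
import hashlib

NUM_REGISTERS = 8   # toy size; our production sketches use 2,048
INDEX_BITS = 3      # log2(NUM_REGISTERS)

def register_update(value, registers):
    """Route a hashed value to a register and keep the maximum observed
    position of the first '1' bit in the remaining hash bits."""
    h = int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")
    idx = h >> (64 - INDEX_BITS)                 # top bits pick the register
    rest = (h << INDEX_BITS) & ((1 << 64) - 1)   # bits used for the run length
    rank = 1
    while rank < 64 and not (rest >> (64 - rank)) & 1:
        rank += 1
    registers[idx] = max(registers[idx], rank)

registers = [0] * NUM_REGISTERS
for i in range(1000):
    register_update("user-%d" % i, registers)
```

After streaming 1,000 toy uniques, every register holds a small run-length value, far below the 255 that 8 bits can represent, which is the waste the Presto folks noticed.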

Each register is initialized to 0, so with 0 uniques there is no change in any of the registers. Now say we have 8 registers: with 8*2^10 uniques, each register will hold a value of ~10. Of course, there will be some variance, which can be calculated exactly if one were so inclined, given that the distribution in each register is an independent maximum of Negative Binomial (1, .5) draws.

With 4-bit registers, each register can only approximate up to 2^15 = 32,768 uniques. In fact, the reality is worse, because higher values cannot be represented and are lost, impacting accuracy. Even with 2,048 registers, we can't do much better than ~60M, which is one or two orders of magnitude lower than what we need.

Since the register values tend to increase together, the FB folks decided to introduce an offset counter and only store positive differences from it in the registers. That is, if we have register values of 8, 7, and 9, this corresponds to having an offset of 7 and using register difference values of 1, 0, and 2. Given the smallish spread that we expect to see, we typically won't observe a difference of more than 15 among register values. So we feel comfortable using 2,048 4-bit registers with an 8-bit offset, for 1,025 bytes of storage, versus 2,048 bytes with no offset and 8-bit registers.
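A Python sketch of the offset idea (a simplification of the Presto-style encoding; the function names are ours): the minimum register value becomes the offset, and each register stores a 4-bit difference, clipped at 15, with two differences packed per byte:

```python
def pack_registers(registers):
    """Encode register values as an offset plus 4-bit deltas, clipping
    deltas above 15 (a sketch of the Presto-style optimization).
    Assumes an even number of registers."""
    offset = min(registers)
    deltas = [min(r - offset, 15) for r in registers]
    packed = bytearray()
    for i in range(0, len(deltas), 2):           # two 4-bit values per byte
        packed.append((deltas[i] << 4) | deltas[i + 1])
    return offset, bytes(packed)

def unpack_registers(offset, packed):
    """Recover the (possibly clipped) register values."""
    out = []
    for b in packed:
        out.append(offset + (b >> 4))
        out.append(offset + (b & 0xF))
    return out
```

Packing [8, 7, 9, 7] yields offset 7 with deltas [1, 0, 2, 0] in two bytes; scaled up to 2,048 registers, that is 1,024 bytes of deltas plus one offset byte.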

In fact, others have commented on the concentrated distribution of the register values as well. In her thesis, Marianne Durand suggested using a variable bit prefix encoding. Researchers at Google have had success with difference encodings and variable length encodings.

This optimization has served us well, with no appreciable loss in accuracy when streaming many uniques into a single HLL object, because the offset increments when all the registers get hit. Similarly, we can combine many HLL objects of moderate size together and watch the offsets increase. However, a curious phenomenon occurs when we try to combine many “small” HLL objects together.

Suppose each HLL object stores a single unique value. Then its offset will be 0, one register will have a value between 1 and 15, and the remaining registers will be 0. No matter how many of these we combine together, our aggregate HLL object will never be able to exceed a value of 15 in each register with a 0 offset, which is equivalent to an offset of 15 with 0's in each register. Using 2,048 registers, this means we won't be able to produce estimates greater than ~ .7 * 2048^2 * 1 / (2048 / 2^15) ~ 47M. (*Flajolet et al. 2007*)

Not good, because this means our estimates are capped at 10^7 instead of 10^80, irrespective of the number of true uniques. And this isn't just some pathological edge case: its untimely appearance in production a while back was no fun to fix.

The root problem in the above scenario is that the high values (> 15) are being clipped, with no hope of making it into a “small” HLL object, since the offset is 0. Although they are rare, many cumulative misses can have a noticeably large effect. Our solution involves storing one additional pair, a “floating max” bucket with higher resolution. Previously, a value of 20 in bucket 94 would be clipped to 15. Now, we store (20, 94) as the floating max, requiring at most an additional 2 bytes, bringing our total up to 1027 bytes. With enough small HLL objects so that each position is covered by a floating max, the combined HLL object can exceed the previous limit of 15 in each position. It also turns out that just one floating max is sufficient to largely fix the problem.
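To illustrate, here is a simplified Python sketch of how a floating max survives a merge of clipped sketches; it fixes all offsets at 0 and represents each sketch as a dict with a hypothetical `(value, index)` max pair, which is a simplification of our actual layout:

```python
CLIP = 15   # largest value a 4-bit register can hold

def merge_with_floating_max(sketches):
    """Merge clipped sketches, carrying one (value, index) 'floating max'
    pair so a value above the clip threshold isn't lost entirely.
    Simplified: offsets are assumed to be 0."""
    n = len(sketches[0]["registers"])
    merged = {"registers": [0] * n, "max": (0, 0)}
    for s in sketches:
        for i, v in enumerate(s["registers"]):
            merged["registers"][i] = max(merged["registers"][i], v)
        if s["max"][0] > merged["max"][0]:
            merged["max"] = s["max"]
    # apply the floating max to its register, exceeding the clip if needed
    val, idx = merged["max"]
    merged["registers"][idx] = max(merged["registers"][idx], val)
    return merged
```

A register that would have been clipped to 15 can now report its true value of, say, 20 after the merge, which is enough to unstick the estimate.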

Let's take a look at one measure of the accuracy of our approximations. We simulate 1,000 runs of streaming 1B uniques into an HLL object and look at the proportion of cases in which we observed clipping with the offset approximation (black) and with the addition of the floating max (red). So for 1e9 uniques, the floating max reduced clipping from 95%+ of runs to ~15%. That is, with the floating max, the much smaller HLL objects agreed with full HLL in 85% of cases, versus fewer than 5% of cases without it.

For the cost of only 2 bytes, the floating max register allowed us to union millions of HLL objects with minimal measurable loss in accuracy.

We first discussed the concept of representing HLL buckets in either a sparse or dense format in our first blog post. Since that time, Google has also written a great paper on the matter. Data undergoes a summarization process when it is ingested into Druid: it is unnecessarily expensive to store raw event data, so Druid instead rolls ingested data up to some time granularity.

In practice, we see tremendous reductions in data volume by summarizing our data. For a given summarized row, we can maintain HLL objects where each object represents the estimated number of unique elements for a column of that row.

When the summarization granularity is sufficiently small, only a limited number of unique elements may be seen for a dimension. In this case, a given HLL object may have registers that contain no values. The HLL registers are thus ‘sparsely’ populated.

Our normal storage representation of HLL stores 2 register values per byte. In the sparse representation, we instead store the explicit indexes of buckets that have valid values in them as (index, value) pairs. When the sparse representation exceeds the size of the normal or ‘dense’ representation (1027 bytes), we can switch to using only the dense representation. Our actual implementation uses a heuristic to determine when this switch occurs, but the idea is the same. In practice, many dimensions in real world data sets are of low cardinality, and this optimization can greatly reduce storage versus only storing the dense representation.
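The switchover logic can be sketched as follows; the per-pair byte cost below is illustrative (say, a 2-byte index plus a 1-byte value), not our exact serialization, and as noted above our real implementation uses a heuristic rather than this exact comparison:

```python
DENSE_BYTES = 1027   # 2,048 4-bit registers + offset byte + floating max pair
PAIR_BYTES = 3       # illustrative: 2-byte register index + 1-byte value

def serialized_size(nonzero_registers):
    """Pick the cheaper representation for a sketch with the given number
    of populated registers, returning (format, size_in_bytes)."""
    sparse = nonzero_registers * PAIR_BYTES
    if sparse < DENSE_BYTES:
        return "sparse", sparse
    return "dense", DENSE_BYTES
```

For a low-cardinality dimension with only a handful of populated registers, the sparse form is a tiny fraction of the 1,027-byte dense form.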

One of the simpler optimizations that we implemented for faster cardinality calculations was to use lookups for register values. Instead of computing the actual register value by summing the register offset with the stored register value, we instead perform a lookup into a precalculated map. Similarly, to determine the number of zeros in a register value, we created a secondary lookup table. Given the number of registers we have, the cost of storing these lookup tables is near trivial. This problem is often known as the Hamming Weight problem.
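One plausible reading of this optimization in Python (the table names are ours): precompute one lookup per possible register value, so the estimator's inner loop replaces floating-point exponentiation and zero tests with array indexing:

```python
MAX_REG = 256   # all possible 8-bit register values

# Precomputed lookup tables: 2^-v for the harmonic sum, and a zero flag
INVERSE_POW2 = [2.0 ** -v for v in range(MAX_REG)]
IS_ZERO = [1 if v == 0 else 0 for v in range(MAX_REG)]

def harmonic_sum_and_zeros(registers):
    """The inner loop of HLL estimation, done via table lookups instead of
    computing 2^-v per register."""
    s = 0.0
    zeros = 0
    for v in registers:
        s += INVERSE_POW2[v]
        zeros += IS_ZERO[v]
    return s, zeros
```

With only 256 possible register values, both tables together cost a few kilobytes, which is trivial next to the per-query savings.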

Many of our optimizations came out of necessity, both to provide the interactive query latencies that Druid users have come to expect, and to keep our storage costs reasonable. If you have any further improvements to our optimizations, please share them with us! We strongly believe that as data sets get increasingly larger, estimation algorithms are key to keeping query times acceptable. The approximate algorithm space remains relatively new, but it is something we can build together.

For more information on Druid, please visit www.druid.io and follow @druidio. We’d also like to thank Eric Tschetter and Xavier Leaute for their contributions to this work. Featured image courtesy of Donna L Martin.

Hadley introduced the crowd to dplyr, his new package that simplifies working with data frames in R, using a new verb syntax that allows users to express their data operations clearly and succinctly. Dplyr comes with several data backends, so R users can work transparently with data frames in R or SQL databases. Hadley's slides that accompany the talk are available here.

Joe talked about some of the new reactive programming features in Shiny, and walked the audience through several demos that showed off how easy it is to build Shiny apps. You can learn more about Shiny here; the SuperZIP example from the talk is visible here, and the source code is available here.

Thanks to both Hadley and Joe for spending their evening with us and for sharing their knowledge with members of the R community. Watch the full session below and we look forward to seeing you at the next one!

Many businesses care about accurately computing quantiles over their key metrics, which can pose several interesting challenges at scale. For example, many service level agreements hinge on these metrics, such as guaranteeing that 95% of queries return in < 500ms. Internet service providers routinely use burstable billing, a fact that Google famously exploited to transfer terabytes of data across the US for free. Quantile calculations just involve sorting the data, which can be easily parallelized. However, this requires storing the raw values, which is at odds with a pre-aggregation step that helps Druid achieve such dizzying speed. Instead, we store smaller, adaptive approximations of these values as the building blocks of our "approximate histograms." In this post, we explore the related problems of accurate estimation of quantiles and building histogram visualizations that enable the live exploration of distributions of values. Our solution is capable of scaling out to aggregate billions of values in seconds.

When we first met Druid, we considered the following example of a raw impression event log:

timestamp | publisher | advertiser | gender | country | click | price |
---|---|---|---|---|---|---|
2011-01-01T01:01:35Z | bieberfever.com | google.com | Male | USA | 0 | 0.65 |
2011-01-01T01:03:53Z | bieberfever.com | google.com | Male | USA | 0 | 0.62 |
2011-01-01T01:04:51Z | bieberfever.com | google.com | Male | USA | 1 | 0.45 |
... | ... | ... | ... | ... | ... | ... |
2011-01-01T01:00:00Z | ultratrimfast.com | google.com | Female | UK | 0 | 0.87 |
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Female | UK | 0 | 0.99 |
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Female | UK | 1 | 1.53 |

By giving up some resolution in the timestamp column (e.g., by truncating the timestamps to the hour), we can produce a summarized dataset by grouping by the dimensions and aggregating the metrics. We also introduce the "impressions" column, which counts the rows from the raw data with that combination of dimensions:

timestamp | publisher | advertiser | gender | country | impressions | clicks | revenue |
---|---|---|---|---|---|---|---|
2011-01-01T01:00:00Z | ultratrimfast.com | google.com | Male | USA | 1800 | 25 | 15.70 |
2011-01-01T01:00:00Z | bieberfever.com | google.com | Male | USA | 2912 | 42 | 29.18 |
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Male | UK | 1953 | 17 | 17.31 |
2011-01-01T02:00:00Z | bieberfever.com | google.com | Male | UK | 3194 | 170 | 34.01 |
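A roll-up like the one above can be sketched in a few lines of Python; this is an illustration of the idea, not Druid's actual ingestion code, with field names following the example tables:

```python
from collections import defaultdict

def rollup(events):
    """Summarize raw impression events by truncating timestamps to the
    hour and grouping on the dimension columns."""
    groups = defaultdict(lambda: {"impressions": 0, "clicks": 0, "revenue": 0.0})
    for e in events:
        hour = e["timestamp"][:13] + ":00:00Z"   # truncate to the hour
        key = (hour, e["publisher"], e["advertiser"], e["gender"], e["country"])
        row = groups[key]
        row["impressions"] += 1                  # one raw row per impression
        row["clicks"] += e["click"]
        row["revenue"] += e["price"]
    return dict(groups)
```

Each distinct (hour, dimensions) combination becomes a single summarized row, which is where the compression ratios discussed below come from.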

All is well and good if we content ourselves with computations that can be distributed efficiently such as summing hourly revenue to produce daily revenue, or calculating click-through rates. In the language of Gray et al., the former calculation is *distributive*: we can sum the raw event prices to produce hourly revenue over each combination of dimensions and in turn sum this intermediary for further coarsening into daily and quarterly totals. The latter is *algebraic*: it is a combination of a fixed number of distributive statistics, in particular, clicks / impressions.

However, sums and averages are of very little use when one wants to ask certain questions of bid-level data. Exchanges may wish to visualize the bid landscape so as to provide guidance to publishers on how to set floor prices. Because of our data-summarization process, we have lost the individual bid prices--and knowing that the 20 total bids sum to $5 won't tell us how many exceed $1 or $2. Quantiles, by contrast, are *holistic*: there is no constant bound on the size of the storage needed to exactly describe a sub-aggregate.

Although the raw data contain the unadulterated prices--with which we can answer these bid landscape questions exactly--let's recall why we much prefer the summarized dataset. In the above example, each raw row corresponds to an impression, and the summarized data represent an average compression ratio of ~2500:1 (in practice, we see ratios in the 1 to 3 digit range). Less data is both cheaper to store in memory and faster to scan through. In effect, we are trading off increased ETL effort against less storage and faster queries with this pre-aggregation.

One solution to support quantile queries is to store the entire array of ~2500 prices in each row:

timestamp | publisher | advertiser | gender | country | impressions | clicks | prices |
---|---|---|---|---|---|---|---|
2011-01-01T01:00:00Z | ultratrimfast.com | google.com | Male | USA | 1800 | 25 | [0.64, 1.93, 0.93, ...] |
2011-01-01T01:00:00Z | bieberfever.com | google.com | Male | USA | 2912 | 42 | [0.65, 0.62, 0.45, ...] |
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Male | UK | 1953 | 17 | [0.07, 0.34, 1.23, ...] |
2011-01-01T02:00:00Z | bieberfever.com | google.com | Male | UK | 3194 | 170 | [0.53, 0.92, 0.12, ...] |

But the storage requirements for this approach are prohibitive. If we can accept *approximate* quantiles, then we can replace the complete array of prices with a data structure that is sublinear in storage--similar to our sketch-based approach to cardinality estimation.

Ben-Haim and Tom-Tov suggest summarizing the unbounded-length arrays with a fixed number of (count, centroid) pairs. Suppose we attempt to summarize a set of numbers with a single pair. The mean (centroid) has the nice property of minimizing the sum of the squared differences between it and each value, but it is sensitive to outliers because of the squaring. The median is the minimizer of the sum of the absolute differences and for an odd number of observations, corresponds to an actual bid price. Bid prices tend to be skewed due to the mechanics of second price auctions--some bidders have no problem bidding $100, knowing that they will likely only have to pay $2. So a median of $1 is more representative of the "average" bid price than a mean of $20. However, with the (count, median) representation, there is no way to merge medians: knowing that 8 prices have a median of $.43 and 10 prices have a median of $.59 doesn't tell you that the median of all 18 prices is $.44. Merging centroids is simple--just use the weighted mean. Given some approximate histogram representation of (count, centroid) pairs, we can make *online* updates as we scan through data.
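Merging centroids by weighted mean is a one-liner, sketched here in Python:

```python
def merge_centroids(c1, c2):
    """Merge two (count, centroid) pairs by taking the weighted mean."""
    (n1, x1), (n2, x2) = c1, c2
    n = n1 + n2
    return (n, (n1 * x1 + n2 * x2) / n)
```

Merging (8, $.50) with (10, $.95) gives (18, $.75): unlike medians, nothing beyond the pair itself is needed to combine sub-aggregates, which is what makes the representation mergeable across nodes.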

Of course, there is no way to accurately summarize an arbitrary number of prices with a single pair, so we are confronted with a classical accuracy/storage/speed tradeoff. We can fix the number of pairs that we store like so:

timestamp | publisher | advertiser | gender | country | impressions | clicks | AH_prices |
---|---|---|---|---|---|---|---|
2011-01-01T01:00:00Z | ultratrimfast.com | google.com | Male | USA | 1800 | 25 | [(1, .16), (48, .62), (83, .71), ...] |
2011-01-01T01:00:00Z | bieberfever.com | google.com | Male | USA | 2912 | 42 | [(1, .12), (3, .15), (30, 1.41), ...] |
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Male | UK | 1953 | 17 | [(2, .03), (1, .62), (20, .93), ...] |
2011-01-01T02:00:00Z | bieberfever.com | google.com | Male | UK | 3194 | 170 | [(1, .05), (94, .84), (1, 1.14), ...] |

In the first row, there is one bid at $.16, 48 bids with an average price of $.62, and so on. But given a set of prices, how do we summarize them as (count, centroid) pairs? This is a special case of the k-means clustering problem, which in general is NP-hard, even in the plane. Fortunately, however, the one-dimensional case is tractable and admits a solution via dynamic programming. The B-H/T-T approach is to iteratively combine the closest two pairs together by taking weighted means until we reach our desired size.
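A minimal Python sketch of this compression step (not the production Java implementation, and using a linear scan rather than a heap for clarity):

```python
def compress(pairs, max_size):
    """Ben-Haim/Tom-Tov summarization: repeatedly merge the two closest
    (count, centroid) pairs until at most max_size pairs remain."""
    pairs = sorted(pairs, key=lambda p: p[1])   # sort by centroid
    while len(pairs) > max_size:
        # find the adjacent pair with the smallest centroid gap
        i = min(range(len(pairs) - 1),
                key=lambda j: pairs[j + 1][1] - pairs[j][1])
        (n1, x1), (n2, x2) = pairs[i], pairs[i + 1]
        n = n1 + n2
        pairs[i:i + 2] = [(n, (n1 * x1 + n2 * x2) / n)]   # weighted mean
    return pairs
```

Compressing the four singleton pairs at 1, 2, 10 and 11 down to two pairs merges each close couple, yielding (2, 1.5) and (2, 10.5).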

Here we illustrate the B-H/T-T summarization process for the integers 1 through 10, 15 and 20, and 12 and 25 each repeated 3 times, for 3 different choices of the number of (count, centroid) pairs.

There are 4 salient operations on these approximate histogram objects:

- Adding new values to the histogram: add a new pair, (1, value), and merge the closest pair if we exceed the size parameter
- Merging two histograms together: repeatedly add all pairs of values from one histogram to another
- Estimating the count of values below some reference value: build trapezoids between the pairs and look at the various areas
- Estimating the quantiles of the values represented in a histogram: walk along the trapezoids until you reach the desired quantile

We apply operation 1 during our ETL phase, as we group by the dimensions and build a histogram on the resulting prices, serializing this object into a Druid data segment. The compute nodes repeat operation 2 in parallel, each emitting an intermediate histogram to the query broker for combination (another application of operation 2). Finally, we can apply operation 3 repeatedly to estimate counts in between various breakpoints, producing a histogram plot. Or we can estimate quantiles of interest with operation 4.

Here we review the trapezoidal estimation of Ben-Haim and Tom-Tov with an example. Suppose we wanted to estimate the number of values less than or equal to 10 (the exact answer is 10) knowing that there are 10 points with mean 5.5, 4 with mean 12.8, and 4 with mean 23.8. We assume that half of the values lie to the left and half lie to the right (we shall improve upon this assumption in the next section) of the centroid. So we mark off that 5 values are smaller than the first centroid (this turns out to be correct). We then draw a trapezoid connecting the next two centroids and assume that the number of values between 5.5 and 10 is proportional to the area that this sub-trapezoid occupies (the latter half of which is marked in blue). We assume that half of the 10 values near 5.5 lie to its right, and half of the 4 values near 12.8 lie to its left and multiply the sum of 7 by the ratio of areas to come up with our estimate of 5.05 in this region (the exact answer is 5). Therefore, we estimate that there are 10.05 values less than or equal to 10.
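The same computation in a Python sketch (our arithmetic comes out near 10, matching the worked example up to rounding; the function name is ours):

```python
def count_below(pairs, b):
    """Estimate how many values are <= b from sorted (count, centroid)
    pairs via the B-H/T-T trapezoid interpolation."""
    i = 0
    while i < len(pairs) and pairs[i][1] <= b:
        i += 1
    if i == 0:
        return 0.0                                   # b is left of all centroids
    if i == len(pairs):
        return float(sum(n for n, _ in pairs))       # b is right of all centroids
    # all of the earlier pairs, plus half of the bracketing left pair
    s = sum(n for n, _ in pairs[:i - 1]) + pairs[i - 1][0] / 2.0
    (n0, x0), (n1, x1) = pairs[i - 1], pairs[i]
    frac = (b - x0) / (x1 - x0)
    nb = n0 + (n1 - n0) * frac                       # interpolated height at b
    s += (n0 + nb) / 2.0 * frac                      # area of the sub-trapezoid
    return s
```

Running it on the example pairs (10, 5.5), (4, 12.8), (4, 23.8) with b = 10 gives roughly 10, against an exact answer of 10.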

Here we describe some improvements and efficiencies specific to our implementation of the B-H/T-T approximate histogram.

Computational efficiency at query time (operation 2) is more dear to us than at ETL time (operation 1). That is, we can spend a few more cycles building the histograms if that allows for a very efficient means of combination. Our Java-based implementation of operation 2, using a heap to keep track of the differences between pairs, can combine roughly 200K (size 50) histograms per second per core (on an i7-3615QM). This compares unfavorably with core scan rates an order of magnitude or two higher for count, sum, and group-by queries, although, to be fair, a histogram contains 1-2 orders of magnitude more information than a single count or sum. Still, we sought a faster solution. If we know ahead of time the proper threshold below which to merge pairs, then we can do a linear scan through the sorted pairs (the sorting can be done at ETL time), choosing to merge or not based on the threshold. Determining this threshold exactly is difficult to do efficiently, but eschewing the heap-based solution for this approximation results in core aggregation rates of ~1.3M (size 50) histograms per second.

We have 3 different serialization formats when indexing depending on the nature of the data, for which we use the most efficient encoding:

- a dense format, storing all counts and centroids up to the configurable size parameter
- a sparse format, storing some number of pairs below the limit
- a compact format, storing the individual values themselves

It is important to emphasize that we can specify different levels of accuracy hierarchically. The above formats come into play when we index the data, turning the arrays of raw values into (count, centroid) pairs. Because indexing is slow and expensive and Druid segments are immutable, it's better at this level to err on the side of accuracy. So we do something like specify a maximum of 100 (count, centroid) pairs in indexing, which will allow for greater flexibility at query time, when we aggregate these together into some possibly different number of (count, centroid) pairs.

We use the superfluous sign bit of the count to determine whether a (count, centroid) pair with count > 1 is exact or not. Does a value of (2, 1.51) indicate 2 bid prices of $1.51, or 2 unequal bid prices that average to $1.51? The trapezoid method of count estimation makes no such distinction and will "spread out" its uncertainty equally. This can be problematic for the discrete, multimodal distributions characteristic of bid data. But given knowledge of which (count, centroid) pairs are exact, we can make more accurate estimates.

Recall that our data typically exhibit high skewness. Because the closest histogram pairs are continuously merged until the number of pairs is small enough, the remaining pairs are necessarily (relatively) far apart. It is logical to summarize 12 prices around $.10 and 6 prices around $.12 as 18 prices around $.11, but we wouldn't want to merge all prices under $2 because of the influence of 49 wildly-high prices--unless we are particularly interested in those outliers, that is. At the very least, we would like to be able to control our "area of interest": do we care about the majority of the data or about those few outliers? When we aggregate millions or billions of values, even with the tiniest skew, we'll end up summarizing the bulk of the distribution with a single (count, centroid) pair. Our solution is to define special limits, inside of which we maintain the accuracy of our estimates. This typically jibes well with setting x-axis limits for our histogram visualization.

Here, we plot a histogram over ~18M prices, using default settings for the x-axis limits and bin widths. Due to the high degree of skew, the inferred limits are suboptimal, as they include prices of ~$100. In addition, there are even negative bid prices (which could be erroneous, or a way of signaling disinterest in the auction)!

Below, we set our resolution limits to $0 and $1 and vary the number of (count, centroid) pairs in our approximate histogram datastructure. The accuracy using only 5 pairs is abysmal and doesn't even capture the second mode in the $.20 to $.25 bucket. 50 pairs fare much better, and 200 are very accurate.

Let's take a look at some benchmarks on our modest demo cluster (4 m2.2xlarge compute nodes) with some wikipedia data. We'll look at the performance of the following aggregators:

- a count aggregator, which simply counts the number of rows
- a uniques aggregator, which implements a version of the HyperLogLog algorithm
- approximate histogram aggregators, varying the resolution from 10 pairs to 50 pairs to 200 pairs

We get about 1-3M summarized rows of data per week from Wikipedia, and the benchmarks over the full 32 week period cover 84M rows. There appears to be a roughly linear relationship between the query time and the quantity of data:

Indeed, the cluster scan rates tend to flatten out once we hit enough data:

We previously obtained cluster scan rates of 26B rows per second on a beefier cluster. Very roughly speaking, the approximate histogram aggregator is 1/10 the speed of the count aggregator, so we might expect speeds of 2-3B rows per second on such a cluster. Recall that our summarization step compacts 10-100 rows of data into 1, for typical datasets. This means that it is possible to construct histograms representing tens to hundreds of billions of prices in seconds.

If you enjoyed reading thus far and have ideas for how to achieve greater speed/accuracy/flexibility, we encourage you to join us at Metamarkets.

Finally, my colleague Fangjin Yang and I will continue the discussion in October in New York at the Strata Conference where we will present, "Not Exactly! Fast Queries via Approximation Algorithms."

As a consequence, the common problem for most ad-tech companies is that their analytics capabilities aren't able to accommodate "Big Data," nor are they able to cater to the diverse nature of their buyer/seller needs. The density and velocity of real-time data transactions aren't effectively analyzed, and rarely are they ingested and processed quickly enough. The result is a tedious and exhaustingly slow data processing timeline, further delayed by the use of outdated analytics tools from the 90s (e.g. Excel).

These practices force users to build out pivot tables and search for insights in CSV files cobbled together from Hadoop clusters. If you issue the wrong query, you run the risk of having to start all over again. If not, then hours or even days later, you will hopefully have found the insight you were looking for. Moreover, chances are high that you are basing your decision on a sample of data that you hope is directional and convincing enough to act upon. In the past, these practices would pass as acceptable across the industry.

However, marketers are demanding faster turnaround, more potent insights and greater accountability. As a result, ad-tech buyers and market makers need the tools and capabilities to match these pressures, because those who fall behind the curve will become less relevant in the age of data-driven advertising.

The "do-it-yourself" Big Data analytics stack has proven, time and time again, to be extremely costly, time-consuming and resource-draining. The features and requirements necessary just to keep pace with the demands of real-time transactions are mind-boggling: dedicated engineering resources, system revisions, documentation, UI design, business process analysis, mockups, user reviews and beta testing. It typically takes years to develop a product that can properly and effectively corral Big Data. By that point, the market will have evolved, the business requirements will have changed, and you will have lost valuable time, money and manpower that could have been applied to your core value proposition.

For example, consider a typical "do-it-yourself" analytics stack. At the ETL/processing level, a standing Hadoop cluster will cost you roughly $600K/year (assuming about 10TB of data daily). To say that ETL is difficult would be a gross understatement, and if you lack Hadoop expertise in house, you'll need to hire consultants to fill the support gap; a fix at the ETL level can take days or weeks. At the storage level, you might use a data store such as Vertica or Netezza, on which ad-tech firms are known to spend over $1M/year on licenses and hardware. Your analytics layer might consist of SAS Enterprise Miner, coming in at an estimated $100K for a small team. Finally, a generalized visualization dashboard from a company like Tableau or Spotfire will quickly cost you up to $100K annually to serve dozens of end users. And that's just the software stack. Next, tack on setup, customization and integration, along with maintenance and support. Maintenance alone becomes an unavoidable fixed cost, and at some point a core component of your stack will likely become obsolete and, as time goes on, harder to replace with a more contemporary solution. If you're lucky, $2 million later, you'll have a system that doesn't even generate real-time business insights.
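Tallying the illustrative figures above makes the "$2 million later" punchline concrete. Every number here is a ballpark estimate from the text, not a vendor quote:

```python
# Rough annual cost of the illustrative "do-it-yourself" stack described
# above. All figures are the ballpark estimates cited in the text.
stack_costs = {
    "ETL/Processing (Hadoop cluster, ~10TB/day)": 600_000,
    "Storage (Vertica/Netezza licenses + hardware)": 1_000_000,
    "Analytics (SAS Enterprise Miner, small team)": 100_000,
    "Visualization (Tableau/Spotfire, dozens of users)": 100_000,
}

software_total = sum(stack_costs.values())
print(f"Software stack alone: ${software_total:,}/year")
# Setup, customization, integration and ongoing maintenance push the
# real figure toward the ~$2M mark cited above.
```

The software line items alone come to $1.8M/year before a single hour of setup, integration or maintenance is counted.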

This business model is unsustainable and inefficient. The good news is that it doesn’t have to be that way.

At Metamarkets, we believe our solution changes this outdated model with a fully integrated, end-to-end stack. We can meet your need for speed and depth while allowing your company to focus on its core business and technology. Our solution is a cloud-based, scalable, flexible, real-time, interactive analytics service built on top of our custom-built open-source datastore, Druid, which can process high-volume queries at an unprecedented rate. The result? We can return complex queries in under a second, faster than any database handling the data volumes a typical ad-tech company demands. Our system can join data streams from multiple sources, including server-to-server (S2S) connections, RTB exchanges, third-party data providers and more, so you get both a holistic view of your business and the granular insights that drive key decisions at a moment's notice. Best of all: this can be deployed in less than four weeks, at a fraction of the cost of a "do-it-yourself" analytics stack.
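To give a flavor of what querying a datastore like Druid looks like, here is a minimal sketch of a native timeseries query. The datasource name, interval, metric names and broker URL are hypothetical placeholders, not part of any real deployment:

```python
import json

# A minimal sketch of a Druid native timeseries query. The datasource
# name ("ad_impressions"), the metric field ("price") and the interval
# are illustrative placeholders.
query = {
    "queryType": "timeseries",
    "dataSource": "ad_impressions",          # hypothetical datasource
    "granularity": "hour",
    "intervals": ["2013-09-01/2013-09-02"],
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "doubleSum", "name": "spend", "fieldName": "price"},
    ],
}

payload = json.dumps(query)
# A client would POST this JSON to a Druid broker endpoint, e.g.:
#   requests.post("http://broker:8082/druid/v2/", data=payload,
#                 headers={"Content-Type": "application/json"})
print(payload)
```

Each response row carries the hourly count and summed spend, which is the kind of pre-aggregated result that comes back in well under a second on an interactive dashboard.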

In all likelihood, the companies best able to rise to the challenge of Big Data analytics will survive, while the rest continue to drain their resources as competitors capitalize on the ability to quickly identify key insights and act on real-time business intelligence. Choosing to buy rather than build lets your company focus its valuable resources on its core business. At Metamarkets, we couldn't be more excited about empowering your managers to spend less time on the tactical and more time on the strategic. For a demo of our product, please reach out to contact@metamarkets.com. We look forward to hearing from you.