Five Principles for Designing Real-Time Data Pipelines

In the course of onboarding dozens of clients and petabytes of data to Metamarkets’ real-time solution, our Data Engineering team has learned a handful of best practices for ETL. That’s why today we’re announcing the release of our Real-Time Integration SDK, a set of open source tools that incorporate these learnings and make it even easier to get up and running with Metamarkets. In the last month alone, it has enabled several clients to accelerate through the complex process of building a real-time data integration at petabyte scale.

The SDK includes a Java client library for posting data to our real-time API, the RdiClient, and a serialization library for generating OpenRTB-based event records, which we call RadTech Datatypes. Though the full details about the SDK are available in our documentation, in this post I’d like to share some of the principles that went into designing it.

1. Mind the Gap (i.e. Design for Data Retention)

Though real-time ingestion paves the way for faster time-to-insight, it also introduces a new set of technical challenges compared to batch-oriented systems. For instance, you lose the luxury of waiting for the dataset to be complete, and processing issues surface quickly in the UI.

Gaps and delays can be extremely disruptive for end users, and even small discrepancies can erode trust in the solution. For this reason, when we built our real-time pipeline, we designed for data retention. Today we process tens of billions of events daily, all of it backed up for a week in Kafka (a message bus) and in Amazon S3 for up to 90 days. This approach, which is part of our take on “lambda architecture”, allows for multiple failsafes if data comes in late or there are snags during processing.

However, our system can only provide complete and consistent reporting if our clients have the tools to embrace data retention upstream. As such, the RdiClient library never silently discards data it fails to post; it first asks the user what to do with that data. Though you could certainly choose to discard it, we’ve included a ready-made Kafka implementation to encourage data retention.
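As a rough illustration, here is a minimal sketch of that kind of retention hook: any record that fails to post is parked on a Kafka topic rather than dropped. The `RetentionHandler` interface and `onPostFailure` callback are hypothetical stand-ins for the SDK’s actual hooks; the producer calls themselves are standard `kafka-clients` API.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical retention hook: the real SDK exposes its own interface for
// deciding what to do with records that could not be posted.
interface RetentionHandler {
    void onPostFailure(String record, Exception cause);
}

// Sketch of a Kafka-backed implementation: failed records are parked on a
// topic (retained for several days) instead of being thrown away.
class KafkaRetentionHandler implements RetentionHandler {
    private final KafkaProducer<String, String> producer;
    private final String topic;

    KafkaRetentionHandler(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    @Override
    public void onPostFailure(String record, Exception cause) {
        // Park the record for later replay; a separate consumer can re-post it
        // once the underlying issue is resolved.
        producer.send(new ProducerRecord<>(topic, record));
    }
}
```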

2. Don’t Over-Complicate Error Handling

When sending data between two systems in real time, you can count on running into exceptions; network issues alone make that inevitable. Though error handling and a good retry strategy may seem like obvious components of an API integration, they turn out to be a common stumbling block.

When we first set out to build the RdiClient, we attempted to write error handling logic to deal with every possible exception on a case-by-case basis. Before we knew it, the code for that logic was longer than that of the entire rest of the client. Taking a step back, we realized that a uniform approach would cover the majority of issues, so we rewrote it using an exponential backoff retry strategy for all exception types. It’s simple, but it works more than 90% of the time. If data still doesn’t make it through, users can selectively implement more specific logic.
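To make the uniform approach concrete, here is a minimal sketch of an exponential-backoff retry loop applied to every exception type alike. The class, method, and parameter values are illustrative, not the SDK’s own.

```java
import java.util.concurrent.Callable;

class ExponentialBackoffRetry {
    // Illustrative defaults: start at 1s, double each attempt, cap at 64s.
    private static final long INITIAL_BACKOFF_MS = 1_000;
    private static final long MAX_BACKOFF_MS = 64_000;
    private static final int MAX_ATTEMPTS = 10;

    static <T> T callWithRetries(Callable<T> call) throws Exception {
        long backoffMs = INITIAL_BACKOFF_MS;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Uniform handling: every exception type gets the same treatment.
                lastFailure = e;
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
            }
        }
        // Out of attempts: surface the failure so the caller can retain the data.
        throw lastFailure;
    }
}
```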

3. Separate Your Reporting and Revenue-Generating Systems

Particularly among young companies that start out with low data volumes, a common pitfall is to sacrifice scalability for short-term ease of setup. Separating your reporting pipeline from the systems generating the data ensures that an error in one won’t take the other down. It also makes it easier to maintain your data pipeline without having to make changes in a more sensitive production code base. For these reasons, we strongly recommend that clients run the RdiClient safely out of band of mission-critical systems (e.g. servers conducting auctions or bidding).

There are several sound approaches you can take here. One we’ve seen is to use a proxy server and either drop data or spill it to disk if a bounded queue fills up. Alternatively, we often recommend the use of message buffering services (e.g. Kafka or RabbitMQ) for backing up data until you can be sure it has made it through to the Metamarkets API. Log retention should be set to a window that preserves data for as long as it might take to fix issues that arise in posting data to the real-time API. We typically suggest that clients keep the data for 5-7 days, or however long you might be out of the office without VPN access.
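As one concrete take on the bounded-queue idea, the sketch below uses a fixed-capacity in-memory queue on the proxy and spills records to a local file when the queue is full, so the revenue-generating path never blocks. The class name, queue size, and spill path are purely illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SpillingBuffer {
    // Illustrative bound; size this to the memory you can afford to lose on a crash.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(100_000);
    private final Path spillFile = Paths.get("/var/spool/rdi/spill.log"); // hypothetical path

    void accept(String record) throws IOException {
        // offer() returns false instead of blocking when the queue is full.
        if (!queue.offer(record)) {
            // Spill to disk rather than blocking or dropping the record.
            Files.write(spillFile, (record + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    // A separate sender thread drains the queue and posts records downstream.
    String next() throws InterruptedException {
        return queue.take();
    }
}
```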

4. Invest Early in Monitoring (It’ll Save You Huge Headaches)

Proactive monitoring is great. It’ll help you fix small problems before they turn into time-consuming backfills. It’ll also get you ahead of the swarm of emails you’re liable to get from confused end users when something goes wrong.

The RdiClient comes with a built-in object for directly exposing pipeline health metrics, such as the number of events successfully posted or the HTTP errors encountered. This allows you to easily integrate data delivery metrics into an existing pipeline for infrastructure monitoring and alerting. We realize that it’s easy to treat monitoring as an afterthought, but investing in it early will empower you to automate away issues as they arise.
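As a rough sketch of wiring those metrics into an existing alerting setup: the `PipelineMetrics` interface and its `postedCount()`/`httpErrorCount()` accessors below are hypothetical stand-ins for the SDK’s metrics object, and the print statements are placeholders for whatever monitoring system you already run.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical view of the SDK's metrics object; the real accessor names may differ.
interface PipelineMetrics {
    long postedCount();
    long httpErrorCount();
}

class DeliveryMonitor {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(PipelineMetrics metrics) {
        scheduler.scheduleAtFixedRate(() -> {
            long posted = metrics.postedCount();
            long errors = metrics.httpErrorCount();
            // Forward counters to your existing monitoring system
            // (StatsD, Graphite, CloudWatch, etc.); printing is a placeholder.
            System.out.printf("rdi.posted=%d rdi.http_errors=%d%n", posted, errors);
            // Illustrative alert condition: page someone if errors dominate.
            if (errors > posted) {
                System.err.println("ALERT: RDI error count exceeds posted count");
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}
```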

5. Leverage Standards to Increase Interoperability

If you’ve ever spent time under the hood in the ad tech data world, you’re probably acutely aware of the benefits of standardization.  Whether you’re integrating with bidding and reporting APIs or working to get an aggregate view of inventory across suppliers, it makes a huge difference if you don’t have to develop around a custom data schema every time.  In a market where standards are well-developed, you can avoid developing a custom data model for reporting and focus on work that will differentiate your business.

Metamarkets’ standard API schema for programmatic data leans heavily on the IAB’s OpenRTB API specification. Given that OpenRTB is a communication protocol between buyers and sellers, we had to make some modifications to meet the requirements of post-transaction reporting, but we’ve retained the core structure. While leveraging standards generally saves you time, it still takes quite a bit of error-prone code to generate the highly nested OpenRTB JSON records. To make this easier, we also included a Jackson-based Java serialization library in the SDK, called RadTech Datatypes, for creating these standards-based event records.
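For a feel of the nesting involved, here is a minimal sketch that builds a small OpenRTB-style record with plain Jackson. RadTech Datatypes wraps this kind of construction in typed classes, and the exact field set required by the API is documented separately, so treat the fields below as illustrative only.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

class OpenRtbStyleRecord {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Top-level record keyed by an auction/request id.
        ObjectNode record = mapper.createObjectNode();
        record.put("id", "auction-1234");
        record.put("timestamp", "2015-06-01T00:00:00Z"); // illustrative field

        // OpenRTB nests impressions under "imp", each with its own ad unit details.
        ArrayNode imps = record.putArray("imp");
        ObjectNode imp = imps.addObject();
        imp.put("id", "1");
        ObjectNode banner = imp.putObject("banner");
        banner.put("w", 300);
        banner.put("h", 250);
        imp.put("bidfloor", 0.50);

        System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(record));
    }
}
```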

How to Move Forward

You can download the Integration SDK on GitHub here or import it directly into your project from Artifactory. Because the project is open source, we’re also looking forward to seeing our clients make improvements, such as the recent port of RadTech Datatypes to Node.js by George and the team at Avocarrot. If you have questions or feedback about implementing the SDK, or about working with Metamarkets, don’t hesitate to reach out.