Metamarkets

Four Things I Learned at Advertising Week

Frank Bauch — Thu, 28 Sep 2017 20:53:56 +0000

This week, I was lucky enough to attend Advertising Week in New York. For those who haven’t been, it’s an unorthodox conference in that there’s no central meeting place or check-in location. Instead, various venues around Times Square in Manhattan transform into presentation spaces for the leaders in the marketing and advertising world.

One could surmise from the long lines leading into panels this week that there’s tremendous interest in hearing more about how companies are tackling their biggest marketing challenges. Here are a few of the major themes that stood out to me:

1. Programmatic’s promising future

Advertising Week covers a broad spectrum of advertising topics – you’ll find upstart digital agencies and more established, traditional advertisers. You’ll hear conversations about fashion, entertainment, and even mascots (a panel featuring Smokey Bear and Mr. Peanut was a dream come true!). And there were certainly plenty of discussions about AI and its future.

Despite the range of topics, it was clear that programmatic buying was at the very center of conversations this week. Statistics from eMarketer about the projected $33 billion spent on programmatic display ads in 2017 were seemingly a prerequisite to include within the first few slides of each presentation.

In a panel about the “Next Era of Programmatic,” Tim Cadogan of OpenX projected that we’re arriving at the third stage for programmatic. The first stage proved the business model, the second stage focused on expanding scope – this third stage will include additional expansion of scope and scale, but will also reflect programmatic becoming the default transaction for digital media.

2. Transparency is the right side of history

Brands and publishers who had been flying blind in the past are now laser-focused on getting transparent data from partners to minimize their wasted spend. This led to plenty of stories being shared on stage about revelations from the past few months of reviewing their supply chains.

Jess Barrett of the Financial Times recalled how her team found that 15 exchanges, including many of the well-known players in the industry, claimed to be selling programmatic video inventory from FT.com, even though the FT doesn’t sell video ads programmatically at all. Sarah Warner of GroupM recalled the eye-opening article about Chase reducing their advertising from 400,000 sites to just 5,000 – but Warner says that’s still too many sites for her; her whitelist only includes about 1,000 sites today.

Jason Fairchild of OpenX told the crowd at a panel on radical transparency that buyers have a choice – they don’t have to be subjected to “flea-market marketplaces,” a description he assigns to marketplaces that aren’t revealing the origin data or additional insights on their inventory. Instead, they can choose to work with the right technology partners who care about transparency. Scott Gifis of AdRoll reminded us that sometimes that means bringing some bad news to your buyers that reveals areas where they wasted money and showing them how to fix it.

All in all, it’s clear there is an industry-wide conversation happening about working with more transparent partners and that we’re starting to see an impact made with companies who’ve already made changes.

3. Who solves the problems?

There were several debates this week about how we should tackle the industry’s problems with trust, transparency and accurate measurement. I heard at least a few conversations where industry leaders expressed their thoughts that agencies in their current model weren’t properly equipped with the right tech to address these needs.

Ideally, brands are looking for guidance from their partners, be it agencies, DSPs or another tech provider, to help them navigate the ever-changing digital landscape with new technology. But it’s also important that everyone in the transaction recognize their role in fixing the industry’s problems. Brands who don’t take control of their data, or who don’t have a basic understanding of the programmatic process upon which they’re spending their money, will be most vulnerable to wasting dollars by not asking the right questions.

There’s also a need for balance between the Walled Gardens of digital marketing and neutral third-parties that can validate their data. One of the more popular panels at the show featured executives from Facebook, Pinterest and Google discussing their approaches to analytics – it was clear from the responses of the panelists that these large companies have heard the concerns of marketers and are working to become more transparent. Google’s Babak Pahlavan talked about the importance of trust: “We know our first party solutions need to be trusted, but we need accredited, trusted third parties to measure us as well.”

4. Demand for insights, not just data

Oftentimes, you can learn a lot by paying attention to the job titles of panelists discussing a particular topic.

During one session on data accuracy at the Tech Xperience stage, a panelist pointed out that everyone on stage was a CRO – in a panel about data accuracy, there were no data scientists or Chief Data Officers. This was a reminder that data is really the vehicle to getting insights that drive revenue. We can’t just focus on exposing data for the sake of exposing it – instead we should remember that transparency is about revealing data in ways that get us to the revenue-driving insights quicker.

To that end, several discussions focused on the need for better methods of measurement that actually accomplish this goal. A partnership between the ARF, CIMM and Pre-Meditated Media announced a new “data labeling” program. It’s a step in the right direction toward telling complete stories with data.

Needless to say, it was a productive week hearing from industry leaders across the ecosystem and diving deeper into the problems each player is facing. If you’d like to continue the conversations from this week, we’d love to talk about ways to help you be more transparent with programmatic data using interactive analytics! Visit us at metamarkets.com to learn more.

How to Discover Revenue Opportunities in Your Bid Landscape with Metamarkets Heat Map

Jordan Richman — Thu, 21 Sep 2017 16:19:08 +0000

With hundreds of publishers and buyers, it’s difficult to keep track of all of your partners’ bidding activity. Many demand managers get reports each morning informing them where their accounts have the highest spend, but those reports likely won’t tell you much about all the premium inventory your buyers are missing out on. Discovering gaps in your partners’ bid landscape can reduce revenue leakage and increase spend.

The Heat Map view in Metamarkets Explore is a great way to expose these gaps. The chart displayed here shows your top 10 bidders on the vertical axis and your top 10 publishers on the horizontal axis by total revenue for a full week. These are your biggest buyers and highest performing inventory on your exchange.

The darker the blue square, the higher the value of the metric you’re filtered on. Lighter squares have minimal spend, and no color means no spend at all.

In this chart, you can see Bidder 2 and Bidder 5 are buying across all your top publishers. But what about Bidder 9? Bidder 9 is one of the top 10 revenue drivers for your exchange, but they are only buying on 3 of your premium publishers.

This could indicate an integration error with this buyer, or that a temporary block was put on them when they were running a specific campaign and it was never removed. It could also mean it’s just not the right type of inventory. No matter what, it’s definitely worth investigating.
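The same gap-spotting can be sketched outside the UI. The snippet below is a minimal illustration, not how Metamarkets Explore computes the Heat Map: it assumes you have exported hypothetical (bidder, publisher, revenue) rows, and the `min_coverage` threshold and all names and numbers are made up for the example.

```python
from collections import defaultdict

# Hypothetical (bidder, publisher, revenue) rows; names and numbers are illustrative only.
rows = [
    ("Bidder 2", "Pub A", 120.0), ("Bidder 2", "Pub B", 90.0),
    ("Bidder 2", "Pub C", 75.0), ("Bidder 2", "Pub D", 60.0),
    ("Bidder 9", "Pub A", 200.0), ("Bidder 9", "Pub B", 0.0),
    ("Bidder 9", "Pub C", 0.0), ("Bidder 9", "Pub D", 0.0),
]

def coverage_gaps(rows, min_coverage=0.5):
    """Return {bidder: [publishers with no spend]} for bidders below min_coverage."""
    publishers = sorted({pub for _, pub, _ in rows})
    spend = defaultdict(float)
    for bidder, pub, revenue in rows:
        spend[(bidder, pub)] += revenue
    gaps = {}
    for bidder in sorted({b for b, _, _ in rows}):
        # An "empty square" on the heat map is a publisher with zero spend.
        missing = [p for p in publishers if spend[(bidder, p)] <= 0.0]
        covered = len(publishers) - len(missing)
        if covered / len(publishers) < min_coverage:
            gaps[bidder] = missing
    return gaps

gaps = coverage_gaps(rows)
print(gaps)
```

A list like this only tells you where to look; whether a gap is an integration error, a stale block, or simply the wrong inventory still needs a conversation with the buyer.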

Because of the drill-down nature of Metamarkets, the Heat Map view can be used for many different use cases. You can visualize which publishers command the highest bids from your buyers to better understand your inventory, or look at bidders against ad size by total bid count, to better shape the QPS your partners are being sent.

The Heat Map is one of many great options for visualizing your programmatic data within Metamarkets Explore. We know our customers need flexible ways to analyze their data, with visuals that reveal deeper layers of insights in real-time. If you’d like to learn more about using the Heat Map or any other views, reach out to learn more about what’s available to you with Metamarkets.

Tips for Optimizing Your Active Campaigns

Frank Bauch — Mon, 18 Sep 2017 18:33:05 +0000

In a recent post, we discussed a few examples of why providing your buyers access to your inventory through Metamarkets Explore can help them discover new opportunities. But what about when a campaign is in mid-flight? Providing access to data visualizations that clearly show your inventory throughout a campaign can be the difference between a successful campaign and one that drastically misses expectations.

Some of our customers tweak their campaigns across more than 20 dimensions at once – they don’t need to write any new queries or wait for results; they can instantly evaluate inventory on an ad hoc basis when dimensions change and act quickly on that information to maximize their ROI. Even with alerts in place, there are always going to be reasons to shift some aspect of the campaign mid-flight to tweak the audience type along any number of parameters.

For example, let’s say you have had a campaign running for two days and want to make sure it is reaching expectations. With our inventory discovery solution, you can instantly click into your campaign to see how many bids you are winning and the total number of clicks and impressions those bids are delivering.

When this campaign was initially planned there was a lot of available inventory targeting the Health and Fitness audience (the target audience of this campaign) at around 85 cents, but after two days the volume of bids won isn’t meeting expectations. By clicking on the target segment you can see most winning bids are now going for a dollar or more, which means bumping up the bid price by 15 cents has potential to win more bids for the inventory desired.

What if you are price constrained and can’t bump your bid price to match the new average bid amount? By clicking into your desired inventory price point range, you can identify other audience categories with higher inventory amounts. In this case, the Arts and Entertainment segment is another applicable target I would like to hit with this campaign and I can work with my exchange partners directly to get the scope of my campaign changed on the fly to target a new audience and inventory.

Examples like these are the best way to provide a clear understanding of key inventory performance metrics – if you take advantage of these metrics, you’ll help buyers identify opportunities to increase spending on their campaigns at the points where impressions are highest and inventory is at its best price. It’s a win-win!

For more information on how to utilize interactive analytics to understand your inventory availability, feel free to contact us for a demo of the latest capabilities with Metamarkets Explore.

Druid and Spark Together – Mixing Analytics Workflows

Charles Allen — Fri, 15 Sep 2017 15:53:31 +0000

One method of looking at the human thought process is that we have different brain workflows for different analytics and data processing needs. At Metamarkets we have the same thing for our data processing machines. This post will explore some of our experience with bin-packing query nodes featuring Druid with batch processing featuring Spark, using Apache Mesos as the resource coordinator.

Thinking Fast

The above shows a typical “work” pattern for a Druid historical node throughout a typical weekday (times shown are EDT), measured as the CPU seconds consumed by the JVM to answer queries. Two things stand out in the usage: there is a daily seasonality that peaks just after noon in New York, and spikes of extreme work come in short bursts. We mitigate the bursty queries by configuring query priorities so that queries expected to be light and fast get higher priority than queries expected to take longer to answer. We also have different nodes tuned with different work queue settings. To keep response times reasonable at the 90th percentile and above, we have to provision CPU capacity for such peaks. Overall, this means the CPUs on these machines end up with a lot of idle time where they are doing no work.
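To make the priority idea concrete, here is a minimal sketch of attaching a priority to a Druid query through the query context, which accepts a `priority` key. Using the queried time span as a proxy for "light" versus "heavy" is an assumption for illustration, as are the thresholds, the priority values and the data source name; this is not the configuration we run.

```python
import json
from datetime import datetime, timedelta

def with_priority(query, start, end):
    """Attach a Druid query-context priority based on the queried time span.

    The thresholds and priority values are illustrative assumptions.
    """
    span = end - start
    if span <= timedelta(days=1):
        priority = 10    # expected to be light and fast
    elif span <= timedelta(days=7):
        priority = 0
    else:
        priority = -10   # expected to be heavy; let it queue behind lighter queries
    out = dict(query)
    out["intervals"] = [f"{start.isoformat()}/{end.isoformat()}"]
    out["context"] = {**query.get("context", {}), "priority": priority}
    return out

base = {
    "queryType": "timeseries",
    "dataSource": "example_datasource",  # hypothetical name
    "granularity": "hour",
    "aggregations": [{"type": "count", "name": "rows"}],
}
light = with_priority(base, datetime(2017, 9, 14), datetime(2017, 9, 15))
heavy = with_priority(base, datetime(2017, 6, 1), datetime(2017, 9, 15))
print(json.dumps(light["context"]), json.dumps(heavy["context"]))
```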

Thinking Slow

Using Spark on Mesos we launch approximately 3,000 Spark batch jobs (each launches many Mesos tasks, up to ~1,000) every day, which range in duration from a few minutes to a few hours. Our batch jobs are split into two stages: ETL and Indexing. The ETL stage does all of the data manipulation and has a specific JVM footprint that works well for it. Our Indexing jobs have a very different resource footprint, so they are broken into a separate job. To save costs, our Spark cluster was originally running on spot nodes. Spark’s ability to recover from spot market fluctuations is pretty good, but when you’re running as much spot as we do, you tend to dominate markets and can be very sensitive to spot market fluctuations.

A handy thing about running on Mesos is that switching spot resource pools is very easy, requiring just new nodes to launch. Spark automatically picks up the offers from the new nodes and responds appropriately. But still, intermittent failures were becoming too frequent as our spot footprint continued to grow. When a 6 hour batch job took 4 hours to run with minimal failures, regular spot market fluctuations could easily cause that batch workload to fall behind, causing us to scale up even MORE spot usage… thus creating a vicious cycle. Overall there is a need to have a more stable baseline pool of resources with the ability to spike into the spot market as needed, but still be able to do so at a reasonable price point.

The Marriage

The fast work path through Druid has a lot of remnant CPU capacity that goes underutilized, and the slow work path through Spark needs a more stable and predictable baseline. If we put these two together properly, we should be able to evict Spark CPU usage to accommodate spiky Druid query load. In doing so, both Spark (which now has the added resources of the Druid cluster) and Druid (which now has the added resources of the Spark cluster) get more resources!

OS Lineage

Before delving into packing compute nodes with work, it is important to understand some of the underlying technologies and their lineage. One of the challenges with running many things on the same node is figuring out how to get everything on the node which needs to run, and to make sure it doesn’t interfere with other stuff on the node.

There are three options you can go through to solve this. The first option is to have dedicated machines for specific tasks and only install things you need to accomplish the tasks of that machine, plus some way to manage the state of the configurations and versions across your machines. The second option is to try and install everything on all your machines and hope you don’t have any version conflicts – in that case, when one component needs one thing upgraded then all your machines are going through an upgrade process with unknown and potentially unisolated impacts.

The third option is to install as minimal a base system as possible, and have applications bring their own chunks of libraries that can run independently with pre-defined isolation and resource constraints. This is the approach practiced by VMs and some container operating systems. When you view your fleet as a collection of resources instead of a bunch of individual machines, it makes a lot of sense to pursue the third approach.

To accomplish this, you need as minimal an underlying system as possible. Gentoo began in the very early 2000s and had a very interesting capability: using the Portage build system, you could build your system by hand from practically nothing, installing only the exact components desired. Gentoo users carried a stigma because its configuration power allowed people to do things that often either made no sense (-O99) or were sometimes detrimental (-funroll-loops). These kinds of settings or flags permeated many Gentoo installations. Still, like many tools, a distribution that is designed to be built from source and to include only what the user dictates can be very powerful in the right hands. ChromeOS eventually adopted a Portage base with a more enterprise-ready build and configuration system.

ChromeOS spawned two operating systems intended to be used in clusters to launch containers: Container-Optimized OS from Google, and Container Linux by CoreOS. Both of these feature Kubernetes as their key container orchestrator and resource manager. The root image of a CoreOS installation can easily be on the order of 200MB. This means that the entirety of the core of a cluster takes up a footprint about the size of the Java Runtime Environment (and about half the size of the JDK)!

OS	Key Lineage Component
Gentoo	Portage package management; minimalist approach to dependencies; scorched earth recovery
ChromeOS	Scorched earth package and build configuration
Container-Optimized OS	Kubernetes pre-packaged / pre-configured + GCE support
CoreOS	More configurable Kubernetes pre-packaged; flexible image upgrade and configuration methodologies (exact relation to Container-Optimized OS is not clear)
MMX (CoreOS fork / patch set)	Mesos pre-packaged

The table above highlights a few components of various predecessors or those related to the current OS package we use. These are components that help make the current OS we use a fantastic option for cluster container deployment.

We run our own set of packages and configurations for CoreOS. The key difference is that we pre-package Mesos in addition to Kubernetes (but just use Mesos currently). Since most of our services run in the JVM, which acts as a kind of container itself, the history of our deployments is just gzipped tarballs with the jars and configurations packed in. Most of our services also natively use discovery libraries instead of relying on DNS + static port definitions. As such, the need to use Docker is minimal to none. Mesos allows you to simply specify resources to pull down and run in what they call a “Mesos container,” which was a natural fit to the way our services had been running for years. Mesos also has a lot of configurability with regards to how it uses and announces resources, and extensive support for various kinds of modules to extend its functionality.

Mesos additionally has a nice feature where the agents running on a machine are independent of the tasks which the agents run. This means you can make some changes, like upgrading versions, without taking down the services that agent is running. This feature is very handy for stateful services. The persistent volume support in Mesos has also been around longer, and Mesos supports declaring nodes as about to undergo maintenance. All these together make it a bit more mature for certain types of workloads, but nowhere near as easy to set up as a GKE cluster.

So at this point we have a very robust but minimalist system core with a couple of options for how we manage our applications at a cluster level.

Resource Isolation

Around 2006-2007, Google had skipped the virtualization host/guest paradigm and was instead pursuing better resource isolation at the Linux kernel level. In doing so they introduced the concept of control groups. The basic notion of control groups is that you do not need to let every process a kernel sees have access to every resource, but instead would want to have more control over your exposure of system resources. Over the years many of the ways to do resource isolation have gained support in the libcontainer project. We use only the isolations we need, mostly memory isolation and CPU shares isolation. Disk resources are isolated per block device rather than trying to manage the IOPS of any independent block device.

One of the key challenges in the industry is how to expose the functionality of container isolation without the configuration for these restrictions containing more lines than the application itself. Our approach is to build proper isolation into the resource management configuration on each machine. And instead of allowing applications to have infinite flexibility in how they declare resource needs, we have a set of performance expectations settled a priori and simply announce the resources at a concept level, having set up the underlying systems to adhere to the expectations.

As an example, if we have a machine tuned to be able to run Druid, it does not announce general resources. Instead it announces simple resources (cpu / mem / disk) and has *more* underlying configurations not exposed to the service that adhere to the service’s expectations. For example, if I have a MySQL service I need to launch, and I know I want it to be guaranteed on a single NUMA zone, to have local SSD backing the persistent volume, and have at least a certain amount of network bandwidth, I would work with operations to figure out a resource descriptor I could use to guarantee these constraints. Then I would simply launch against the simplified numbers of CPU / Memory / Disk.

As another example, if I had another MySQL database that is rarely used and has very lenient SLOs, I can work with operations to let them know I have relatively light and unimportant disk constraints, and CPU that can be best-effort. I will then get a tag to identify the resources that meet this constraint, and can schedule my tasks against these resources. Then my low-priority MySQL database simply has to worry about CPU / Memory / Disk that is also labeled with my special tag. When resource collections of a type get too low, we try to monitor and make more available.
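The tag-plus-simple-numbers idea can be sketched in a few lines. This is a toy first-fit matcher, not the Mesos or Marathon API: the `Offer` and `Request` types, the tag strings and the host names are all hypothetical and exist only to show that the service asks for CPU / Memory / Disk plus one tag, while the meaning of the tag lives with operations.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class Offer:
    host: str
    cpus: float
    mem_mb: int
    disk_mb: int
    tag: str      # hypothetical label agreed with operations

@dataclass(frozen=True)
class Request:
    cpus: float
    mem_mb: int
    disk_mb: int
    tag: str

def first_fit(request: Request, offers: Sequence[Offer]) -> Optional[Offer]:
    """Return the first offer carrying the requested tag with enough CPU, memory and disk."""
    for offer in offers:
        if (offer.tag == request.tag
                and offer.cpus >= request.cpus
                and offer.mem_mb >= request.mem_mb
                and offer.disk_mb >= request.disk_mb):
            return offer
    return None

offers = [
    Offer("node-a", cpus=16, mem_mb=65536, disk_mb=500_000, tag="single-numa-local-ssd"),
    Offer("node-b", cpus=4, mem_mb=8192, disk_mb=100_000, tag="best-effort"),
]
low_priority_db = Request(cpus=1, mem_mb=2048, disk_mb=20_000, tag="best-effort")
chosen = first_fit(low_priority_db, offers)
print(chosen.host if chosen else "no matching resources")
```

The design choice is that the request never spells out NUMA pinning, storage type or bandwidth; those guarantees are implied by the tag.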

At this point we have a small but robust footprint to launch resources in the cluster, and a way to make those resources discoverable to the services that need them while still maintaining SLOs desired by the services.

Resource Orchestration

In order to manage resources we use Mesos with the coarse-grained backend for Spark, and Marathon for Druid. The coarse-grained backend for Spark allows us to upgrade or modify Spark on a per-job basis without needing to worry about version conflicts. Using Marathon for Druid gives us persistent volume support and upgrade strategies for stateful tasks built in. Spark on Mesos has performed great! The big thing we have to keep track of is what availability zone jobs run in to prevent cross-zone network transfer costs. The biggest problem we have with Mesos and Marathon for Druid is that changing nodes in the cluster (adding or replacing) is still more manual than we would like. All of these use the Mesos containerizer by simply downloading and extracting tarballs into the working directory and running java, which is included among the tarballs.

Results

The Good

In a previous blog post we wrote about using a large pool of compute in AWS.

Above is the change in per-day average CPU utilization across a chunk of our Druid cluster during and after the migration.

Below is the CPU utilization of a single host each minute of a typical day. The plateaus are at approximately 95% utilization.

Even better than higher utilization of our CPU infrastructure, we were able to take the resource pool used for Spark and combine it with the resource pool for Druid, giving BOTH services more CPU power at their disposal!

The Bad

The x1.32xlarge nodes feature four E7 8880 v3 processors. These processors are pretty beefy and have a lot of nice features. Unfortunately we’ve seen some really strange effects from the four socket architecture of the x1.32xlarge nodes. In one particular effect, one socket would be pinned at very high CPU utilization while the others would be much lower. Dynamic asymmetric NUMA zone performance (as opposed to zone-crossing considerations) is something I’m not even sure the Linux kernel scheduler handles in meaningful ways.

Since the exact measurement technique of htop is not obvious without digging into the code, this result was verified against /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu deltas, which confirms this effect is real. Luckily these effects are short-lived but occur commonly enough to be a concern.
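For readers who want to repeat that check, the following is a minimal sketch of turning two samples of the per-CPU counters into utilization figures. It assumes the cgroup v1 file named above, whose fields are cumulative CPU time in nanoseconds. The `mean_by_socket` helper assumes CPUs are numbered contiguously per socket, which is a simplification; real topologies (especially with hyperthreads) often interleave, so check your CPU-to-socket mapping first.

```python
from pathlib import Path

# cgroup v1 path quoted in the post; each field is cumulative CPU time in nanoseconds.
USAGE_PATH = Path("/sys/fs/cgroup/cpuacct/cpuacct.usage_percpu")

def parse_usage_percpu(text):
    """Parse the space-separated per-CPU nanosecond counters."""
    return [int(field) for field in text.split()]

def percpu_utilization(before, after, interval_s):
    """Fraction of the interval each CPU spent busy, from two counter samples."""
    interval_ns = interval_s * 1e9
    return [(b - a) / interval_ns for a, b in zip(before, after)]

def mean_by_socket(utilization, cpus_per_socket):
    """Average utilization per socket, assuming contiguous CPU numbering per socket."""
    return [
        sum(utilization[i:i + cpus_per_socket]) / cpus_per_socket
        for i in range(0, len(utilization), cpus_per_socket)
    ]

# Synthetic samples taken one second apart on a 4-CPU, 2-socket machine.
before = parse_usage_percpu("0 0 0 0")
after = parse_usage_percpu("1000000000 500000000 0 250000000")
utilization = percpu_utilization(before, after, interval_s=1.0)
print(utilization)
print(mean_by_socket(utilization, cpus_per_socket=2))
```

On a real host you would call `parse_usage_percpu(USAGE_PATH.read_text())` twice with a sleep in between instead of using the synthetic strings.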

Looking at the Intel specification for MSRs in the (PDF warning!) System Programmer’s Guide, a key MSR of interest is the IA32_THERM_STATUS register at 0x19C (412 decimal). There is a correlation we found between the Power Throttling being active and this “high cpu %” effect on the socket being power throttled. There are a lot of tweaks available to control the processor state, of which only disabling turbo was attempted, which did not prevent this effect.
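As a rough illustration of inspecting that register, the sketch below decodes a few fields of IA32_THERM_STATUS. The bit positions (bit 10 for power limitation status, bit 11 for its log, bits 22:16 for the digital readout, bit 31 for reading valid) reflect my reading of the Intel manual and should be verified against the documentation for your processor. `read_msr` goes through the Linux msr driver and needs root plus the msr kernel module; it is shown for completeness and not exercised here.

```python
import os
import struct

IA32_THERM_STATUS = 0x19C  # MSR address given in the post

def read_msr(cpu, address):
    """Read a 64-bit MSR through the Linux msr driver (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)

def decode_therm_status(value):
    """Decode selected IA32_THERM_STATUS fields (bit layout assumed; verify against the SDM)."""
    return {
        "thermal_status": bool(value & 0x1),
        "power_limit_active": bool((value >> 10) & 0x1),
        "power_limit_log": bool((value >> 11) & 0x1),
        "digital_readout": (value >> 16) & 0x7F,  # degrees below the throttle temperature
        "reading_valid": bool((value >> 31) & 0x1),
    }

# Synthetic register value: reading valid, readout of 25, power limitation active.
sample = (1 << 31) | (25 << 16) | (1 << 10)
status = decode_therm_status(sample)
print(status)
```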

This random power throttling of cores, the effects of hyperthreads, and the heterogeneous nature of CPU load / availability / impact in a cluster-containerized architecture mean CPU % is one of the least useful and most commonly misused metrics for a machine’s performance. What this means is that even though our average CPU utilization as measured by the guest OS is approximately 65%, getting to 100% (if that is even possible) will not yield a 50% increase in throughput on the same hardware. Even though CPU time is expensive, blindly aiming for 100% utilization will not yield the results you desire. The only helpful number for CPU % is 0%, which tells you either your metrics reporting is broken or there’s absolutely nothing going on. The way I read the CPU utilization graphs above is as follows: the weekly cycles are gone and we no longer dip down to near-0 utilization for extended periods.

The proper monitoring of knock-on effects to CPU (or other system resource) usage in a containerized environment is not something common in the industry.

The Ugly

For a portion of our cluster known externally as “Druid Basic,” we have significantly more page cache churn compared to our “Druid Ultra” offering. The “Druid Ultra” offering performed well most of the time, but the “Druid Basic” offering had significant problems when tested on this architecture.

The most common problem was page allocation failures from the Linux kernel. There are monotonically increasing counters for memory pressure warnings that surface for the Mesos executors. Below is a time series (by hour) snapshot of what the warnings look like before a page allocation failure on a specific node (the vertical lines are days). The “spiky” nature of part of the graph is due to the lack of deduplication in this specific metric pipeline, and taking the “max” of the rolled up values. So if an event is sent twice in a given hour due to a small network blip, it will roll up to twice the “nominal” value. Orange and blue indicate Spark vs Druid data. The Druid tasks are generally longer lived, so their monotonically increasing counters reach higher values.

We tried cgroup memory isolation enabled and disabled but still encountered issues. We did not attempt to use cpu sets and memory controller affinity.

One of the page allocation failures we encountered during migration.

To try to eliminate memory issues we experimented with zone reclaim modes, disabling NUMA at boot, and a few settings to try to prevent fragmentation at the DMA level for the AWS ENA driver. Under NUMA architecture, it is possible that different NUMA nodes have different amounts of available memory. For example, by running numactl -H on the x1.32xlarge nodes, we found that the amount of free memory differs considerably among the NUMA nodes.
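A small sketch of that comparison: the snippet pulls the per-node "free" figures out of `numactl -H` style output and reports the node with the least free memory. The sample text and its numbers are invented for illustration and are not measurements from our nodes.

```python
import re

# Matches lines such as "node 0 free: 1024 MB" in `numactl -H` output.
FREE_LINE = re.compile(r"^node (\d+) free: (\d+) MB", re.MULTILINE)

def free_mb_by_node(numactl_output):
    """Map NUMA node id to free memory in MB."""
    return {int(node): int(mb) for node, mb in FREE_LINE.findall(numactl_output)}

# Illustrative output only; the numbers are made up.
sample = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 491520 MB
node 0 free: 1024 MB
node 1 cpus: 4 5 6 7
node 1 size: 491520 MB
node 1 free: 180000 MB
"""
free = free_mb_by_node(sample)
tightest = min(free, key=free.get)
print(free, "least free memory on node", tightest)
```

On a live host the input could come from `subprocess.run(["numactl", "-H"], capture_output=True, text=True).stdout`.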

We found that in most of our cases, if not all, it was Node 0 that hit the page allocation failure. By default, vm.zone_reclaim_mode was set to 0, which means that no zone reclaim happens and the allocation has to be satisfied from other nodes. We tried setting vm.zone_reclaim_mode to 1 so that cached memory can be reclaimed whenever there is a memory shortage on the node, and increased the value of vm.min_free_kbytes in the hope that a larger minimum amount of free memory plus zone reclamation would prevent the page allocation failures.

Unfortunately, enabling zone reclamation and increasing minimal free memory didn’t help. Another theory we had was that high network activity led to a shortage of contiguous DMA pages. The ENA driver uses DMA buffers for data transfer. Because there are multiple Druid historical nodes actively loading/unloading segments and processing queries, in addition to multiple Spark jobs doing shuffles and so on, lots of data transfers happen on the node. We attempted to prevent too much fragmentation at the DMA level by increasing DMA protection with vm.lowmem_reserve_ratio. The default value of vm.lowmem_reserve_ratio is 256 256 32, where each number is the reciprocal of the ratio of pages to protect in each zone. So we decided to lower the numbers to give more protection in each DMA zone. Unfortunately, this didn’t prevent the page allocation failure from happening.
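Because the setting is a reciprocal, lowering a number raises the protection. A worked sketch, following the formula in the kernel's sysctl documentation (pages reserved in zone i against allocations that could have used zone j are the managed pages of zones i+1 through j divided by the ratio for zone i): the zone sizes below and the lowered ratio of 64 are hypothetical, since the values we actually tried are not listed here.

```python
def lowmem_reserve(managed_pages, ratios, i, j):
    """Pages of zone i kept back from allocations that could have used zone j (i < j)."""
    if i >= j:
        return 0
    return sum(managed_pages[i + 1 : j + 1]) // ratios[i]

# Hypothetical managed page counts for the DMA, DMA32 and Normal zones.
managed_pages = [4_000, 500_000, 8_000_000]

default_ratios = [256, 256, 32]   # default quoted in the post
lowered_ratios = [64, 64, 32]     # illustrative lower values

dma_default = lowmem_reserve(managed_pages, default_ratios, i=0, j=2)
dma_lowered = lowmem_reserve(managed_pages, lowered_ratios, i=0, j=2)
print(dma_default, dma_lowered)
```

With these made-up sizes, dropping the ratio from 256 to 64 reserves roughly four times as many pages in the DMA zone.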

Finally, we tried disabling NUMA at boot to see if that helped with the situation. However, that didn’t help either.

The above is a snippet from the status of memory from each zone when the page allocation failure happened while NUMA was disabled. As you can see, there were lots of low-order pages but none greater than 64kB.
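The same per-order free counts that the kernel prints in an allocation-failure report are also exposed in /proc/buddyinfo, which makes fragmentation easy to watch before a failure happens. Below is a minimal parser sketch; it assumes 4 kB base pages, and the sample line is fabricated to mirror the situation described (free blocks only up to order 4, i.e. 64 kB).

```python
PAGE_KB = 4  # assumes 4 kB base pages

def parse_buddyinfo_line(line):
    """Split one /proc/buddyinfo line into (node, zone, free block counts by order)."""
    fields = line.split()
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    counts = [int(count) for count in fields[4:]]
    return node, zone, counts

def largest_free_block_kb(counts):
    """Size in kB of the largest order that still has a free block, or 0 if none."""
    orders = [order for order, count in enumerate(counts) if count > 0]
    return PAGE_KB * 2 ** max(orders) if orders else 0

# Fabricated sample: plenty of small blocks, nothing above order 4.
sample = "Node 0, zone   Normal  51234  20411   8120    932     17      0      0      0      0      0      0"
node, zone, counts = parse_buddyinfo_line(sample)
print(node, zone, largest_free_block_kb(counts), "kB")
```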

There are also numerous network socket memory tunings available which we did not tweak during this testing.

When paging in EBS volume data for Druid while doing background compute for Spark (which also uses page cache) we commonly encountered a nasty kernel bug. The state of the node after encountering the bug varied, but always ended up needing to be terminated. Try as we might, we have not been able to reproduce this bug in any synthetic environment, but can reproduce it with great regularity under certain production workloads.

As a final hiccup, our initial setups had RAID configurations for EBS volumes to give one giant disk view through the operating system, and Mesos would carve up the resources. This led to poorly tuned RAID setups that greatly harmed performance. Instead of trying to do detailed tuning of RAID configurations, we resolved this by removing the RAID layer from our setup, instead announcing multiple disk resource paths through Mesos. We had also observed high disk usage correlated with page allocation failures on x1.32xlarge instances, which fueled speculation that RAID contributed to the allocation failure root cause.

We ended up not using the x1.32xlarge nodes, opting for nodes with fewer NUMA zones, and removed any sort of RAID on the disks. This has effectively eliminated the problems with the kernel memory management. Now the nodes are humming along day in and day out at a pretty good pace, and we’re happy with the state of running Druid and Spark mixed workloads on the setup and configuration we’ve found.

Behind the Scenes of our Transition to a Multi-Cloud Environment

Himadri Singh — Thu, 07 Sep 2017 17:58:23 +0000

Service uptime is the performance metric that determines operational success, and when something fails, the impact can be far-reaching, often affecting a business’s bottom line. One of the downsides of running infrastructure in a public cloud is that we are dependent on the SLAs provided by our cloud providers. As a startup, we have been upgrading our systems to become a lot more fault-tolerant, but since our cloud infrastructure footprint is restricted to one region, and the oldest region of AWS at that, we are vulnerable to cloud service blackouts or brownouts.

The most prominent solution offered by most cloud providers is to distribute the workload over multiple regions for high availability. When we measured the effort, we found that the cost and man-hours required to make our service multi-regional were no less than those required to make our infrastructure multi-cloud. A multi-cloud environment allows us to combine the elasticity and economic benefits of two different cloud providers.

Solution

Google and AWS both provide VPN solutions that can securely connect our VPC (Virtual Private Cloud) networks in different cloud infrastructures through IPsec VPN connections, extending the private network across the public network. Traffic traveling between the two networks is encrypted by one VPN gateway, then decrypted by the other VPN gateway. This protects our data as it travels over the internet, but VPN has its own limitations.

With our own product claiming sub-second latencies, inter-cloud connectivity plays a critical role. Our dashboards demand predictable and fast request-response cycles from our resources, and these perform best when network latency remains consistent and low. We have services with consistently high demand for data throughput as well as latency-sensitive applications, all of which require a reliable and consistent network, but latency over the internet varies because the internet is constantly changing. VPN services can provide the connectivity but fail to satisfy our requirements for consistent, performant and reliable network connectivity.

We chose to use AWS (Amazon Web Services) Direct Connect and GCP (Google Cloud Platform) InterConnect instead of establishing a VPN connection over the internet, avoiding the need for VPN hardware that frequently can’t support data transfer rates above a few Gbps. With Direct Connect and InterConnect, data that would previously have been transported over the internet is delivered through a private network connection with BGP failover capabilities. This helps us achieve higher availability and lower latency between the clouds. We choose which data uses the dedicated connection and how that data is routed, which provides a more consistent network experience than internet-based connections. Compared with slower VPN circuits, the private network also reduces costs and increases bandwidth.

These solutions are designed to connect to on-premises hardware. Since we only have a presence in public clouds, we have neither the experience nor the will to manage physical network hardware. The hardware can be costly and comes with time-consuming maintenance processes that require an experienced resource dedicated to managing it, which would add to our network management costs and ops-team requirements.

With the help of Google Professional Services and Equinix, we were introduced to Synoptek. They offer a Managed Performance Hub solution, which combines access to world-class data centers, the highest bandwidth connectivity available for private cloud connections, and highly rated management service from Synoptek. None of the hardware required to connect two major cloud providers together would have to be purchased or managed by Metamarkets. Instead, Synoptek owned and operated the entire solution. 10 Gbps links were drawn to AWS and GCP which were connected through Synoptek routers running in one of the Equinix Datacenters.

AWS Direct Connect makes it easy to establish a dedicated network connection from the Synoptek Performance Hub to AWS. Google Cloud Platform uses Cloud Interconnect to establish enterprise-grade connections with higher availability and/or lower latency. Using industry-standard 802.1q VLANs, this dedicated connection can be partitioned into multiple virtual interfaces. This provided us with a private, high-bandwidth network connection between our network and our VPC. With multiple virtual interfaces, we can even establish private connectivity to multiple VPCs while maintaining network isolation, so we were able to include a test VPC for evaluation purposes.

Performance

Transferring large data sets over the internet can be time consuming and expensive. With a private network running over a dedicated leased line, we can transfer our business-critical data directly between the two cloud environments, bypassing the internet and avoiding network congestion.

The Synoptek Performance Hub provided a latency of a few milliseconds for the multi-cloud solution. With an RTT (round-trip time) of less than 20 ms, latency stays consistent even at high throughput on the 10 Gbps connections. Using simple parallel iperf3 tests, we were able to validate the claims from the cloud providers and achieve 10 Gbps on the links.
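A parallel iperf3 run of the kind mentioned above could be scripted roughly as follows. This is only a sketch: the stream count, the duration and the helper names are assumptions chosen for illustration, not the settings used in these tests. It relies on iperf3's -P (parallel streams) and -J (JSON report) options and assumes an iperf3 server is already listening on the far side of the link.

```python
import json
import subprocess


def iperf3_command(server, streams=8, seconds=30):
    """Build an iperf3 client command that runs several parallel TCP streams."""
    return [
        "iperf3",
        "-c", server,          # connect to the iperf3 server in the other cloud
        "-P", str(streams),    # number of parallel client streams
        "-t", str(seconds),    # test duration in seconds
        "-J",                  # emit the final report as JSON
    ]


def received_gbps(report):
    """Extract the aggregate received throughput, in Gbps, from a JSON report."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def run_test(server, streams=8, seconds=30):
    """Run the test and return the aggregate throughput in Gbps."""
    result = subprocess.run(
        iperf3_command(server, streams, seconds),
        capture_output=True, text=True, check=True,
    )
    return received_gbps(json.loads(result.stdout))
```

Several streams are used because a single TCP stream often cannot fill a 10 Gbps path on its own; summing the streams gives a better picture of what the link can carry.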

The above graphs are from the Synoptek Logic Monitor for the network devices we have added. We were able to achieve 10 Gbps throughput on both links when pushing data from GCP to AWS (shown in green), but only around 10 Gbps in total when pushing data from AWS to GCP, as it is unclear whether AWS supports eBGP load balancing on its end.

Costs

We operate in a multi-cloud environment with bandwidth-heavy workloads that run over the connection between the clouds, so Direct Connect + InterConnect reduces our network costs into and out of the clouds. All data transferred over the dedicated connection is charged at the reduced AWS Direct Connect data transfer rate rather than internet data egress transfer rates.

Google Cloud Platform also offers discounted pricing for Cloud Platform traffic egressing through Cloud Interconnect links.

With simple pay-as-you-go pricing and no minimum commitment, we pay only for the network ports we use and the data we transfer over the connection, which can greatly reduce our networking costs for both AWS and GCP. The combination of Synoptek working with AWS, Google, and Equinix provided the reliability and performance we needed at a lower cost than owning and managing on-premises hardware.

Without extra configuration, traffic to and from public resources such as Amazon S3 is still routed over the internet, which incurs higher internet egress costs. We architected a redundant Squid proxy solution running in a private address space to route all S3 traffic through the Synoptek Performance Hub.
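For the proxy to have any effect, clients have to be pointed at it explicitly. A minimal sketch of what that looks like on the client side, using only the Python standard library, is shown here. The proxy address is hypothetical, and 3128 is simply Squid's default listening port; the real deployment sits behind redundant proxies whose addresses are not given in this post.

```python
import urllib.request

# Hypothetical private address of the Squid proxy (3128 is Squid's default port).
PROXY_URL = "http://10.0.0.10:3128"


def build_proxied_opener(proxy_url=PROXY_URL):
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Requests made with the returned opener's open() method then leave through the proxy's private address, and therefore over the dedicated links, instead of going straight out to the internet.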

Redundancy

Each physical link consists of a single dedicated connection between ports on the Synoptek Performance Hub and the Direct Connect router on AWS or the InterConnect Cloud Router on GCP. We also established a second connection to provide the required redundancy. When you request multiple ports at the same AWS Direct Connect location, they will be provisioned on redundant Amazon routers. We have multiple cloud routers on GCP providing the redundancy and distributing the network bandwidth.

With two kinds of VPN connections, static and BGP, we provide one more level of redundancy on top of the existing network connectivity. Each of these VPN connections has dual VPN tunnels to achieve better throughput. If InterConnect/Direct Connect goes down, we can fail over to the BGP VPN connections, which can in turn fail over to the static VPN if required. A total network blackout would require failures at four levels of connectivity.
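The fallback order described above can be sketched as a simple priority list. The path names are hypothetical, and counting the two dedicated links as the first two of the four levels is an assumption made for this sketch; in practice the routers make this decision through BGP and static routes, not application code.

```python
# Paths in order of preference. Treating the two dedicated links as the first
# two levels is an assumption made for this sketch.
PATH_PRIORITY = [
    "dedicated-primary",
    "dedicated-secondary",
    "vpn-bgp",
    "vpn-static",
]


def active_path(healthy):
    """Return the most preferred healthy path, or None on a total blackout."""
    for path in PATH_PRIORITY:
        if healthy.get(path, False):
            return path
    return None
```

With both dedicated links marked unhealthy, active_path returns "vpn-bgp"; it returns None only when all four entries are down.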

HA Failover

Since we have established a redundant connection, traffic will fail over to the second link automatically. We have also enabled Bidirectional Forwarding Detection (BFD) when configuring the connections to ensure fast detection and failover. We have also configured a backup IPsec VPN connection in case both dedicated connections fail, at which point all VPC traffic would fail over to the VPN (BGP) connection automatically.
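To see why BFD matters for failover speed, it helps to compare detection times. BFD declares a neighbor down after a set number of consecutive control packets go missing, so its detection time is roughly the packet interval multiplied by the detect multiplier; without BFD, BGP has to wait for its hold timer to expire. The numbers below are illustrative assumptions, not the values configured in this setup.

```python
def bfd_detection_time_ms(interval_ms, multiplier):
    """Time to declare a peer down: the interval times the allowed missed packets."""
    return interval_ms * multiplier


# Illustrative values only, not taken from the setup described here.
bfd_ms = bfd_detection_time_ms(interval_ms=300, multiplier=3)
bgp_hold_ms = 90 * 1000  # a 90-second BGP hold timer, expressed in milliseconds
speedup = bgp_hold_ms // bfd_ms
```

Under these assumed values, a link failure is detected in under a second with BFD instead of after a minute and a half.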

The above graph shows the ping latency variation during the failover tests. There are blips where the dedicated connection failed over to the other node and reconnected. Things got far worse when both redundant dedicated connections were taken down and traffic had to fail over to the VPN, which caused a series of erratic ping latencies. Once the connections were restored, traffic transitioned back smoothly. There was no connectivity loss during the failover tests.

Monitoring

Synoptek provided the Logic Monitor tool to monitor, evaluate and manage the hybrid solution. A number of metrics are provided by each of the monitoring tools:

Bandwidth Throughput
Packet Drops
Packets transferred
Bps

AWS has also added new CloudWatch metrics for Direct Connect monitoring.

We have created CloudWatch alarms for various conditions:

If any of the Direct Connect connections is down.
If any of the Direct Connect VIFs (Virtual Interfaces) is down.
If Direct Connect is receiving CRC errors.
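As an illustration of the first condition, the arguments for a CloudWatch alarm on a Direct Connect connection might be built like this. It is a sketch under assumptions: the alarm name, period, and the choice to treat missing data as breaching are picked for the example and are not the settings used here. AWS/DX and ConnectionState are the namespace and metric that Direct Connect publishes to CloudWatch, with ConnectionState reported as 1 for up and 0 for down.

```python
def connection_down_alarm(connection_id, sns_topic_arn):
    """Build put_metric_alarm arguments for a Direct Connect connection going down."""
    return {
        "AlarmName": f"dx-{connection_id}-down",   # hypothetical naming scheme
        "Namespace": "AWS/DX",
        "MetricName": "ConnectionState",           # 1 when up, 0 when down
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": "Minimum",
        "Period": 300,                             # seconds; assumed for this sketch
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",           # assumption: no data counts as down
        "AlarmActions": [sns_topic_arn],
    }
```

The resulting dictionary can be passed as keyword arguments to the put_metric_alarm call of a boto3 CloudWatch client. An alarm for CRC errors would follow the same shape with the ConnectionCRCErrorCount metric and a greater-than comparison.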

Stackdriver also provides a number of metrics for Interconnect and Interconnect Attachments, and allows us to create alerting policies around them.

We also wanted to get alerted if there was any traffic through our VPN connections for whatever reason. We deployed the AWS VPN monitoring solution (https://aws.amazon.com/answers/networking/vpn-monitor/) with CloudWatch alarms for:

If any of the VPN tunnels is down.
If any of the VPN tunnels is receiving or sending data.

GCP Stackdriver provides dropped-packet metrics for the VPN connections, along with their status and bytes sent/received. This has been very helpful for understanding the erratic behavior of the VPN. When we started to saturate the connections, we saw an increase in dropped packets.

Conclusion

The dedicated connections from AWS Direct Connect + GCP InterConnect have certainly improved the network quality between the clouds, supporting our high-throughput and latency-sensitive applications. We were able to deploy a scalable, maintainable and reliable network connectivity solution, helping us become multi-cloud within months.