Behind the Scenes of our Transition to a Multi-Cloud Environment
September 7th, 2017 Himadri Singh
Service uptime is the performance metric that determines operational success, and when something fails, the impact can be far-reaching, often affecting a business’s bottom line. One of the downsides of running infrastructure in a public cloud is that we depend on the SLAs provided by our cloud providers. As a startup, we have been upgrading our systems to become much more fault-tolerant, but since our cloud infrastructure footprint is restricted to one region, and the oldest AWS region at that, we are vulnerable to cloud service blackouts and brownouts.
The most prominent solution offered by most cloud providers is to distribute the workload over multiple regions for high availability. When we measured the effort, cost and man-hours required, making our service multi-regional turned out to be no less work than making our infrastructure multi-cloud. A multi-cloud environment also allows us to combine the elasticity and economic benefits of two different cloud providers.
Google & AWS both provide VPN solutions that can securely connect our VPC (Virtual Private Cloud) networks in different cloud infrastructures through IPsec VPN connections, extending the private network across the public network. Traffic traveling between the two networks is encrypted by one VPN gateway, then decrypted by the other VPN gateway. This protects our data as it travels over the internet, but VPN has its own limitations.
With our own product claiming sub-second latencies, inter-cloud connectivity plays a critical role. Our dashboards demand predictable and fast request-response cycles from our resources, and they perform best when network latency remains consistently low. We have services with consistently high demand for data throughput as well as latency-sensitive applications, both of which require a reliable and consistent network, but latency over the internet varies because the internet itself is constantly changing. VPN services can provide the connectivity but fail to satisfy our requirements for consistent, performant and reliable network connectivity.
We chose to use AWS (Amazon Web Services) Direct Connect & GCP (Google Cloud Platform) Interconnect instead of establishing a VPN connection over the internet, avoiding the need for VPN hardware that frequently can’t support data transfer rates above a few Gbps. Using AWS Direct Connect and GCP Interconnect, data that would previously have been transported over the internet can now be delivered through a private network connection with BGP failover capabilities. This helps us achieve higher availability and lower latency between the clouds. With these solutions, we choose which data uses the dedicated connection and how that data is routed, which provides a more consistent network experience than internet-based connections. Compared with slower VPN circuits, the private network also reduces costs and increases bandwidth.
These solutions are designed to connect to on-premises hardware. Since we only have a presence in public clouds, we have neither the experience nor the will to manage physical network hardware. The hardware can be costly and involves time-consuming maintenance, requiring an experienced engineer to manage just that hardware, which would add to our network management costs and ops-team requirements.
With the help of Google Professional Services and Equinix, we were introduced to Synoptek. They offer a Managed Performance Hub solution, which combines access to world-class data centers, the highest-bandwidth connectivity available for private cloud connections, and a highly rated management service from Synoptek. None of the hardware required to connect the two major cloud providers had to be purchased or managed by Metamarkets. Instead, Synoptek owns and operates the entire solution. 10 Gbps links were run to AWS and to GCP, and the two were connected through Synoptek routers running in one of the Equinix data centers.
AWS Direct Connect makes it easy to establish a dedicated network connection from the Synoptek Performance Hub to AWS, and Google Cloud Platform uses Cloud Interconnect to establish enterprise-grade connections with higher availability and/or lower latency. Using industry-standard 802.1q VLANs, each dedicated connection can be partitioned into multiple virtual interfaces. This gave us a private, high-bandwidth network connection between the Performance Hub and our VPCs. With multiple virtual interfaces, we can even establish private connectivity to multiple VPCs while maintaining network isolation, which let us include a test VPC for evaluation purposes.
Transferring large data sets over the internet can be time-consuming and expensive. With a private network running over a dedicated leased line, we can transfer our business-critical data directly between the two cloud environments, bypassing the public internet and avoiding its congestion.
The Synoptek Performance Hub provided latency of a few milliseconds for the multi-cloud solution. With an RTT (round-trip time) of less than 20 ms, the latency is consistent across the high-bandwidth 10 Gbps connections. Using simple parallel iperf3 tests, we were able to validate the claims from the cloud providers and achieve 10 Gbps on the links.
The above graphs are from the Synoptek Logic Monitor for the network devices we have added. We achieved 10 Gbps of throughput on each of the two links when pushing data from GCP to AWS (shown in green), but only around 10 Gbps in total when pushing data from AWS to GCP; it is unclear whether AWS supports eBGP-based load balancing across the links at their end.
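To illustrate how the results of parallel iperf3 runs can be summed and compared with a link's nominal capacity, here is a minimal Python sketch. It assumes each run was captured with iperf3's JSON output (`-J`) and that the receiver-side rate sits under `end.sum_received.bits_per_second`. The function names and the sample reports are hypothetical and are not our measurements.

```python
import json
from typing import Iterable


def received_gbps(report: dict) -> float:
    """Receiver-side throughput of one iperf3 JSON report, in Gbps."""
    # Assumption: the report came from `iperf3 -J` and exposes the receiver
    # total under end.sum_received.bits_per_second.
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def aggregate_gbps(reports: Iterable[dict]) -> float:
    """Sum the throughput of several iperf3 runs that ran at the same time."""
    return sum(received_gbps(r) for r in reports)


def utilization(reports: Iterable[dict], capacity_gbps: float = 10.0) -> float:
    """Fraction of the nominal link capacity filled by the parallel runs."""
    return aggregate_gbps(reports) / capacity_gbps


# Made-up reports standing in for four parallel client runs.
sample_reports = [
    json.loads('{"end": {"sum_received": {"bits_per_second": 2.35e9}}}')
    for _ in range(4)
]

if __name__ == "__main__":
    print(f"aggregate: {aggregate_gbps(sample_reports):.2f} Gbps")
    print(f"utilization: {utilization(sample_reports):.0%}")
```

Summing receiver-side rates rather than sender-side rates counts only the data that actually arrived across the link.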
Because we operate in a multi-cloud environment with bandwidth-heavy workloads running over the link between the clouds, Direct Connect + Interconnect reduces our network costs into and out of the clouds. All data transferred over the dedicated connection is charged at the reduced AWS Direct Connect data transfer rate rather than at internet egress rates.
Google Cloud Platform also offers discounted pricing for Cloud Platform traffic egressing through Cloud Interconnect links.
With simple pay-as-you-go pricing and no minimum commitment, we pay only for the network ports we use and the data we transfer over the connection, which can greatly reduce our networking costs for both AWS and GCP. The combination of Synoptek working with AWS, Google, and Equinix provided the reliability and performance we needed at a lower cost than owning and managing on-premises hardware.
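The trade-off between a per-port fee plus a reduced transfer rate and plain internet egress can be sketched as a small break-even calculation. The Python below is only an illustration: the function names are hypothetical, and the rates in the example are placeholders, not published AWS or GCP prices.

```python
from typing import Optional


def internet_egress_cost(gb: float, internet_rate_per_gb: float) -> float:
    """Cost of sending `gb` gigabytes out over the public internet."""
    return gb * internet_rate_per_gb


def dedicated_link_cost(gb: float, dedicated_rate_per_gb: float,
                        port_hourly_fee: float, hours: float,
                        ports: int = 1) -> float:
    """Cost of sending `gb` gigabytes over dedicated ports billed hourly."""
    return gb * dedicated_rate_per_gb + port_hourly_fee * hours * ports


def break_even_gb(internet_rate_per_gb: float, dedicated_rate_per_gb: float,
                  port_hourly_fee: float, hours: float,
                  ports: int = 1) -> Optional[float]:
    """Transfer volume above which the dedicated link is the cheaper option.

    Returns None when the dedicated rate is not lower than the internet rate,
    because the port fees can then never be recovered.
    """
    saving_per_gb = internet_rate_per_gb - dedicated_rate_per_gb
    if saving_per_gb <= 0:
        return None
    return port_hourly_fee * hours * ports / saving_per_gb


if __name__ == "__main__":
    # Placeholder rates for illustration only (not real prices).
    volume = break_even_gb(0.10, 0.02, port_hourly_fee=2.0, hours=720, ports=2)
    print(f"break-even volume: {volume:,.0f} GB per month")
```

For bandwidth-heavy workloads the monthly volume sits well above such a break-even point, which is where the reduced per-GB rate starts to dominate the fixed port fees.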
Without extra configuration, traffic to and from public resources such as Amazon S3 is still routed over the internet, which incurs higher internet egress costs. We architected a redundant Squid proxy solution running in a private address space to route all S3 traffic through the Synoptek Performance Hub.
Each physical link consists of a single dedicated connection between a port on the Synoptek Performance Hub and either the Direct Connect router on AWS or the Interconnect Cloud Router on GCP. We also established a second connection to provide the required redundancy. When multiple ports are requested at the same AWS Direct Connect location, they are provisioned on redundant Amazon routers. On GCP, we have multiple cloud routers providing redundancy and distributing the network bandwidth.
With two kinds of VPN connections, static and BGP, we add one more level of redundancy to the existing network connectivity. Each of these VPN connections has dual VPN tunnels to achieve better throughput. If Interconnect/Direct Connect goes down, we can fail over to the BGP VPN connections, which can in turn fail over to the static VPN if required. A total network blackout would require failures at all four levels of connectivity.
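The failover order described above can be modeled as a simple priority list. The Python sketch below only illustrates the ordering: in practice the routers make this choice through BGP and static route preference, not application code, and the path names are hypothetical labels.

```python
from typing import Dict, Optional

# Paths in order of preference, following the failover order in the text:
# two redundant dedicated links, then the BGP VPN, then the static VPN.
PATH_PRIORITY = [
    "dedicated-primary",
    "dedicated-secondary",
    "vpn-bgp",
    "vpn-static",
]


def select_path(health: Dict[str, bool]) -> Optional[str]:
    """Return the most preferred healthy path, or None on a total blackout."""
    for path in PATH_PRIORITY:
        # A path missing from the health map is treated as down.
        if health.get(path, False):
            return path
    return None


if __name__ == "__main__":
    both_dedicated_down = {
        "dedicated-primary": False,
        "dedicated-secondary": False,
        "vpn-bgp": True,
        "vpn-static": True,
    }
    print(select_path(both_dedicated_down))
```

Only when every entry in the list is unhealthy does the function return `None`, which corresponds to all four levels failing at once.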
Since we have established a redundant connection, traffic fails over to the second link automatically. We also enabled Bidirectional Forwarding Detection (BFD) when configuring the connections to ensure fast detection and failover. In addition, we configured a backup IPsec VPN connection in case both dedicated connections fail, at which point all VPC traffic fails over to the VPN (BGP) connection automatically.
The above graph shows the ping latency variation during the failover tests. There are blips where the dedicated connection failed over to the other node and then reconnected. Things got far worse when both redundant dedicated connections were taken down and traffic had to fail over to the VPN, which caused a series of erratic ping latencies. Once the connections were restored, traffic transitioned back smoothly. There was no connectivity loss during the failover tests.
Synoptek provided the Logic Monitor tool to monitor, evaluate and manage the hybrid solution. Each of the monitoring tools provides a number of metrics:
- Bandwidth Throughput
- Packet Drops
- Packets transferred
AWS has also added new CloudWatch metrics for Direct Connect monitoring.
We have CloudWatch alarms created for various conditions:
- Any of the Direct Connect connections is down.
- Any of the Direct Connect VIFs (Virtual Interfaces) is down.
- Direct Connect is receiving CRC errors.
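As an illustration of what such alarm definitions might look like, the Python sketch below builds the parameter sets for a connection-down alarm and a CRC-error alarm. It assumes the `AWS/DX` CloudWatch namespace with `ConnectionState` (1 when up, 0 when down) and `ConnectionCRCErrorCount` metrics keyed by a `ConnectionId` dimension; the alarm names, the connection ID and the periods are hypothetical, and this is not our actual configuration. The sketch covers only the connection-state and CRC conditions.

```python
from typing import Any, Dict, List


def dx_alarm(name: str, metric: str, connection_id: str, statistic: str,
             comparison: str, threshold: float,
             missing: str) -> Dict[str, Any]:
    """Build keyword arguments for one CloudWatch alarm on a DX connection."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/DX",  # assumed Direct Connect namespace
        "MetricName": metric,
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": statistic,
        "Period": 300,  # seconds; assumed five-minute granularity
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "TreatMissingData": missing,
    }


def build_alarms(connection_id: str) -> List[Dict[str, Any]]:
    """Alarm definitions for one connection: link down, and CRC errors seen."""
    return [
        # ConnectionState is assumed to report 1 (up) or 0 (down); missing
        # data is treated as breaching so a silent link also raises the alarm.
        dx_alarm(f"dx-{connection_id}-down", "ConnectionState", connection_id,
                 "Minimum", "LessThanThreshold", 1, "breaching"),
        dx_alarm(f"dx-{connection_id}-crc-errors", "ConnectionCRCErrorCount",
                 connection_id, "Sum", "GreaterThanThreshold", 0,
                 "notBreaching"),
    ]


if __name__ == "__main__":
    # "dxcon-example" is a placeholder ID. With boto3 installed and
    # credentials configured, each dict could be passed on as
    # boto3.client("cloudwatch").put_metric_alarm(**params).
    for params in build_alarms("dxcon-example"):
        print(params["AlarmName"], params["MetricName"])
```

Keeping the definitions as plain dictionaries makes them easy to review and to generate per connection before anything is sent to CloudWatch.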
Stackdriver also provides a number of metrics for Interconnect and Interconnect Attachments and allows us to create alerting policies around them.
We also wanted to be alerted if any traffic went through our VPN connections for whatever reason. We deployed the AWS VPN monitoring solution (https://aws.amazon.com/answers/networking/vpn-monitor/) with CloudWatch alarms for:
- Any of the VPN tunnels is down.
- VPN tunnels are receiving or sending data.
GCP Stackdriver provides dropped-packet metrics for the VPN connections along with their status and bytes sent/received. This has been very helpful in understanding the erratic behavior of the VPN. When we started to saturate the connections, we saw an increase in dropped packets.
The dedicated connections from AWS Direct Connect + GCP Interconnect have certainly improved the network quality between the clouds, supporting our high-throughput and latency-sensitive applications. We were able to deploy a scalable, maintainable and reliable network connectivity solution, which helped us become multi-cloud within months.