Going Multi-Cloud with AWS and GCP: Lessons Learned at Scale
August 15th, 2017 Charles Allen
Metamarkets handles a lot of data. The torrent of data that clients send to us surpasses a petabyte a week. At this scale, the ability to failover gracefully, to detect and eliminate brownouts, and to efficiently operate huge quantities of byte-banging machines is necessary.
We started and grew Metamarkets in AWS’s us-east region. And the majority of our footprint was in a single availability zone (AZ). As we grew, we started to see the side effects of being restricted to one AZ, then the side effects of being restricted to one region. It’s kind of like inflating a balloon in a porcupine farm, where you know it is a bad idea, but while you’re trying to figure out where to start inflating a new balloon, the prior one keeps on filling up with more air!
As we investigated growth strategies outside of a single AZ, we realized a lot of the infrastructure changes we needed to make to accommodate multiple availability zones were the same changes we would need to make to accommodate multiple clouds. After some looking around, we decided the Google Cloud Platform was potentially a very good fit to the way Metamarkets’ business and teams operate and the way some forces in the infrastructure industry are trending.
This post will cover some of the pragmatic differences we have experienced between AWS and GCP as cloud providers as of 2017. Some of the comparisons will be listed as unfair comparisons. In these instances we believe the level of service Metamarkets has subscribed to is different between the two cloud providers. In the interest of transparency, our primary operations in AWS are in us-east, which is the oldest AWS region and subject to a lot of cloud legacy both in users and (I suspect) in internal hardware and design.
Our Key Use Cases
While the rest of this post centers on some of the higher level considerations of GCP and AWS, it is worth calling out the key use cases Metamarkets uses each cloud for. During part of our initial investigations, node spin-up time on GCP was so fast that we found race conditions in our cluster management software. The distributed load-balancer intake methodologies GCP employs also mean clients often hop on the GCP network very close to their point of presence. This, combined with the per-minute pricing, means GCP is a natural choice for things which scale up and down regularly relating to real-time data. Metamarkets runs all of our real-time components (for which we have our own throughput-based autoscaling) on GCP. For AWS, we have a large pool of compute resources running at high CPU utilization which also dip into spot market resources as needed. These compute resources in AWS use a combination of local disk and various EBS attached volume types depending on the SLO of the services using them. The workloads commonly run on our AWS instances receive instructions for a chunk of local computations that need to be performed, then the aggregated results are sent back or shuffled to other nodes in the cluster.
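To make "throughput-based autoscaling" a little more concrete, here is a minimal sketch of the kind of sizing rule such a system might use. It is not Metamarkets' actual autoscaler: the function name, the headroom factor, and the per-node capacity figure are all assumptions made up for illustration.

```python
import math


def desired_node_count(events_per_sec, per_node_capacity, headroom=0.25, min_nodes=1):
    """Return how many nodes are needed to carry the observed throughput.

    All parameters are hypothetical:
      events_per_sec     observed ingest rate across the cluster
      per_node_capacity  events/sec one node can sustain (measured by you)
      headroom           extra fraction of capacity kept spare for spikes
      min_nodes          floor so the cluster never scales to zero
    """
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    needed = events_per_sec * (1.0 + headroom) / per_node_capacity
    # Round up: a fractional node still requires a whole VM.
    return max(min_nodes, math.ceil(needed))


if __name__ == "__main__":
    # Illustrative numbers only.
    print(desired_node_count(1_200_000, 50_000))
```

A rule like this only pays off when nodes start quickly and are billed in small increments, which is why fast spin-up and per-minute pricing matter for this kind of workload.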
A Note on IO and Network Disk
Input and output is one of the core bread-and-butter aspects of the cloud. Being able to push data to and from disk and network is core to how data flows in a cloud environment. It is worth noting that the general trend in the industry seems to be to push users onto network attached storage instead of local storage. The upside is that the claimed failure rates for network attached storage are lower than for local disk (we’re big, but not big enough to have raw-disk failure statistics). The downside is that when network attached disk has problems, you may see it on multiple instances at the same time, potentially taking them down completely (easy to manage) or inducing a brownout (hard to manage). When a local disk fails, the solution is to kill that instance and let the HA built into your application recover on a new VM. When network disk fails or has a multi-instance brownout, you’re just stuck and have to failover to another failure domain, which is usually in another availability zone or in some cases another region! We know this because this kind of failure has caused production outages for us before in AWS. This trend towards network attached storage is one of the scariest industry trends for big data in the cloud, and there will probably be more growing pains before it is resolved.
AWS has the best offering for local disk solutions of the two cloud vendors. While the newest C4 / R4 / M4 instance classes are EBS only, the I3 / D2 / X1 series have excellent options for local disk.
At the time of this writing, the rate card for local SSD in GCP is much higher than similar storage in AWS, making it an option for people who absolutely require local SSD, but certainly an economically punishing option for those who must do so. (Late Note: a price change was just announced)
For performance, AWS offers a lot of options as far as your expected throughput, and the ability to burst capacity for short times. This allows for a lot of configuration options, and adds a lot of extra monitoring concerns that you need to account for. The end result is you can likely get highly specialized disk settings for your exact application needs, as long as you are willing to spend the initial effort to get tuning and monitoring correct. We do not do extreme disk tuning, and instead go for macro adjustments, which generally means picking a particular disk class and either changing no settings at all or applying simple optimizations. As such, we tend to use GP2 or SC1. Our experience with these two particular disk types is that the performance can be inconsistent and can suffer from noisy neighbors and multi-VM brownouts or blackouts. We do not currently use provisioned IOPS. As long as your disk throughput needs are within the bounds of skew caused by these effects, and you can fail over to an isolated failure domain, they make great options.
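As an illustration of why burst capacity adds monitoring concerns, the sketch below estimates how long a GP2 volume can sustain its burst rate before falling back to baseline. The constants are assumptions taken from AWS's published GP2 figures as I recall them (3 IOPS per GiB baseline with a floor of 100, bursts to 3,000 IOPS, a bucket of 5.4 million I/O credits); check the current EBS documentation before relying on them.

```python
# Assumed GP2 parameters; verify against current AWS documentation.
GP2_IOPS_PER_GIB = 3
GP2_MIN_BASELINE_IOPS = 100
GP2_BURST_IOPS = 3000
GP2_CREDIT_BUCKET = 5_400_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS scale with volume size, with a fixed floor."""
    return max(GP2_MIN_BASELINE_IOPS, GP2_IOPS_PER_GIB * size_gib)


def gp2_burst_seconds(size_gib, credits=GP2_CREDIT_BUCKET):
    """Seconds a volume can run at full burst from the given credit balance.

    Returns None when the baseline already meets or exceeds the burst rate,
    in which case bursting does not apply.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    # Credits drain at (burst - baseline) per second while bursting.
    return credits / (GP2_BURST_IOPS - baseline)


if __name__ == "__main__":
    for size in (100, 500, 1000):
        print(size, gp2_baseline_iops(size), gp2_burst_seconds(size))
```

Under these assumptions a small volume that looks fast in a short benchmark can drop to a fraction of that rate once the bucket drains, which is the sort of behavior you need to alert on.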
For GCP the options are more limited. But from our experience the network attached disk performs EXACTLY as advertised to the point of being shockingly to spec to where we haven’t had any need for deeper configuration options. We haven’t seen any hiccups in the disk layer in GCP yet, and only use network attached disk (persistent disk).
The networking is comprised of node to node networking, node to external networking and node to network-disk. The two clouds tackle the networking issue very differently, and such differences need to be taken into account depending on the needs of your applications.
For AWS, networking expectations are one of the hardest things to figure out. The instances give general specifications for “low,” “medium” and “high” network, or general limits like “up to 10Gbs.” If you read the fine print of the 10Gbs or 20Gbs instances, you’ll notice that you’ll only see those throughputs if you are using placement groups, which can be subject to freeze outs where you cannot get capacity. What this means is that your network throughput and consistency are going to be highly varied and hard to predict. In order to make full use of these networks you will have to have special network drivers on these machines that are not packaged with some older distributions, as well as enable special flags on your instances to enable the enhanced network. This leaves you with an extra piece of critical software whose version you need to test, deploy, and keep track of. Luckily, Linux distributions have started carrying more up to date AWS cloud network drivers by default.
Astute readers will note that disk reading over a network that has such loose guarantees of throughput is a recipe for unpredictability. AWS has mitigated this by having “EBS optimized” as an option which puts the network traffic for network attached storage in a different bandwidth pool. In our experience this eliminates contention for resources with yourself, but does not prevent upstream-EBS issues.
For GCP, networking per VM is both significantly higher than what is achieved in AWS, and more consistent. The achievable network capacity is based on the quantity of CPUs your VMs have. GCP shows a strong ability to define a spec and deliver on the throughput expectations. The balance here is that GCP has no dedicated bandwidth for network attached storage, but the total network available is higher.
Don’t let the reviews above scare you too much. With a little tuning, a disk-and-network heavy operation like Kafka broker replacement, where we tend to max out the network bandwidth, can be made to show very strong top-hat characteristics for both AWS and GCP.
From a network logistics standpoint, GCP has an advantage in how it labels its zones. For GCP (at the time of this writing) the zone names are the same for everyone. In AWS the zones are shuffled around per account so that it is difficult to determine which zone corresponds to other zones. Additionally, the zones reported by spot prices on some of the billing and invoicing documents do not directly correspond to a particular zone’s alphabetical notation.
The fundamental technology for AWS EC2 VMs is Xen, while the fundamental technology for GCP GCE VMs is KVM. The way each handles compute demand is significantly different from the guest OS perspective. As a simple example, AWS claims full exposure of the underlying NUMA topology on their latest series of i3 instances, whereas GCP has little to no information regarding NUMA considerations on their platform.
In practical usage in our clusters, Samza shows significantly more CPU skew for the same amount of work done on GCP compared to AWS where the CPU is more consistent from VM to VM. For VMs that run with relatively light CPU utilization (<60% or so), this will probably not be enough to affect the guest system. But for VMs that are intended to run near 100% CPU usage (like heavily bin-packed containers), and require approximately equal work done by different nodes over time, this can make capacity planning more challenging.
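One way to put a number on this kind of skew is the ratio of the hottest node's CPU utilization to the fleet average. The sketch below is an assumption about how you might measure it, not how Metamarkets does; the function names and sample values are hypothetical.

```python
from statistics import mean


def cpu_skew(utilizations):
    """Ratio of the busiest node to the fleet average (1.0 means perfectly even)."""
    avg = mean(utilizations)
    if avg == 0:
        return 1.0
    return max(utilizations) / avg


def safe_mean_utilization(skew, ceiling=1.0):
    """Highest fleet-average utilization that keeps the hottest node under `ceiling`."""
    return ceiling / skew


if __name__ == "__main__":
    # Made-up per-node CPU percentages for nodes doing "equal" work.
    even_fleet = [60, 60, 60, 60]
    skewed_fleet = [40, 60, 80]
    print(cpu_skew(even_fleet))
    print(cpu_skew(skewed_fleet))
    print(safe_mean_utilization(cpu_skew(skewed_fleet)))
```

With a skew of about 1.33, the fleet average has to stay near 75% for the hottest node to stay below 100%, which is why skew matters far more for tightly packed nodes than for lightly loaded ones.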
This difference is also apparent in some of the cloud service offerings. Kinesis from AWS is structured around shards which are similar to Kafka topic partitions. If you are only looking to parallelize your work, such a setup works best if your workers have nearly even work distribution to prevent any particular shard from falling behind. This also requires making sure the data going into the shards is nearly equal in “work” required to process it. SQS from AWS is a different message passing option which claims to have “nearly unlimited” TPS capabilities, but we have never tested the scaling limits for our use cases. On the GCP side, PubSub is their all-in-one solution for both a distributed log and messaging, and is billed on a bytes-through basis. We have not evaluated PubSub at our scale.
From a flexibility standpoint, the GCP offering of custom machine types is something we use extensively. The maximum memory per core available in GCP is not as high as in AWS (unless you want to pay a premium), but that has not affected our services running in GCP.
Our interaction with AWS support has traditionally been on only the most rudimentary level. This is largely because most support modes that operate on a percent of cloud spend become worse and worse deals for you as your cloud spend increases. You do not get economies of scale on your support. This leaves us on the lowest level of support.
To kick-start our utilization of GCP we engaged their Professional Services offering. Our interaction with the Google PSO team has been very positive, and I highly encourage anyone considering a serious undertaking on GCP to engage their representative about such opportunities. The PSO team we worked with was very good at addressing our needs on everything from best ways to get rid of toil to getting in experts to talk to our engineers about the current and future plans for some key GCP products. For outside of the PSO team, we have generally found the GCP support better at either directly addressing concerns, or identifying the specific parts they cannot help with. The GCP support has been a much more favorable interaction compared to our experience (at the support level we pay for) with AWS support.
For AWS, multiple “Availability Zones” are within a single “Region”. For GCP, multiple “Zones” are in a single “Region”. With how the machines are laid out, on AWS you have the option of getting dedicated machines which you can use to guarantee no two machines of yours run on the same underlying motherboard, or you can just use the largest instance type of its class (ex: r3.8xlarge) to probably have a whole motherboard to yourself. In GCP there is no comparable offering. What this means is that for most use cases, you have to assume in any particular zone that all of your instances are running on the same machine. We have had failures of underlying hardware or some part of networking simultaneously take out multiple instances before, meaning they shared the same hardware at some finer level than just availability zone.
In theory, any failures should be isolated per zone, and in rare scenarios a failure hits the entire region at once, and in incredibly rare scenarios it hits multiple regions. In practice, what we have seen is that multiple zones protect primarily against resource freeze out (unable to launch new VMs) and secondarily, but weakly, limit the scope of hardware failures. What they do not protect against is a particular cloud service having sudden issues across multiple zones. Unfortunately, we do not have stats on multi-region failures in either AWS or GCP. This means we cannot tell the difference between a local failure and a global one.
GCP has been a lot more forthcoming with what issues their services are experiencing. Whereas in AWS we will often see issues that go unreported or, worse, get told that there is no issue (see Support above). The transparency provided by GCP is more helpful for our team because it allows us to make better decisions about investing in cloud failover (problem on GCP side) or monitoring, detection, and service self-healing (problems on our side). If you take the self-reporting of incidents at face value, AWS often states the blast radius of an issue as confined to a specific region, whereas GCP has more claims of global issues 1 2.
Both providers offer SLAs for different services and both want you to use their vendor-specific offerings for various items as core parts of your technology. But both the GCP SLA and the AWS SLA only cover the services affected. This means if a failure in blob storage causes your expensive compute time to be wasted, then the SLA might only cover the blob storage cost. You should discuss SLAs with your account representative if you have any questions or need further explanations.
A unique feature for GCP is the ability to migrate your VMs to new hardware transparently. This live migration is something we were very hesitant about at first, but in practice when migrating a chunk of Kafka brokers in the middle of broker replacement, none of our metrics could even detect that a migration had occurred!
Wow… billing. The fundamental way in which AWS and GCP bill is very different. And getting a handle on your cloud spend is a huge hassle in both AWS and GCP. AWS provides a pre-canned billing dashboard which provides basic macro insights into your bill. GCP provides estimates exported into BigQuery, upon which you can build Data Studio reports on your own. We do not find either of these sufficient and instead opt to use our own expertise to build beautiful interactive data streams on our cloud billing data. Since the Metamarkets user interface is designed to help make sense of highly dimensional time series data, putting the cloud invoices into this system is a natural choice and has worked out very well so far.
For AWS, cloud spend is accrued as line items where different line items have rate identifiers whose rates are multiplied by the consumed resource quantities. This follows a very predictable denormalized data scheme that is compatible with multiple analysis tools.
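The "rate times quantity" model is simple enough to sketch. The field names and figures below are hypothetical and much simpler than a real AWS billing report, but they show why a flat, denormalized layout is easy to roll up along any dimension.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical, simplified line items; real billing reports carry many more columns.
LINE_ITEMS = [
    {"product": "compute", "rate_id": "rate-001", "rate": Decimal("0.50"), "quantity": Decimal("24")},
    {"product": "compute", "rate_id": "rate-002", "rate": Decimal("0.25"), "quantity": Decimal("10")},
    {"product": "storage", "rate_id": "rate-003", "rate": Decimal("0.10"), "quantity": Decimal("100")},
]


def cost_by(items, dimension):
    """Sum rate * quantity for each distinct value of `dimension`."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for item in items:
        totals[item[dimension]] += item["rate"] * item["quantity"]
    return dict(totals)


if __name__ == "__main__":
    print(cost_by(LINE_ITEMS, "product"))
    print(cost_by(LINE_ITEMS, "rate_id"))
```

Decimal is used instead of float so that sums of currency amounts do not pick up rounding noise.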
For GCP billing exports into BigQuery, each line item is an aggregate over a time period of accrued usage (many GCP-internal calculations are rolled up to Day boundaries at the time of this writing), and has sub-components of credits. For example, if you run an n1-highmem-16 at the standard rate card of 0.9472 dollars/hr for one day, you will see a line item for 22.7328 dollars for “Highmem Intel N1 16 VCPU running in Americas” with a usage of 86400 seconds. If you run the same instance throughout the month, you will eventually start to see a credit for “Sustained Usage Discount” as an item in the nested list of credits for that resource’s usage, beginning on whichever time-slice it starts to get applied. Subtract the sum of all the credits from the sum of the usage costs and you have what your final bill will be. This has two major disadvantages: 1) auditing is very hard; you usually have to just take the numbers as presented (in our experience the numbers are correct, just hard to calculate independently). 2) calculating an “estimated spend this month” kind of projection is very challenging, which makes your finance team cranky.
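The arithmetic in the example above can be checked with a short sketch. The hourly rate and the 86400 seconds come from the text; the record layout is a simplified assumption rather than the real BigQuery export schema, and the credit amount is a made-up placeholder, not an actual discount figure.

```python
from decimal import Decimal


def usage_cost(usage_seconds, hourly_rate):
    """Cost of a line item given usage in seconds and a per-hour rate."""
    return hourly_rate * Decimal(usage_seconds) / Decimal(3600)


def net_cost(line_items):
    """Sum of usage costs minus the sum of all nested credits."""
    usage = sum((item["cost"] for item in line_items), Decimal("0"))
    credits = sum(
        (credit["amount"] for item in line_items for credit in item["credits"]),
        Decimal("0"),
    )
    return usage - credits


# One day of n1-highmem-16 at the rate quoted in the text.
day_cost = usage_cost(86400, Decimal("0.9472"))

# Hypothetical layout: two daily line items, the second carrying a credit.
# The 1.00 credit is a placeholder value for illustration only.
items = [
    {"description": "Highmem Intel N1 16 VCPU running in Americas",
     "cost": day_cost, "credits": []},
    {"description": "Highmem Intel N1 16 VCPU running in Americas",
     "cost": day_cost,
     "credits": [{"name": "Sustained Usage Discount", "amount": Decimal("1.00")}]},
]

if __name__ == "__main__":
    print(day_cost)
    print(net_cost(items))
```

The hard part in practice is not this subtraction but predicting when and how large the credits will be, which is what makes independent audits and mid-month projections difficult.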
This section is focused more around the higher level aspects of cost, and not the specific rates Metamarkets pays for different services.
The strategy for AWS is largely around instance reservations. With the recent addition of convertible reservations and instance size flexibility, it makes experimenting with more efficient instance configurations much easier. The only downsides we’ve encountered with this program are occasional instance type freeze-out in a particular zone due to lack of availability, and the complexity of handling convertible reservations that do not all have the same start date. We find the flexibility provided by these features very much worth the wait for capacity to become available. Upgrades to instance types or pricings tend to go on about 12 to 16 month cycles. Ask your AWS representative if you have any questions or concerns along these lines.
For GCP, the strategy seems to be headed toward committed use discounts and sustained usage discounts with a premium for specific extended compute or extended memory needs. This allows for quite a bit of flexibility in how your clusters are configured, and provides a natural way to transition from independent VMs to a containerized environment.
For transient instances each provider has slightly different solutions. For GCP the preemptible instances are an alternative to running things with a guaranteed tenancy. An interesting feature of the GCP preemptible instances is that they are terminated if left up for 24 hours. This makes having a budgeted spend pretty straightforward, and helps make sure you are not doing crazy things on the preemptible instances that you shouldn’t be doing. For AWS the offering is around the spot market. We love the spot market but it does make your monthly bill very hard to predict. The nature of the transient VM offerings means that the capacity available for any particular task is going to be a function of how long the task needs to run, and how many resources the task needs. If you are going to go down the route of extensively using the transient instances, make sure you have the ability to migrate your workload among different resource pools.
Both cloud providers have excellent security features for data and we have never been concerned about security of the cloud providers themselves. This section is dedicated more to the ease of management and a few feature differences between the cloud providers.
AWS has very detailed IAM rules that, in general, are focused on functions performed against resources. For example, you have different things you can do to a resource such as get, list, describe, edit, and a host of other things. At the time of this writing S3 supports 20 different operations you can perform against an object in S3. This means you are probably going to go down the route of granting large swaths of rights to some IAM roles, and just granting one or two to others.
In GCP, the IAM is centered more around pairing logins or IDs with intentions against a resource. Groups are more a logical construct to make the management of the IDs easier, and instead of detailed operations that can be done against resources, intentions against the resource are expressed as “Roles.” While there is mixed support for fine grained access controls, the general use cases are going to be against pre-canned roles and intentions such as “Viewer,” “Subscriber,” “Owner” or “User.”
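To make the contrast concrete, here is a small sketch of the two policy shapes as Python dictionaries. The overall structure follows the JSON policy formats of the two providers as I understand them; the bucket name, service account, and helper functions are hypothetical, and the helpers deliberately ignore wildcards, Deny statements, and conditions.

```python
# AWS style: statements list the individual actions allowed on a resource.
AWS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",  # hypothetical bucket
        }
    ],
}

# GCP style: bindings pair a role (an intention) with the identities holding it.
GCP_POLICY = {
    "bindings": [
        {
            "role": "roles/storage.objectViewer",
            "members": [
                # hypothetical service account
                "serviceAccount:reader@example-project.iam.gserviceaccount.com"
            ],
        }
    ]
}


def aws_allowed_actions(policy):
    """Collect explicitly allowed actions (simplified: no wildcards or Deny)."""
    actions = set()
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        listed = statement["Action"]
        # "Action" may be a single string or a list of strings.
        actions.update([listed] if isinstance(listed, str) else listed)
    return actions


def gcp_roles_for(policy, member):
    """List the roles bound to a given member."""
    return [b["role"] for b in policy["bindings"] if member in b["members"]]


if __name__ == "__main__":
    print(aws_allowed_actions(AWS_POLICY))
    print(gcp_roles_for(
        GCP_POLICY,
        "serviceAccount:reader@example-project.iam.gserviceaccount.com",
    ))
```

The AWS document grows with every operation you want to allow, while the GCP document grows with every identity you want to give a role to.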
As far as being the target of attacks, we noticed a significant difference between AWS and GCP. If you search your sshd logs for the phrase “POSSIBLE BREAK-IN ATTEMPT!”, the quantity of attempts in GCP is dramatically higher than in AWS. In GCP we typically see somewhere around 130,000 break-in attempts every day. In AWS it is on the order of a few hundred.
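If you want to run the same comparison on your own hosts, a minimal sketch of the counting looks like this. It assumes the traditional syslog line format, where the first two whitespace-separated fields are the month and day; the sample lines are fabricated for illustration and use reserved documentation addresses.

```python
from collections import Counter

PHRASE = "POSSIBLE BREAK-IN ATTEMPT!"

# Fabricated sample lines in classic syslog format, for illustration only.
SAMPLE_LOG = [
    "Aug 14 03:12:55 host sshd[1234]: reverse mapping checking getaddrinfo for "
    "example.invalid [203.0.113.7] failed - POSSIBLE BREAK-IN ATTEMPT!",
    "Aug 14 03:13:02 host sshd[1240]: Accepted publickey for deploy from 198.51.100.4",
    "Aug 14 09:41:10 host sshd[2201]: reverse mapping checking getaddrinfo for "
    "example.invalid [203.0.113.9] failed - POSSIBLE BREAK-IN ATTEMPT!",
    "Aug 15 00:00:07 host sshd[3310]: reverse mapping checking getaddrinfo for "
    "example.invalid [203.0.113.7] failed - POSSIBLE BREAK-IN ATTEMPT!",
]


def count_break_in_attempts(lines):
    """Count matching lines per day, keyed by the 'Mon DD' syslog prefix."""
    counts = Counter()
    for line in lines:
        if PHRASE in line:
            day = " ".join(line.split()[:2])
            counts[day] += 1
    return counts


if __name__ == "__main__":
    for day, total in count_break_in_attempts(SAMPLE_LOG).items():
        print(day, total)
```

To use it on a real host, pass an open log file instead of `SAMPLE_LOG`; the path and timestamp format vary by distribution, so adjust the day extraction accordingly.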
Both clouds offer a form of a Key Management Service. We use this to store secrets in blob storage (S3 / GS) in an encrypted form. The secrets are usually DB passwords or the secret component of cloud keys for the other cloud (AWS secrets encrypted and stored on GS, and GCP secrets encrypted and stored in S3). Read access to the blob storage and decryption rights against the key is limited to specific machines (machine-role in AWS and service-account in GCP) so that specific instances can read and decrypt the secret to authorize against the other cloud. Both clouds have very workable interfaces. The transparent decryption AWS offers for S3 is very easy to use and gives AWS’s KMS solution an advantage in our use case.
Our largest compute footprint runs on home-grown modifications to CoreOS (close cousin of the GKE COS) adding Mesos support. For some of our other service clusters we are investigating cloud container systems. In early investigations GKE is much easier to adopt and has better high level features than ECS. But the networking connectivity restrictions in GKE are very limiting for a migration of services from a non-containerized environment to a containerized one (something being actively addressed). The problem related to CPU skew also makes tightly packed GKE nodes more worrisome. I’m bullish that the cloud providers will come up with increasingly better solutions in this area.
The monitoring features for AWS are exposed through CloudWatch, and for GCP are exposed through Stackdriver. These can both provide basic dashboards but lack the real-time slice and dice capabilities our team needs. So we use our own Metamarkets products to monitor the metrics coming off our machines. For logging we found that Stackdriver can provide some interesting information by having access to details at the load balancer level, but for the vast majority of our logging needs we export data to SumoLogic. Neither CloudWatch nor Stackdriver treats containerized services as a first-class assumption. As container orchestrators such as Mesos and Kubernetes gain more popularity, this is an area I’m hoping to see more innovation in down the line.
One of the odd aspects about GCP was that many of the features of interest for our use were pre-GA. This left us with a strange choice, where we had to determine if going with a pre-GA offering was more risky or less risky than developing an alternative in-house. It is worth noting that Gmail was in beta from 2004 to 2009, a hefty testing timeline. So a common question we would ask our account representative was “Is this real-beta or gmail-beta?” Most of the time, pre-GA items were determined to be more stable and reliable than what we could cook up as an alternative in a short time.
In general AWS has a higher quantity of more mature features, but the features GCP is publishing tend to come with less vendor lock-in. This also means that you can try out the public versions of many of the GCP offerings without any spend on the GCP platform itself, which is very valuable to feed the natural curiosity of developers.
The AWS auto scaling groups function close to how we traditionally operate scaling needs. We don’t really use any auto-scaling capabilities, but use ASGs as a way to do instance accounting. Being able to modify the instance count in the UI is very handy. In GCP, the instance groups have a nasty side effect where you cannot leave the instance quantity unspecified in deployment manager, so it is very easy for one operator to scale an instance group, and another to push a different count through deployment manager.
The GCP web UI is a little more modern and feels snappier, though the recent updates to the AWS console are a huge improvement over the prior version. The in-browser SSH sessions in GCP are also very nice. For building instance templates themselves, the ability to just plop a file into GCS and use that as your instance root image is a very handy feature for GCP.
At Metamarkets, we believe in the cloud. More specifically, we believe that one day soon people will think of servers the same way they think of circuits. Our technology investments are aimed at making the connectivity of data to insight completely seamless. By exercising the advantages of various cloud providers, Metamarkets is better positioned to adjust to a changing world and adapt our compute needs as the ravenous desire for data insight continues to grow.