Cloud Performance Hit of Meltdown and How to Address it in Five (Easy) Steps
2018 has started with a bang. Just a few days into the new year, the entire tech industry worldwide was hit by the news of widespread security vulnerabilities – appropriately named Meltdown and Spectre – and the related patches that had to be deployed to mitigate the vulnerabilities.
Why this matters
The whole industry was shaken by the news. What’s really unusual about this problem is its sheer scope: pretty much every computer around the globe is affected, as the problem goes back 20-25 years and encompasses both new and legacy systems. Further complicating matters, these are fundamental hardware architecture problems affecting most Intel processors, many ARM designs and (for Spectre) AMD systems, which means the vulnerabilities cannot easily be solved by software alone.
Before we start, let us just say a word of thanks to the thousands of engineers who worked under tremendous pressure, under embargo for months and over the holidays, to develop and deliver all the patches for all the relevant components of the stack, from operating systems to microcode, to hypervisors, storage, drivers, etc.
The technical nature of the vulnerabilities, brought to light by the Project Zero team at Google, is quite complex, but it essentially involves CPU speculative execution and the possibility of exploiting it to access protected memory. Jann Horn of Google's Project Zero discovered both of the flaws as part of his research; he detailed them in this blog post.
Why is this particularly important to Nouvola? First of all, we aim to bring some clarity to our customers, as there is a lot of confusion in the industry due to the complex technical nature of the problem and its implications for many areas of the stack. More importantly, it’s a valuable example of how security and performance have collided to create the perfect storm.
It’s known that security and performance can sometimes be conflicting requirements, but it’s hard to imagine another case with an impact as profound as this one.
Specifically, for the cloud, the vulnerability is extremely critical because it could expose the memory of one user's virtual machine (VM) to another user running a different VM on the same physical machine. This is usually called an instance-to-instance or instance-to-host concern: it assumes that an untrusted neighbor instance could read the memory of another instance or of the cloud hypervisor.
Although there has been some talk about the impact on AWS users, it’s worth mentioning that the problem is not limited to AWS. It in fact affects any cloud provider, including Azure and Google Cloud, as well as tier-2 cloud providers. Both Azure and Google Cloud have updated their infrastructure, and tier-2 cloud providers have banded together to cope with Meltdown and Spectre.
Since this is a fundamental architectural flaw in the CPU design, there has been a scramble to mitigate and prevent this hardware behavior via software changes. For the sake of this discussion, we will focus on Linux since, in its various distributions, it’s how the large majority of cloud deployments are done today.
For Meltdown, the solution has been to implement a Linux kernel patch called KPTI (Kernel Page Table Isolation, formerly called KAISER), whose overhead is reduced by a hardware feature called PCID (process-context identifiers). KPTI essentially separates the kernel's page tables from the ones visible to user-mode code, so that kernel memory is no longer mapped while user code runs. This patch was already in development; it has been accelerated and is now included in Linux kernel 4.15. Backports to previous releases have also been provided through the standard distributions.
Let’s pause here for a second. This is not a simple bug fix. This is a fundamental change to how the kernel's memory management works. Usually this kind of thing would be discussed for years before getting introduced, especially given its associated performance impact.
In the words of the Linux community:
"KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches."
It is worth mentioning that the state of mitigation for Spectre is severely behind, and KPTI is not particularly effective at mitigating the Spectre vulnerability, which is proving to be much harder to mitigate.
It’s pretty clear that such a fundamental change to operating system behavior would lead to a performance hit for cloud workloads, although it probably won’t be apparent for some time how individual workloads are affected differently.
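To see why workloads can be affected so differently, here is a rough back-of-envelope sketch based on the "few hundred cycles" per syscall figure quoted above. The 300-cycle cost, the 3 GHz clock, the syscall rates, and the function name are all illustrative assumptions, not measurements.

```python
def kpti_overhead_fraction(syscalls_per_sec, extra_cycles=300, clock_hz=3.0e9):
    """Fraction of one core's cycles spent on the added per-syscall cost.

    extra_cycles and clock_hz are illustrative assumptions, not measured values.
    """
    return syscalls_per_sec * extra_cycles / clock_hz


# Hypothetical compute-bound service: 10,000 syscalls per second per core.
light = kpti_overhead_fraction(10_000)

# Hypothetical syscall-heavy workload: 1,000,000 syscalls per second per core.
heavy = kpti_overhead_fraction(1_000_000)

print(f"light: {light:.2%}, heavy: {heavy:.2%}")
```

Under these assumptions the first workload loses about a tenth of a percent and the second about ten percent. The estimate ignores secondary effects such as extra TLB misses, so treat it as a lower bound on intuition rather than a prediction; only measuring your own workload gives a real number.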
Most people have observed performance degradation from the cumulative impact of all the patches, ranging from microcode to operating system to hypervisors, although AWS has declared that the large majority of EC2 workloads shouldn’t see an impact. This is the security bulletin that AWS issued, which continues to be updated as more information becomes available.
“We have not observed meaningful performance impact for the overwhelming majority of EC2 workloads.”
Linus Torvalds indicated we should expect about a 5% overall impact on performance. However, much will depend on PCID support, which is available on newer machines. And some patches won’t be able to take advantage of PCID, even on newer hardware.
In real life, people have seen a substantial impact, way beyond 5%, and dramatic increases in CPU utilization, as reported in this AWS forum. Other vendors have indicated performance hits could be between 5 and 50%, particularly for I/O-intensive workloads.
Since this is a fundamental change to the operating system, the resulting performance issues are not limited to instances: they spread to databases (PostgreSQL) and other cloud services. Redis, for example, was said to have slowed down by 7%. In an interesting development, Google released a mitigation called Retpoline in an attempt to manage performance issues and has already implemented it in Google Cloud Platform.
What Should I Do?
As some have put it, the world just lost 30% of its total performance / capacity in one day. And as the dust is still settling from these vulnerabilities, we’ve been hearing from some of our customers who are at a loss as to how to tackle this problem. So we’d like to offer some actionable steps that you can take to protect the security and optimize the performance of your cloud deployments.
Before you even start tackling performance concerns, make sure your instances are up-to-date with the latest patches. Your cloud provider has already updated the infrastructure, but you need to take steps to upgrade your own VM with the latest patches.
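As a minimal sketch of how you might check a Linux VM, the script below reads the mitigation status that patched kernels (4.15 and the distribution backports) report under /sys/devices/system/cpu/vulnerabilities, and looks for the pcid flag in /proc/cpuinfo. It assumes an x86 Linux guest; the function names are our own, and a kernel that predates the patches will simply not have these files.

```python
from pathlib import Path

# Directory where patched Linux kernels report mitigation status.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_vulnerability_status(name, base_dir=VULN_DIR):
    """Return the kernel's status string for a vulnerability, or None if absent."""
    try:
        return (Path(base_dir) / name).read_text().strip()
    except OSError:
        return None


def has_cpu_flag(cpuinfo_text, flag):
    """Return True if `flag` appears on any 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


if __name__ == "__main__":
    for name in ("meltdown", "spectre_v1", "spectre_v2"):
        status = read_vulnerability_status(name)
        print(f"{name}: {status or 'not reported by this kernel'}")
    try:
        cpuinfo = Path("/proc/cpuinfo").read_text()
    except OSError:
        cpuinfo = ""
    print("pcid supported:", has_cpu_flag(cpuinfo, "pcid"))
```

A missing status file usually means the kernel has not been updated yet, which is a signal to patch before doing any performance work.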
Then, let’s talk about performance:
1. Re-Baseline. Real workloads are different from benchmarks, and ultimately the only thing that matters is, well, your own workload. The first order of business is to re-baseline everything. Irrespective of what your cloud provider says, you need your own data. Use synthetic testing to run critical scenarios and happy paths, and measure at increasing scale. Measure end-to-end performance but also individual endpoint performance, to see whether a slowdown can be narrowed down to a discrete number of endpoints. Have the data in hand: response time and throughput, as well as infrastructure data like CPU utilization, bandwidth and I/O.
2. Compare. If you have a previous baseline, compare the two. (This, by the way, is why it is important to use proactive performance testing solutions like Nouvola to have a current baseline with all your critical scenarios at scale. Here’s to always keeping a current baseline of your system!)
3. Test specific components. Also test, in isolation, specific components that might be at higher risk, such as I/O-heavy modules, databases and services that might be the bottlenecks. If you have data to compare with, you might want to focus on the areas that have been most affected.
4. Re-evaluate your infrastructure. If you identify specific bottlenecks or increases in your infrastructure utilization, you may want to consider resizing your instances, or changing instance type. Whatever you decide, do a small deployment first and run a few experiments there with the scenarios that were affected. You might also look at ways to reduce the impact of the performance hit, by modifying your code to reduce system calls, for instance.
5. Retest everything. If you decide to change something, you need to retest everything. Changing an instance may or may not have the desired effect, or may not entirely solve the problem. Run your synthetic scenarios at scale again to have a clear “Before and After” data set, enabling data-driven discussion with your team and with management. Don’t run only the scenarios that were affected. You need to retest everything to make sure there weren’t any unintended consequences of your changes. The data will also give you more clarity on what really happened.
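The comparison at the heart of steps 1, 2 and 5 can be sketched in a few lines. The example below is only an illustration: the endpoint names, the sample latencies, the 10% threshold and the function names are all made up, and it assumes you already have per-endpoint response times (in milliseconds, all positive) from a run before and a run after the patches or infrastructure change.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare_baselines(before, after, pct=95, threshold=0.10):
    """Compare per-endpoint latencies and flag regressions above `threshold`.

    before/after map endpoint name -> list of response times in ms.
    Returns {endpoint: (before_value, after_value, relative_change, flagged)}.
    Only endpoints present in both runs are compared.
    """
    report = {}
    for endpoint in sorted(before.keys() & after.keys()):
        b = percentile(before[endpoint], pct)
        a = percentile(after[endpoint], pct)
        change = (a - b) / b
        report[endpoint] = (b, a, change, change > threshold)
    return report


# Made-up sample data, for illustration only.
before = {"/login": [100, 110, 120, 130], "/search": [200, 210, 220, 230]}
after = {"/login": [102, 111, 123, 133], "/search": [240, 260, 280, 300]}

for endpoint, (b, a, change, flagged) in compare_baselines(before, after).items():
    marker = "REGRESSION" if flagged else "ok"
    print(f"{endpoint}: p95 {b} ms -> {a} ms ({change:+.1%}) {marker}")
```

Comparing a high percentile rather than the mean is a design choice: tail latency tends to show a slowdown earlier than the average does. The same comparison works for throughput or CPU utilization if you swap in those series.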
The Meltdown and Spectre vulnerabilities (and the mitigation efforts to date) have resulted in an unprecedented worldwide hit to performance. While security is obviously paramount, these vulnerabilities have also reinforced the critical importance of performance testing. If anything, the need to respond to the performance impact of these vulnerabilities has made it even clearer that diligently maintaining good performance testing processes at all times will give you an advantage if (or when) similar scenarios arise in the future.
We’ll continue to post on the Nouvola blog as new information becomes available.
How has Meltdown affected you? Get in touch to share your thoughts and/or let us know how we can help.