Effect of Frequency Governor on Java Benchmarking
February 26th, 2015 Charles Allen
A very common tool in a programmer’s arsenal is a MacBook Pro (MBP). However, a major OSX drawback for a developer is the lack of easy, fine-grained control over kernel behavior comparable to that found on machines running Linux or raw BSD. In this post, we will explore the effect of the frequency governor on MBPs with a modern Intel chip, using the wall-time query execution speed in Druid as a simple Java benchmark.
One of the most common tasks in evaluating code is to look at key bottlenecks in execution time. Like many other developers, we run most basic benchmarks and profiles on our local development machines. There are a number of tools, including JUnitBenchmarks, Caliper, and JMH, to assist in benchmarking small units of code. Internally we often use RDruid to collect statistics about different patches against Druid; more information can be found in the performance blog post. The downside of this technique is that benchmarks tend to have odd multi-modal distributions, which can be infuriating when trying to optimize software. Without a clean execution time distribution, it is difficult to tell whether your code or machine is behaving consistently across repeated task executions, so it is of great benefit to understand where these per-execution discrepancies come from.
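For example, a minimal JMH microbenchmark might look like the sketch below. The summed array is a hypothetical stand-in workload, not the Druid query path; the point is only the annotation scaffolding JMH expects.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class SumBenchmark {
    private long[] values;

    @Setup
    public void setup() {
        // Fixed seed so every run benchmarks the same data.
        Random random = new Random(42);
        values = new long[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextLong();
        }
    }

    // JMH invokes this repeatedly and reports average time per call.
    // Returning the result keeps the JIT from eliminating the loop.
    @Benchmark
    public long sum() {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }
}
```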
In an effort to gather better results on benchmark execution time, we decided to investigate the frequency governor. To provide the most efficient energy usage, modern kernels can scale the CPU frequency up and down in response to system load or other system states (like scaling down to keep your CPU from thermal shutdown or from releasing the magic smoke). In recent versions of OSX, frequency scaling is largely governed by a combination of the XNU CPU Power Management and embedded Intel power-scaling logic. Neither of these is eager to expose tweaking knobs in OSX unless you’re willing to make major changes at the UEFI level. While OSX offers little control over CPU frequency during performance testing, Intel at least provides a very nice tool for monitoring frequency scaling. For the tests in this post, I used the Intel Power Gadget 3.0.1 for Mac, which can log data such as the timestamp of each frequency change and the new frequency the CPU is running at.
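As a rough illustration, a sketch of parsing such a log follows. The two-column elapsed-seconds/MHz layout (and the `FrequencyLog` name) is an assumption for illustration; the actual Power Gadget CSV has more columns, and the indices would need adjusting to match the output of your version.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FrequencyLog {
    // One (elapsed time, frequency) sample from the governor log.
    public static final class Sample {
        public final double seconds; // elapsed seconds since logging started
        public final double mhz;     // CPU frequency reported at that instant

        public Sample(double seconds, double mhz) {
            this.seconds = seconds;
            this.mhz = mhz;
        }
    }

    // Parse rows of "elapsedSeconds,frequencyMHz"; adjust the column
    // indices to match the CSV your Power Gadget version actually emits.
    public static List<Sample> parse(String path) throws IOException {
        List<Sample> samples = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] cols = line.split(",");
            if (cols.length < 2) {
                continue; // skip blank or malformed rows
            }
            try {
                samples.add(new Sample(Double.parseDouble(cols[0].trim()),
                                       Double.parseDouble(cols[1].trim())));
            } catch (NumberFormatException e) {
                // non-numeric row (header or summary footer); ignore it
            }
        }
        return samples;
    }
}
```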
CPU frequency (open circles) and query execution time (red dots) are plotted as a function of time above. Most notable, and entirely expected, is that when CPU frequency trends higher and more consistent, execution time trends lower and more consistent. The timestamp reflects the minute and second of the hour during which the test was run.
To measure baseline performance, we spun up an instance on Amazon’s EC2 with the same Druid settings and dataset. Then the start and stop wall-times were collected for each query task. These query times were then compared against the frequency-scaling dataset in order to approximate the number of CPU cycles used by each query (assuming exactly one user CPU-second per wall-clock second).
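Concretely, that approximation amounts to integrating the logged frequency, treated as a step function, over each query’s wall-time window (cycles ≈ Σ fᵢ·Δtᵢ). A minimal sketch, reusing the hypothetical `FrequencyLog.Sample` type from above:

```java
import java.util.List;

public class CycleEstimator {
    // Approximate the cycles a query consumed between startSec and stopSec,
    // assuming one user CPU-second per wall-clock second and treating the
    // logged frequency as constant between consecutive samples.
    public static double estimateCycles(List<FrequencyLog.Sample> samples,
                                        double startSec, double stopSec) {
        double cycles = 0.0;
        for (int i = 0; i + 1 < samples.size(); i++) {
            // Overlap of this constant-frequency step with the query window.
            double lo = Math.max(samples.get(i).seconds, startSec);
            double hi = Math.min(samples.get(i + 1).seconds, stopSec);
            if (hi > lo) {
                cycles += samples.get(i).mhz * 1e6 * (hi - lo);
            }
        }
        return cycles;
    }
}
```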
This leaves us with three key datasets: 1) the wall time on EC2, 2) the wall time on my local machine, and 3) the estimated cycle count on my local machine. To make such drastically different results (and units) comparable, each dataset was normalized with respect to its own median.
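A minimal sketch of that normalization (the class and method names here are ours, for illustration):

```java
import java.util.Arrays;

public class MedianNormalizer {
    // Divide every measurement by the dataset's own median so that
    // EC2 wall-times, MBP wall-times, and cycle estimates share a scale.
    public static double[] normalizeByMedian(double[] data) {
        double median = median(data);
        double[] normalized = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            normalized[i] = data[i] / median;
        }
        return normalized;
    }

    static double median(double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```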
As can be seen above, the tests run on the EC2 instance (red) give the clearest and tightest results, while the raw wall-time results from the MBP (blue) have a multi-modal distribution. The MBP results adjusted to estimate raw CPU cycle count (green) give a much cleaner distribution than the raw wall-time values, though not as low a relative variance as the EC2 results. This is believed to be because the simplistic integration method for converting wall time to approximate CPU cycles does not properly capture the minutiae of exactly how many CPU-seconds the task used between frequency hops. The data and code are available online.
This high variance and multi-modal distribution have a great impact when trying to determine whether small improvements are truly improvements. The simplest approach is to run a few tests against a master branch and a feature branch of some code and compare the results. Of particular note is that a simple t-test on a small number of response times can easily be wrong. If you are looking for that 1% improvement, you probably aren’t going to get reliable results running benchmarks on your local MBP. The good news is that, in practical experience, the medians tend to give correct information over large enough sample sizes for improvements greater than a few percent. So, even if a stable testing environment isn’t available, you can get a pretty good idea of the impact of your code changes simply by gathering more data. Waiting for those extra benchmarks to finish on your development machine will leave you with one burning question: “How much will browsing reddit throw off my results?”
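If you do take the more-data route, a sketch of the median-based comparison, reusing the hypothetical median helper above, might look like this; the point is simply to compare medians over many runs rather than t-testing means over a few:

```java
public class BranchComparison {
    // Compare master vs. feature branch by median rather than mean: with
    // multi-modal wall-times, a t-test on a handful of runs is easily fooled,
    // while medians over large samples stay informative.
    public static double medianSpeedup(double[] masterTimes, double[] featureTimes) {
        double master = MedianNormalizer.median(masterTimes);
        double feature = MedianNormalizer.median(featureTimes);
        return (master - feature) / master; // fraction of time saved; > 0 means faster
    }
}
```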