How to Evaluate Remote Caching and Execution¶
Between roughly 2006 and 2008, Google developed remote caching and execution technologies to scale software development in its massive monorepo. This platform included Forge, the remote caching and execution system, and Blaze, a tool for building large, multilanguage software projects that was eventually open sourced as Bazel. The advantages of this original platform were so obvious that it essentially sold itself inside Google, and it ultimately inspired the EngFlow platform.
Today, EngFlow is one of several competing remote caching and execution products now available in the commercial space. This post describes how we continuously benchmark our own product against different configurations to ensure that we offer the best possible value. We hope that sharing our methodology might help you evaluate whether remote caching or execution is right for your organization.
Using EngFlow's repository to evaluate the different platforms¶
Our own codebase is a good example of the kind of large, polyglot monorepo that our current and prospective customers wrestle with. It comprises approximately the following numbers of BUILD targets and nonempty lines of code, including vendored third-party dependencies:
Language | # Targets | # Lines |
---|---|---|
Java | 912 | 353844 |
Python | 109 | 10697 |
Go | 780 | 1103170 |
Bash | 74 | 1702 |
C++ | 89 | 89318 |
Rust | 19 | 513 |
TypeScript | 330 | 72672 |
Protobuf | 324 | 7342 |
Total | 2637 | 1639258 |
While our own codebase isn't the largest, it's of sufficient size and complexity to reflect the market we aim to serve with our product. Consequently, we believe that running an analysis against our own monorepo's CI provides more realistic insights than running against a sample of open source repositories.
We got these numbers from a Bazel query script that counts conventional binary, library, and test targets and ignores blank lines, treating lines that contain only parentheses, brackets, and comment delimiters as blank as well. We invoke it through a wrapper script, `engflow-build-targets-and-loc.sh`, that supplies our per-language patterns.
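If you want to gather similar numbers for your own repository, a rough sketch of the approach could look like the script below. The script name, the `kind()` pattern, and the regex for "effectively blank" lines are illustrative assumptions, not the exact script we use:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a per-language target and line counter; not the exact
# script we use internally.
#
# Usage: count-targets-and-loc.sh <rule-prefix> <file-extension>...
# Example: count-targets-and-loc.sh java .java
set -euo pipefail

rule_prefix="$1"; shift

# Count conventional binary, library, and test targets for this language.
targets=$(bazel query "kind('${rule_prefix}_(binary|library|test)', //...)" \
  --output=label | wc -l)

# Count nonempty lines of code, treating lines that contain only parentheses,
# brackets, whitespace, and comment delimiters as blank.
lines=0
for ext in "$@"; do
  n=$(find . -type f -name "*${ext}" -exec cat {} + \
    | grep -cvE '^[][(){}#/*[:space:]]*$' || true)
  lines=$((lines + n))
done

echo "${rule_prefix}: ${targets} targets, ${lines} lines"
```

A wrapper would then call this once per language, for example `count-targets-and-loc.sh go .go`, and sum the results.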
Tools¶
We naturally adopted Bazel from the very beginning of the company. This isn't surprising, considering how many former Blaze/Bazel developers and other former Googlers are on our team. More importantly, it was the right tool for the job, given the scope and complexity of the system we aimed to develop.
We use GitHub Actions for continuous integration (CI), with remote caching and remote execution enabled. There's nothing revolutionary here, except for the parallel remote execution and caching enabled by our platform.
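For context, enabling remote caching and execution in Bazel comes down to a handful of flags. The `.bazelrc` fragment below is an illustrative sketch; the config name and endpoint URLs are placeholders, not our production settings:

```
# Illustrative .bazelrc fragment; the endpoints are placeholders.

# Run build and test actions on a remote execution cluster and reuse its cache.
build:remote --remote_executor=grpcs://example.cluster.engflow.com
build:remote --remote_cache=grpcs://example.cluster.engflow.com

# Stream build events so invocations show up in the cluster's result UI.
build:remote --bes_backend=grpcs://example.cluster.engflow.com
build:remote --bes_results_url=https://example.cluster.engflow.com/invocation/

# Schedule far more actions than local cores allow, since execution is remote.
build:remote --jobs=200

# Fall back to local execution for actions that can't run remotely.
build:remote --remote_local_fallback
```

A CI job then adds `--config=remote` (or a per-cluster variant) to its Bazel invocations.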
Methodology¶
There are many variables that contribute to overall performance and cost. Here is how we chose to control and observe a few key variables while maintaining a reasonably realistic experiment.
Load¶
We started by defining the load we want to throw against our remote caching and execution clusters. We considered two options: mirroring our full CI load against all clusters, or using only a subset of our actual load for benchmarking.
In our primary CI pipeline, we run our tests on Linux, macOS, and Windows on both x86_64 and arm64. (We also run a few "special" pipelines that use additional static analysis tools or integration tests against actual cloud services.) This allows us to test every pull request thoroughly and gives us high confidence that our changes don't end up causing issues in production.
However, mirroring all our CI jobs against multiple clusters would create N times the load, costing N times as much as our regular CI pipeline. We want to use the shadow pipelines to collect data continuously, so we chose the subset option to keep our cloud cost manageable.
We also decided to simplify our shadow pipelines to only run `bazel build //...`. While this doesn't give us a perfect mirror of what we do on CI, we believe it is a good enough approximation.
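As a rough illustration (not our actual CI definition), a shadow-pipeline step can be as small as the following; the config name and environment variable are assumptions, with the per-cluster settings living behind Bazel `--config` entries like the one sketched in the Tools section:

```bash
#!/usr/bin/env bash
# Hypothetical shadow-pipeline step; the config name and environment variable
# are illustrative, not our actual CI definition.
set -euo pipefail

# Each shadow cluster gets its own Bazel config (remote endpoint, worker sizing,
# and so on); the CI job matrix selects one via CLUSTER_CONFIG.
CLUSTER_CONFIG="${CLUSTER_CONFIG:-remote}"

# Build everything, but skip the test execution that our primary CI performs.
bazel build --config="${CLUSTER_CONFIG}" //...
```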
Platform¶
Before running our benchmarks, we also had to decide which of our supported cloud providers would host the experiment. Since most of our regular CI runs on Amazon Web Services (AWS), this decision was quickly settled in favor of Amazon's cloud offering.
Accounting¶
Lastly, we had to decide whether to create a dedicated AWS account for each benchmarking cluster or to deploy everything in a single account. Here, we took inspiration from our production clusters and gave each benchmarking cluster its own dedicated account. We don't strictly need the isolation that dedicated AWS accounts provide for these experiments, since all of the clusters run the same internal builds; however, this setup makes it easy to see the infrastructure cost of each configuration at a glance.
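One nice consequence of per-account isolation is that cost reporting requires no extra tagging discipline. As an illustrative example (hypothetical dates, run from a management account with Cost Explorer enabled), an AWS CLI query like the following groups monthly spend by member account, so each benchmarking cluster shows up as its own line item:

```bash
# Illustrative only: report one month of spend grouped by member (linked) account.
aws ce get-cost-and-usage \
  --time-period Start=2024-05-01,End=2024-06-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=LINKED_ACCOUNT
```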
Conclusion¶
We're measuring the performance impact of code changes at a larger scale, as well as the total cost of ownership. Running each cluster in its own cloud sandbox serves both of these goals, and the fact that this configuration closely matches what we do in production is a nice benefit as well.
Collecting data from our actual CI builds over time helps us reason about the experience, performance, and associated costs of EngFlow. Isolated benchmarks and curated load testing have their place, and these techniques help us identify bottlenecks and improve our products. But ultimately, we want to understand performance and cost tradeoffs from our customers' (and potential customers') perspective, which this methodology allows.
If you're considering running a similar experiment on your own code base, we'd love to hear about it! Following this methodology may prove more valuable than trying to interpret or recreate our results with repositories that might not be representative of your own.
And, of course, we'd love to help you set up an EngFlow trial to include our product in your analysis. Just visit the EngFlow home page and click the Request Trial button to start the process!