
How to Evaluate Remote Caching and Execution

Between roughly 2006 and 2008, Google developed remote caching and execution technologies to scale its massive monorepo-based software development operation. This platform included Forge, the remote caching and execution service, and Blaze, a tool for building large, multilanguage software projects that was eventually open sourced as Bazel. The advantages of this original platform were so obvious that it effectively sold itself, and it ultimately inspired the EngFlow platform.

Today, EngFlow is one of several remote caching and execution products competing in the commercial space. This post describes how we continuously benchmark our own product against different configurations to ensure that we offer the best possible value. We hope that sharing our methodology might help you evaluate whether remote caching or execution is right for your organization.

Using EngFlow's repository to evaluate the different platforms

Our own codebase is a good example of the kind of large, polyglot monorepo that our current and prospective customers wrestle with. It comprises approximately the following numbers of BUILD targets and non-empty lines of code, including vendored third-party dependencies:

Language     # Targets      # Lines
Java               912       353844
Python             109        10697
Go                 780      1103170
Bash                74         1702
C++                 89        89318
Rust                19          513
TypeScript         330        72672
Protobuf           324         7342
Total             2637      1639258

While our own codebase isn't the largest, it's of sufficient size and complexity to reflect the market we aim to serve with our product. Consequently, we believe that running an analysis against our own monorepo's CI provides more realistic insights than running against a sample of open source repositories.

We got these numbers from the following Bazel query script, which counts conventional binary, library, and test targets and ignores blank lines. It considers lines containing only parentheses, brackets, and comment delimiters to be blank as well.

count-build-targets-and-loc.sh
#!/usr/bin/env bash
#
# Counts BUILD targets and lines of code for specified language rule prefixes
#
# Only a ballpark approximation, and only covers `*_(binary|library|test) rule`
# targets. It should suffice for a quick analysis of repo size and complexity,
# but exact counts may require manual tuning.

main() {
  if [[ "$#" -eq 0 ]]; then
    usage 1 >&2
  elif [[ "$1" =~ ^(-h|--help)$ ]]; then
    usage 0
  fi

  local TOTAL_TARGETS=0
  local TOTAL_LINES=0

  print_row "language" "targets" "lines"
  print_row "--------" "-------" "-----"

  for language_rule_prefix in "$@"; do
    produce_language_counts "$language_rule_prefix"
  done

  print_row "--------" "-------" "-----"
  print_row "total" "$TOTAL_TARGETS" "$TOTAL_LINES"
}

usage() {
  echo "Usage: ${0##*/} [-h | --help] [<language rule prefix patterns>]"

  while IFS="" read line; do
    if [[ "${line:0:1}" != "#" ]]; then
      printf '%s\n' "${lines[@]:1}"
      exit "$1"
    fi
    lines+=("${line:2}")
  done <"$0"
}

print_row() {
  printf "%-8s %9s %8s\n" "$1" "$2" "$3"
}

file_extension_regex_for_language() {
  if [[ "$1" == "cc" ]]; then
    echo "[ch][^.]*"
  else
    echo "$1"
  fi
}

produce_language_counts() {
  local language="$1"
  local kind_query="kind('${language}_(binary|library|test) rule', //...)"
  local targets_query="filter('^//', ${kind_query})"
  local files_query=(
    "filter('\.$(file_extension_regex_for_language ${language})$',"
    "filter('^//', kind('source file', deps(${kind_query}, 1))))"
  )
  local files=(
    $(bazel query "${files_query[*]}" 2>/dev/null | sed -e 's#//##' -e 's#:#/#')
  )
  local blank_line_pattern='^[][[:space:]{}()/*#]*$'

  local num_targets="$(bazel query "${targets_query}" 2>/dev/null | wc -l)"
  # Guard against an empty file list so grep doesn't block reading stdin.
  local num_lines=0
  if [[ "${#files[@]}" -ne 0 ]]; then
    num_lines="$(grep -hv "$blank_line_pattern" "${files[@]}" | wc -l)"
  fi

  print_row "$language" "$num_targets" "$num_lines"
  ((TOTAL_TARGETS+=$num_targets))
  ((TOTAL_LINES+=$num_lines))
}

main "$@"

We invoke this script using a wrapper script containing our language patterns:

engflow-build-targets-and-loc.sh
#!/usr/bin/env bash

exec count-build-targets-and-loc.sh java py go sh cc 'ru?st?' '[tj]sx?' proto

Tools

We naturally adopted Bazel from the beginning of the company. This isn't surprising, considering how many former Blaze/Bazel developers and other former Googlers are on our team. More importantly, it was the right tool for the job, given the scope and complexity of the system we aimed to develop.

We use GitHub Actions for continuous integration (CI), with remote caching and remote execution enabled. There's nothing revolutionary here, except for the parallel remote execution and caching enabled by our platform.
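To make that concrete, the bazel invocation behind such a CI step looks roughly like the sketch below. The endpoint, results URL, and certificate paths are hypothetical placeholders rather than our actual configuration; the flags themselves are standard Bazel remote execution and Build Event Service options.

ci-remote-build.sh
#!/usr/bin/env bash
#
# Minimal sketch of a CI step with remote caching and execution enabled.
# The endpoint and credential paths are placeholders, not our real settings.

set -euo pipefail

CLUSTER="grpcs://demo.cluster.example.com"  # hypothetical cluster endpoint

exec bazel test \
  --remote_executor="${CLUSTER}" \
  --remote_cache="${CLUSTER}" \
  --bes_backend="${CLUSTER}" \
  --bes_results_url="https://demo.cluster.example.com/invocation/" \
  --tls_client_certificate=/path/to/client.crt \
  --tls_client_key=/path/to/client.key \
  //...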

Methodology

There are many variables that contribute to overall performance and cost. Here is how we chose to control and observe a few key variables while maintaining a reasonably realistic experiment.

Load

We started by defining the load we want to throw against our remote caching and execution clusters. We considered two options: mirroring our full CI load against all clusters, or using only a subset of our actual load for benchmarking.

In our primary CI pipeline, we run our tests on Linux, macOS, and Windows, on both x86_64 and arm64. (We also run a few "special" pipelines that use additional static analysis tools or run integration tests against actual cloud services.) This allows us to test every pull request thoroughly and gives us high confidence that our changes won't cause issues in production. However, mirroring all of our CI jobs against multiple clusters would create N times the load and cost N times as much as our regular CI pipeline. Since we want to use the shadow pipelines to collect data continuously, we chose the subset option to keep our cloud costs manageable.

We also decided to simplify our shadow pipelines to only run bazel build //.... While this doesn't give us a perfect mirror of what we do on CI, we believe it is a good enough approximation.
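In practice, each shadow pipeline run boils down to repeating that same build against every benchmark cluster. The script below is a simplified sketch of that idea rather than our actual pipeline code: the cluster endpoints are hypothetical, and a real run would add each cluster's credentials and any configuration-specific flags.

shadow-build.sh
#!/usr/bin/env bash
#
# Simplified sketch of a shadow pipeline: run the same build against every
# benchmark cluster configuration. The endpoints are hypothetical placeholders.

set -euo pipefail

BENCHMARK_CLUSTERS=(
  "grpcs://benchmark-a.cluster.example.com"
  "grpcs://benchmark-b.cluster.example.com"
)

for cluster in "${BENCHMARK_CLUSTERS[@]}"; do
  bazel build \
    --remote_executor="${cluster}" \
    --remote_cache="${cluster}" \
    --bes_backend="${cluster}" \
    //...
done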

Platform

We also had to decide, before running our benchmarks, on which of our supported cloud providers we wanted to host the experiment. Since most of our regular CI runs on Amazon Web Services (AWS), this decision was quickly settled in favor of Amazon's cloud offering.

Accounting

Lastly, we had to decide whether to create a dedicated AWS account for each benchmarking cluster or to deploy everything in a single account. Here, we took inspiration from our production clusters and gave each cluster a dedicated cloud configuration. We don't really need the isolation aspects of dedicated AWS accounts for our experiments, since all of the clusters are running the same internal builds. However, this setup allows us to easily grasp the infrastructure cost of each different configuration.
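One nice consequence: with a dedicated AWS account (and CLI profile) per cluster, producing a per-cluster cost report is a single Cost Explorer query per account. The profile names and date range in the sketch below are made up for illustration; aws ce get-cost-and-usage is the standard AWS CLI command for this.

benchmark-cluster-costs.sh
#!/usr/bin/env bash
#
# Sketch: report one month of unblended cost for each benchmark cluster's
# dedicated AWS account. Profile names and the date range are placeholders.

set -euo pipefail

for profile in benchmark-cluster-a benchmark-cluster-b; do
  printf '%s: ' "${profile}"
  aws ce get-cost-and-usage \
    --profile "${profile}" \
    --time-period Start=2024-01-01,End=2024-02-01 \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --query 'ResultsByTime[0].Total.UnblendedCost.Amount' \
    --output text
done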

Conclusion

We're measuring the performance impact of code changes at a larger scale, as well as the total cost of ownership. Running each cluster in its own cloud sandbox lets us achieve both goals at once. The fact that this configuration more closely matches what we do in production is a nice benefit as well.

Collecting data from our actual CI builds over time helps us reason about the experience, performance, and associated costs of EngFlow. Isolated benchmarks and curated load testing have their place, and these techniques help us identify bottlenecks and improve our products. But ultimately, we want to understand performance and cost tradeoffs from our customers' (and potential customers') perspective, which is exactly what this methodology allows.

If you're considering running a similar experiment on your own codebase, we'd love to hear about it! Following this methodology may prove more valuable than trying to interpret or recreate our results with repositories that might not be representative of your own.

And, of course, we'd love to help you set up an EngFlow trial to include our product in your analysis. Just visit the EngFlow home page and click the Request Trial button to start the process!