Not all builds are made equal: Using priorities to expedite remote execution of the builds and tests that matter most

Imagine reading this post on your internal Slack:

Please hold off on pushing your PRs for a bit until the patch release is fully cut. We don't have enough capacity to handle all the load and the patch is blocked because of that.

Ouch.

But let's be real: not all builds are made equal. Some builds are more urgent than others, like here, where the builds required for a patch release should be expedited over all others.

Yet there's got to be a better solution than asking everyone else to not push changes.

A second hard reality: parts of a build may sometimes have to wait before they are executed. This is true irrespective of whether you use a remote execution system or not - for a local build, if you use a smaller machine or your machine's cores, memory, etc. are already in use, compiling code or running tests takes longer.

With remote build execution (RBE) systems, similar limits exist. Commonly, RBE systems autoscale, i.e., if there is higher load, they provision more machines, and if there is less load, they release idle machines. But you may only have a maximum number of executors that you can scale out to. And even if you can scale out as far as ever required to handle peak load, scale-out is not instant: it takes some time for new machines to become ready to execute your builds. As long as too few machines are available, some work is queued, waiting to be executed.

Here's where priorities come into play. When you have fewer executors available than required to handle all execution load in parallel, you want to influence which builds get executed first. Without priorities, builds are commonly executed on a first-come, first-served basis. But that's not necessarily what you want. For example, an emergency build should take precedence and get executed by the first available executor.
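To make the difference concrete, here is a minimal Python sketch of a queue that hands out work by priority and falls back to submission order within the same priority. It is a toy model, not how any particular remote execution scheduler is implemented: `BuildQueue`, `submit`, `next_build`, `drain`, and the build names are hypothetical, and it assumes that a larger number means more urgent.

```python
import heapq
import itertools


class BuildQueue:
    """Toy queue: higher priority first, submission order within a priority."""

    def __init__(self):
        self._heap = []
        self._arrivals = itertools.count()

    def submit(self, name, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the
        # largest value first. The arrival counter breaks ties in
        # submission order.
        heapq.heappush(self._heap, (-priority, next(self._arrivals), name))

    def next_build(self):
        # Called whenever an executor becomes available.
        _, _, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Return the order in which queued builds would be executed."""
    order = []
    while len(queue) > 0:
        order.append(queue.next_build())
    return order


queue = BuildQueue()
queue.submit("post-submit #101")
queue.submit("post-submit #102")
queue.submit("patch-release", priority=5)
queue.submit("pre-submit #103", priority=2)
print(drain(queue))
# ['patch-release', 'pre-submit #103', 'post-submit #101', 'post-submit #102']
```

With plain submission order, the patch release would have waited behind both post-submit builds.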

Understand your workflow

The third reality is that it's not just about emergencies or not. We have complex workflows and some parts are always more urgent than others. For example, take this workflow, which is heavily inspired by the one we have at EngFlow:

Develop: Our software engineers work on the code base. They continuously build code and run tests from their workstation. Eventually, they create a pull request (PR).
Pre-submit: Once the pull request is created, the CI system builds the code and runs pre-submit tests. To merge a change, all checks have to pass and at least one other engineer has to approve the change. Each time the PR is updated during the review, e.g. to address comments made by a reviewer, the pre-submit checks are run again. Once the PR has been approved and CI is green, the engineer merges the PR.
Post-submit: Now, post-submit tests are run. These include heavier integration tests, which are omitted from pre-submit to allow faster merges. If post-submit fails, changes are either reverted or need to be fixed forward with another PR.
Releases: Once a week, we cut a new release, which needs to pass even more rigorous testing.
Background tasks: Some builds and tests are run periodically on a schedule. This includes flaky test detection or pre-release creation.

This workflow is at the core of maintaining, improving and extending the products we offer. Your workflow may differ. But either way: the better that workflow is streamlined, the faster you can move. And we all want to move fast: we want to promptly fix bugs, improve existing features, add new ones.

From Workflow to Priorities

At EngFlow, we want to unblock our developers as much as possible, so they can stay in flow.

That's why the closer a build is to a developer, the faster we want it to be executed. Research says that faster builds increase productivity. In "Developer Productivity for Humans, Part 4: Build Latency, Predictability, and Developer Productivity", Ciera Jaspan and Collin Green conclude:

Through experimentation, we’ve confirmed that even moderate improvements to build latency result in changes to developer behavior that indicate greater productivity: more builds, more lines of code written, and faster completion times for small/medium changes.

Therefore, we translate the above workflow to the following priorities:

Priority 5: Emergency builds get the highest priority. These are not part of the usual workflow described above. Instead, this priority is only set during emergencies. That's crucial, because otherwise those extremely time-sensitive builds may have to compete for executors with other builds.

Priority 4: Interactive builds from developers receive the second-highest priority. Enable engineers to stay in flow, and they will be both more productive and happier.

Priority 3: Building releases and CI for release branches come next. We prioritize these over other CI jobs, so that we can create releases and patch releases promptly at any time.

Priority 2: Pre-submit CI lands here. We want engineers to be able to merge approved and green PRs quickly, so we want pre-submit to be prioritized over post-submit.

Priority 0: This is the default priority imposed by Bazel and other build systems, if no explicit priority is set. We use this for post-submit CI.

Priority -1: Background tasks get the lowest priority of all. They may still be triggered at a higher priority if need be, but usually this is not necessary.

The priorities you pick may be different: they are specific to how you work and where you put your focus.
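One way to keep such a scheme in a single, reviewable place is a small lookup shared by CI scripts and developer tooling. The sketch below is illustrative only: `BuildPriority` and `bazel_priority_flag` are hypothetical names, and the values simply mirror the list above. Bazel does have a `--remote_execution_priority` flag, but how a remote execution service interprets the number is server-dependent, so the assumption that a larger value means more urgent needs to be checked against your service's documentation.

```python
from enum import IntEnum


class BuildPriority(IntEnum):
    """Priorities from the list above (assumption: larger means more urgent)."""

    EMERGENCY = 5
    INTERACTIVE = 4
    RELEASE = 3
    PRE_SUBMIT = 2
    # 1 is not assigned in the list above.
    POST_SUBMIT = 0  # the default when no explicit priority is set
    BACKGROUND = -1


def bazel_priority_flag(priority):
    """Format the Bazel flag that requests the given remote execution priority."""
    return f"--remote_execution_priority={int(priority)}"


print(bazel_priority_flag(BuildPriority.INTERACTIVE))
# --remote_execution_priority=4
```

A pre-submit CI job could then append `bazel_priority_flag(BuildPriority.PRE_SUBMIT)` to its Bazel invocation, while post-submit jobs pass nothing and get the default.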

Mind the gap

But wait, "Where's Priority 1?", some of you may ask. Well spotted! That gap is intentional.

We currently don't have a merge queue. When a PR is approved and green, it can be merged directly. However, if conflicting PRs are merged in close succession, pre-submit may not detect those conflicts; only post-submit does. CI turns red, and that's disruptive. The conflict needs to be resolved either by reverting changes or fixing forward. A merge queue helps manage and control the order of merging pull requests to reduce the risk of such conflicts.

As we grow and more engineers merge more PRs, there's a good chance we will introduce a merge queue, too. So we reserved a priority slot for it, at the priority we feel makes sense.

Why reserve a value?

We check the priority that each kind of build gets, be it a CI build or a developer build, in to our source control. This is where things get interesting. What if we wanted to add a new value between two existing priority values, given that values need to be integers?

To make space, we'd have to either shift up all priorities above, or shift down all priorities below. Let's take the example of the merge queue, which we think has a lower priority than developer builds, but a higher priority than the default 0. We don't want to shift the default value, because that's what's used everywhere when we don't set a specific priority.

Shifting up is hard, too. We have the priority of developer builds checked in to our source control. If we change that value, what happens to the builds that aren't synced to the most recent version of the main branch? And what about the other long-lived branches we have, for example older release branches that might need a patch?

Introducing a new priority value is therefore often not an instant change, but requires a migration.
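The arithmetic behind this can be sketched in a few lines of Python. The helpers `free_values_between` and `values_to_shift_up` are hypothetical, and the "dense" scheme is an invented counterexample in which no value was reserved, so pre-submit sits at 1 instead of 2.

```python
def free_values_between(lower, upper):
    """Integer priorities strictly between two existing values."""
    return list(range(lower + 1, upper))


def values_to_shift_up(used, new_value):
    """Existing values that must move up by one to free new_value."""
    return sorted(v for v in used if v >= new_value)


# With a reserved slot: post-submit is 0, pre-submit is 2.
print(free_values_between(0, 2))  # [1]

# Hypothetical dense scheme without a reserved slot:
# -1 background, 0 post-submit, 1 pre-submit, 2 release,
# 3 interactive, 4 emergency.
dense = [-1, 0, 1, 2, 3, 4]
print(free_values_between(0, 1))  # []
print(values_to_shift_up(dense, 1))  # [1, 2, 3, 4]
```

In the dense scheme, every value from pre-submit upwards has to change, including the one checked in for developer builds and the ones on older release branches.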

So how do we avoid such migrations? By planning ahead. Think about which parts of your current workflow are time-critical to you. But also think about whether your workflow may include additional important stages in the future. There's no direct cost to reserving a value, but introducing a new one down the line may be a different story.

Don't overdo it, though. A simple system has many advantages. Leaving larger gaps between each currently used value over-complicates the system and is prone to future misuse. The cost of that may be greater than an eventual migration. So think about realistic gaps in your workflow, rather than leaving space "just in case".

TL;DR

In conclusion:

Priorities help make your time-critical builds and tests faster.
Consider your workflow and identify which parts are most important. Think about likely future changes to that workflow. Set up each part of the workflow with the appropriate priority.
Finally, reserve the highest priority for emergencies, so when you need to get a critical patch release done, you don't have to ask your colleagues to stop pushing their PRs for a while…

Read our documentation on how to set up priorities in Bazel when using an EngFlow RBE cluster.