Putting bors on a PIP
We have a problem: the average queue of ready-to-test PRs to the main Rust repo has been steadily growing for a year. And at the same time, the likelihood of merge conflicts is also growing, as we include more submodules and Cargo dependencies that require updates to Cargo.lock.
This problem could threaten our ability to produce Rust 2018 on time, because as the year progresses, there will be an increasing amount of contention for the PR queue. My goal in this post is to avoid this outcome, without reliving the Rust 1.0-era experience of Alex working full time to land massive rollups.
In particular, the Rust All Hands is coming up next week, and I think it’s a great opportunity to dive into these issues, so after chatting for a while with Alex I wanted to set out some ideas.
Goals and problems
There are two major bors experiences we care about:
- Small PRs from early contributors. We want these to land very quickly to provide a good contribution experience.
- Major PRs. We want to avoid requiring lots of rebasing, or having too long a delay before landing.
Let’s say “bors time” is the amount of time a PR is in mergable and r+’ed state but not yet landed. Quantitatively, the above goals probably map to:
- Low average bors time
- Low maximum bors time
Today, rollups generally group together a large number of small PRs; we then attempt to land those rollups aggressively. The result is improved average bors time, but often at the cost of worsening maximum bors time. That’s because of a few factors:
Rollups generally prioritize small PRs over old PRs. That is, the “normal” order for bors is to attempt the oldest ready-to-test PR. But when doing a rollup, we cherry pick throughout the queue, and then give maximum priority to the rollup.
Rollups generally bounce on the first couple of attempts, which is “wasted” time during which an older PR might have been able to merge.
In addition, rollups tend to cause a need for rebasing, which for major PRs introduces significant extra latency: the author has to get around to doing a rebase, then get back in the queue, with a non-trivial chance of being pre-empted by another rollup that requires further rebasing.
Overall queue length
The steady state of the queue has been worsening over time. The average queue length has roughly quadrupled over the last year, from 5 ready PRs to 20.
The longer the queue, the worse the situation is for major PRs, because the effects of rollups and rebase requirements is multiplied by the standing queue length.
I believe that, because of rollups, our current average bors time remains tolerable. But our maximum bors time has gotten quite bad as the steady-state queue size has continued to grow. We need to rebalance.
Reduce absolute cycle time
The most direct action is to reduce absolute cycle time by improving the build system. That will help across the board, and there is usually low-hanging fruit to be had on whatever our current-slowest build scenarios are.
I’m not qualified to say much more here, but I’m hopeful that, during the All Hands, the Infra Team can work together to come up with plans or guidance on this front.
Failed builds are often very expensive, since failures often occur late in the build cycle, meaning that we lose ~2.5 hours of serialized work.
Currently, spurious failures largely come down to timeouts; reducing absolute cycle time will help.
More radically, we could consider storing artifacts at each build stage, allowing us to retry a build without going through the cycle scratch. But that would amount to a complete reworking of the build system, and probably isn’t plausible for the Rust 2018 timeline.
But there are also “legit” failures — and there’s potentially a lot we could do to help there. We effectively have a two-stage CI system today:
Stage 1: automatic PR testing. Today this is a single Travis build, hence a small slice of our overall test suite. Stage 1 testing is almost always complete by the time a reviewer looks at a PR.
Stage 2: serialized, full PR testing via bors.
Stage 1 testing is parallelizable and masks queue length because it’s generally dominated by the time it takes to get a review. We can decrease the likelihood of legit failures by testing more build scenarios in stage 1. For example, we could gate stage 1 on Windows as well as Linux. And we could include more of the test suite in stage 1. Generally, we have some amount of free capacity to work with here, and we can always through in additional builders to get more.
We should also consider strongly gating on stage 1 passing before ever attempting stage 2 testing on a PR.
More generally, are there ways we can better take advantage of the two-stage system we have today, and the way in which stage 1 is “masked” by review latency?
Being more strategic with rollups
Right now rollups generally gather together a number of “easy” PRs. However, this comes at the cost of “hard” PRs, because rollups skip them in the queue, thus forcing a rebase. And rollups themselves tend to bounce on the first few tries, essentially blocking the build queue for some period.
Here are some ways we could be smarter about rollups:
Prioritize a rollup PR as if it is as old as the oldest PR is contains. That is, in bors’s normal queue ordering, a rollup would not make it possible to “jump the queue”. Rather, it would be running at the same time as its first PR normally would, but we’d be trying to “get more” out of the run.
Make it possible to pre-test rollup PRs with a greater number of build scenarios, without blocking the queue. We could do this by expanding the set of stage 1 tests (mentioned above) in general, or by having a separate “try”-style build command for rollups that tests a larger subset, but still much smaller than the full build (i.e. “Stage 1.5”). We should make it possible to aggressively run this larger suite of tests on a rollup build, before going to the full test suite. The core idea is that a rollup should ~never bounce.
Fix test failures within a rollup, rather than removing PRs from the rollup. This can be done by pushing changes back to the original PR branches. Basically, once we’ve invested in including a PR in a rollup, we should “see it through”.
Consider including “bigger” PRs in a rollup, and seeing them through as above.
Data to gather
There’s a bunch of data that would be great to have before the All Hands to help guide discussion:
- How often are we making rollups?
- How many PRs are included in the average rollup?
- How often do rollups bounce?
- What % spurious?
- How often do non-rollups bounce?
- What % spurious?
- For non-spurious failures, what are the major cases we’re missing in stage 1 that we catch in stage 2? E.g., Windows? A particular part of the test suite?
Are there other near term steps we could take to head off problems with the queue length? (While long-term improvements are important too, the immediate risk is that we will struggle to produce Rust 2018 over the next few months).