Design doc: Chromium Commit queue

Objective

Have a formal way for developers to ensure their patch won't break the Continuous Integration checks with a relatively high confidence level.

Background

Before the Chromium Commit Queue, it's on each developer's shoulder to manually run multiple try jobs on the Try Server and check their results before committing.

This is "brain wasted time" as this can be completely automated. The Commit Queue aims at automating this manual verification so the developer can start right away working on the next patch. This is necessary to scale at a sustained ~100 commits/day.

Overview

The CQ polls rietveld for CLs that are ready to be committed. The Chromium's fork of Rietveld has a 'Commit' checkbox. When the author or the reviewer checks it, it will be included in the next poll. Once the CQ learns about the CL, it verifies the author or one of the reviewers approving this CL is a full committer, then runs the presubmit checks, then runs try jobs and then commit the patch on the behalf of the author, faking its credential. The whole project is written in python.

Infrastructure

It runs as a single-thread process. The infrastructure is really minimal since it's just a logic layer above the current infrastructures to automate something that was done manually by the developers. In particular, the CQ reuses:

Detailed Design

Runs the following loop:
  1. Polls http://codereview.chromium.org/search?closed=3&commit=2 to find new issues to attempt to commit.
  2. For each issue found,
    1. Runs preliminary checks,
      1. Make sure someone is a full committer
      2. Make sure there's a LGTM or the CL is TBR'ed.
    2. Runs presubmit checks, including OWNERS check.
    3. Sends new try jobs with -r HEAD.
      1. They are publicly visible at http://build.chromium.org/p/tryserver.chromium/waterfall?committer=commit-bot@chromium.org.
      2. Astute users will see that it runs a subset of the tests. Please contribute to Flaky tests fight so more tests can be used.
      3. If ToT is broken, the commit queue will retry your patch on an older revision automatically.
    4. Makes sure no new comments were added to the issue.
    5. If everything passes, once the tree is open;
      1. Commits the change on the behalf of the issue owner, even if the owner is not a Chromium commiter.
    6. If one of the steps fails, the 'commit' checkbox is cleared and the issue is removed from the queue.
      1. However, they author or a reviewer can re-check the box and the CQ will try again.
Issues are processed asynchronously so whatever faster try job completes first wins. This is important as a flaky CL won't bottleneck the remaining one. This is mainly because the number of commit per hour is disproportionate to the try server latency, e.g. the full build and test cycle time for all the platforms.

Faking author in subversion

To commit on the behalf of the author, an unconventional technique is used in subversion with a server-side pre-commit hook. The code is at: http://src.chromium.org/viewvc/chrome/trunk/tools/depot_tools/tests/sample_pre_commit_hook?view=markup

The control flow is:
  1. On the client side, with a checkout with a special committer credential;
    1. svn commit --with-revprop realauthor=<author to attribute the commit to>
  2. On the server side;
    1. A pre_commit_hook intercepts the commits
    2. It opens <repo>/db/transactions/<tx>.txn/props and parses its data
    3. If the realauthor svn property if found,
      1. It verifies svn:author is a special committer
      2. It does simple sanity checks
      3. It replaces svn:author with realauthor's value
      4. It sets the commit-bot svn property to know this revision was committed by the CQ.
    4. The updated data is saved in the transaction file.
This works much better than using svn propset --revprop since there is no race condition and the revision property modifications are done during the commit. This technique can only be used when it is possible to set a server side hook so for example, it can't be used with projects hosted on code.google.com.

Project Information

Caveats

The main problem is test flakiness. The CQ works around this problem partially by retrying failed tests a second time. It retries compile failure with a full rebuild, versus an incremental build normal, to work around cases of broken incremental compiles, which does happen relatively frequently. Fixing these two problems is outside the scope of the CQ project.

Rietveld doesn't enforce a coherent svn url mapping on its CLs, causing CLs to be ignored by the CQ.

Latency

The CQ is very slow at the moment. This is why the test isolation effort was started. The CQ is bound by:
  • Rietveld polling, which is at best 10 seconds.
  • Synchronous presubmit check execution, which is synchronous and single threaded, but with a timeout.
  • Try job execution, including automatic retries.
  • Waiting for the tree to open.
  • The actual commit, which is fairly fast.

Scalability

The main scalability issue is running the presubmit checks and sending the try jobs. Many of the presubmit checks assume they are not running concurrently on the system and running them in parallel on one VM could cause random problems. If the CQ used a more decentralized approach, it would scale much better for that.

Redundancy and Reliability

There are multiple single points of failure;
  • The CQ itself, running on a single process.
  • The try server, which is itself not redundant.

Security Considerations

The commit queue require a full committer involvement, either to be the author of the CL or to be a reviewer giving approval. The security depends on Rietveld assumptions about its meta data:
  • A rietveld issue cannot change of owner.
  • A rietveld comment cannot be faked by a third party.
All communications happens over https.

Testing Plan

Comprehensive set of unit tests was written. The code itself sends stack traces in case of exception for monitoring purposes.
Comments