Design: 3-way Integration with Rietveld and the Try Server

Objective

Drastically reduce the perceived time (wall-clock time) from asking the infrastructure to commit a change list (CL) to the time it is committed. This reduces the time a developer has to care about a particular CL, by getting it in or refusing it as fast as possible.

Background

The Chromium team uses the Commit Queue, which in turn uses the Try Server, to perform automated testing of their pending CL on multiple platforms in multiple configurations. Rietveld is used for code reviews and the Chromium team uses a slightly forked version to better integrate it with the rest of the Chromium infrastructure.

Before the integration, the Commit Queue is awaken as soon as the CL author, or one of his reviewer, checks the infamous Commit checkbox on Rietveld. The Commit Queue polls updates from Rietveld at a rate of 0.1hz to get new CLs to test.

This means that pre-commit verifications, which include code styling, OWNERS check, a proper LGTM from at least one Chromium Committer, properly testing the patches on the Try Server, only starts at the moment the checkbox is set. This means a maximal time is spent between the moment the checkbox is checked and the change is committed.

Also, before the integration, because of the way try jobs are triggered, the Commit Queue cannot trust the try jobs triggered by anyone but itself. The "previous way" of triggering try jobs is to commit a raw diff into subversion. Let's just say that it is suboptimal.

Overview

This project is about implementing techniques in the integration so the Commit Queue can efficiently run presubmit checks and Try Jobs on the Try Server before the CL author request the commit to be merged into the source tree;

The Commit Queue is also trusting the try jobs triggered by others than itself. The new way involves the Try Server polling Rietveld directly to fetch the diff directly from the code review. This is the reason the try jobs can be "trusted".

Rietveld acts as a datastore to map a Verification steps to a Patchset.

Infrastructure

The infrastructure consist of:
  1. Rietveld is running on AppEngine, using AppEngine's supplied GAIA account management.
    1. It becomes the primary database for the try job results.
  2. The Commit Queue process, which is a stand alone process.
  3. The Try Server, which is an infrastructure itself.

Detailed Design

1. Trusting try jobs

The first task is to make sure Try Jobs results on Rietveld can be trusted. This is done by differentiating try jobs triggered directly from Rietveld from any other try jobs. The polling code is in try_job_rietveld.py and sets the build property try_job_key that is then used to certify this try jobs was indeed triggered from Rietveld in view_chromium.py. The json packets are pushed via Buildbot's HttpStatusPush implementation from the Try Server back into Rietveld's /status_listener URL as HTTP POST requests.

2. Polling for pending CLs

Rietveld already has a mechanism to search for issues with specific criteria. For example, 
While polling it really not optimal, it permits reusing the current API. The /search interface supports specifying an opaque datastore cursor so only new issues are returned. The cursor to use on the next request is returned inside the json data.

3. Reacting to CL modifications

For each new CL modified, the CQ must decide if it reacts to the CL. To enable preemptive CL verification, the following steps are used on each new CL;
  1. Determine if the CL author is a trusted author, e.g. a Chromium Committer.
    1. If it is and the user asked for review, initiate checks.
    2. If not, wait for a LGTM from a trusted reviewer. Note that it's not waiting for the Commit checkbox to be checked.
  2. Keep track of the previous checks for the last patchset, discarding data as new patchsets are uploaded.
  3. Add a note on the issue if a check failure occurs.
Then for commit time;
  1. Upon Commit checkbox signal, only re-run the Secure Checks and commit.
    1. Secure checks include things that must not change; the CL description must have stayed the same, the tree is open, etc.

Project Information

This project is driven primary by rogerta@ and maruel@. It benefits immediately from Try Server enhancement projects like the Swarm integration project but it is still designed to work without Try Server performance boost and work around its general sluggishness.

Caveats

  1. Presubmit checks are not yet run on the Try Server but are run locally instead. This needs to be changed to accelerate the CQ and keep its integrity.
  2. This significantly increases the Try Server utilization. Without the Swarm integration project, this would likely overload the Try Server.

Latency

The whole purpose of this project is to reduce the latency of the pre-commit checks. The latency will boils down to
  1. Poll rate.
  2. Pre-commit forced checks.
At that point, we are not aiming at optimizing the commit delay below 30 seconds, lower than that is a nice to have.

Scalability

The Commit Queue is a single threaded process. As such, it is bound by its slowest operations. This currently include the PRESUBMIT checks and applying/commiting CLs. The goal is to make checks happen earlier and reuse the results from previous checks whenever they are trusted, not to make the checks faster.

Redundancy and Reliability

There are multiple single points of failure;
  1. Rietveld on AppEngine. If AppEngine blows up, we're screwed.
  2. The Try Server, which is a single-threaded process. Restarting it, in case of hard failure, is only lossy for the jobs in progress at the time of shutdown. In case of catastrophic failure, all the previous try jobs would need to be re-run. This already happened in the past and can be handled relatively fast, within an hour or so.
  3. The Commit Queue process itself is also single-threaded and single-homed. It is designed to be mostly lossless when restarted or if an exception occurs. The Try Jobs database can be reconstructed from the information stored on Rietveld.

Security Consideration

The security is using the following assumptions;
  1. Rietveld administrators are not rogue.
  2. Rietveld does not permit a CL author to fake its identity and the patchsets, once created, are immutable.
    1. The commit queue ensures the CL description is not modified between the moment it is running its presubmit check and the moment it is committing.
  3. The Commit Queue is doing proper verification w.r.t. which is a trusted committer.
    1. In the end, the Commit Queue process is the one deciding if a commit gets in or not and doing it.
  4. The committer are trusted. They already can commit anything they want. It is an non-goal to restrict what committers can commit.

Testing Plan

The integration itself, being an integration project, is mostly tested directly "on prod". While there are a few unit tests to check basic functionality, most of the bugs are at integration points. This being a productivity tool for developers, it is expected that they have some clue about what's going wrong.
Comments