Design doc: Try server

Objective

Introduce a way to run pre-commit tests on a variety of platforms for the open source Chromium project.

Background

Before the chromium try server was written, developers had to run every test locally, or commit and hope for the best. This became untenable as more and more OSes became supported, so the try server was born. It became possible to keep working while one or many slaves ran the tests.

Overview

The try server runs a configuration similar to the continuous integration server, except that it is triggered not by commits but by "try job requests". The try server is based on the buildbot project and reuses the same code as the continuous integration servers. It has one "builder" per supported configuration, for example "win" for Windows (debug is assumed) and "linux_rel" for Ubuntu/release. Once a try job is requested, the try server selects one of the slaves connected for that configuration and runs all the build steps to check out, build and run the selected tests. The whole project is written in python.
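
To make the builder-per-configuration setup concrete, here is a minimal sketch of what the corresponding master.cfg fragment could look like, assuming the buildbot 0.8-era Python API; the slave names, build steps and test targets are illustrative placeholders, not the production configuration.

  # Sketch only: one builder per supported configuration, each with its own
  # build factory. Slave names and steps are made up for illustration.
  from buildbot.config import BuilderConfig
  from buildbot.process.factory import BuildFactory
  from buildbot.steps.shell import ShellCommand

  def make_try_factory(test_targets):
    """Checkout, apply the requested patch, compile and run the tests."""
    f = BuildFactory()
    f.addStep(ShellCommand(name='update', command=['gclient', 'sync']))
    f.addStep(ShellCommand(name='apply_patch',
                           command=['patch', '-p0', '-i', 'job.diff']))
    f.addStep(ShellCommand(name='compile', command=['make'] + test_targets))
    for target in test_targets:
      f.addStep(ShellCommand(name=target, command=['./' + target]))
    return f

  c = BuildmasterConfig = {}
  c['builders'] = [
    # Debug is assumed unless the builder name carries a _rel suffix.
    BuilderConfig(name='win', slavenames=['vm1', 'vm2'],
                  factory=make_try_factory(['base_unittests'])),
    BuilderConfig(name='linux_rel', slavenames=['vm10', 'vm11'],
                  factory=make_try_factory(['base_unittests'])),
  ]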

Infrastructure

The try server is composed of:
  • One main process, the try server, running on a server and based on the buildbot.net code.
  • Several slaves (on the order of hundreds) connected to the server, generally VMs, each able to run one of the supported configurations.
  • A try job request queue. It is currently an svn repository, but an effort to use a better mechanism is ongoing.

Detailed design

Server

The server is really just a standard buildbot configuration with a custom source trigger. The custom source trigger polls the subversion repository and reads the metadata and its diff file. Note that this access method is going to be deprecated, as we are adding the ability to trigger try jobs directly from Rietveld. The metadata contains information like:
  • The list of configurations to run the patch on and, optionally, for each configuration, the list of tests to run.
  • The author and their email address.
  • Optional: the associated code review patchset.
  • The revision of the chromium checkout to apply the patch against.
The server then selects a build slave and runs the whole build and test sequence on it. Once this completes, it emails a status report back to the author.
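
As an illustration of what the custom poller has to extract from a request, here is a hedged sketch of a metadata parser; the key=value layout and the field names used below (email, revision, issue, patchset, bot) are assumptions made for the example, not the exact format written by the client tools.

  # Sketch of parsing a try job request into something the scheduler can use.
  # The on-disk format and key names are illustrative assumptions.
  def parse_try_job(metadata_text):
    job = {}
    for line in metadata_text.splitlines():
      if '=' not in line:
        continue
      key, value = line.split('=', 1)
      job[key.strip()] = value.strip()
    # 'bot' lists the requested configurations, each optionally followed by a
    # test filter, e.g. "win:base_unittests;net_unittests,linux_rel".
    bots = {}
    for entry in filter(None, job.get('bot', '').split(',')):
      name, _, tests = entry.partition(':')
      bots[name] = tests.split(';') if tests else []
    job['bot'] = bots
    return job

  sample = ('email=someone@chromium.org\n'
            'revision=12345\n'
            'issue=10001\n'
            'patchset=2\n'
            'bot=win:base_unittests,linux_rel\n')
  print(parse_try_job(sample))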

User

Most of the user-facing commands can be seen on the usage page. The flow is:
  1. git-cl/gcl detects the list of modified files and generates a diff. It retrieves metadata associated with the diff, like the rietveld issue number and the CL author.
  2. git-cl/gcl passes all this information to trychange.py.
  3. trychange.py packages this information and saves it into a subversion repository, which the try server polls, eventually triggering a build (see the sketch after this list).
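
The sketch below illustrates step 3 under the assumption that a request amounts to a metadata file plus a diff committed to a working copy of the try job queue repository; the file names, layout and helper function are hypothetical, not trychange.py's actual implementation.

  # Hypothetical packaging step: write the request into an svn working copy of
  # the try job queue and commit it, which the try server will then pick up.
  import os
  import subprocess

  def submit_try_job(queue_checkout_dir, job_name, metadata, diff):
    paths = []
    for ext, content in (('meta', metadata), ('diff', diff)):
      path = os.path.join(queue_checkout_dir, '%s.%s' % (job_name, ext))
      with open(path, 'w') as f:
        f.write(content)
      paths.append(path)
    subprocess.check_call(['svn', 'add'] + paths)
    subprocess.check_call(
        ['svn', 'commit', '-m', 'Try job request %s' % job_name] + paths)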

Slave

The try server slaves are plain, standard buildbot slaves.

Caveats

Using a diff file makes things tricky: merge information is lost, binary file handling is non-trivial, and the exact handling varies across platforms, especially on Windows. The problem with subversion should be clear: it does not even have a command to apply a diff that it generated itself (!)

Having the try server completely disconnected from the code review makes it possible to send a different diff than the one uploaded for review. This is being worked on.

The try server is a significant contention point, and the current mechanism forces the patch file data to be transferred multiple times. The Rietveld-try server integration will significantly improve on that.

The way chromium try jobs are triggered is a good counter-example of how to implement this efficiently. The current design, with a subversion repository storing patch files, exists for historical reasons, because of inertia, and because subversion is already used on the slaves at the moment.

Latency

The latency is the sum of:
  • Time for git/svn to generate the diff.
  • Time to commit the diff to the subversion repository. This is actually significant and can occasionally be on the order of tens of seconds.
  • The polling interval, on the order of 10 seconds.
  • The wait for a build slave to become available. Instantaneous unless a configuration is at 100% usage.
  • The wait for the build to complete. By far the largest contributor.
  • Email propagation delay.

Scalability

The try server process itself doesn't do much besides serving summary web pages and transferring the build steps' stdout and the diffs in and out. The number of try slaves is still bounded to the range of a few hundred, because the try server's control loop runs in a single python thread.

Redundancy and Reliability

The reliability is limited by multiple single points of failure:
  • A single subversion repository stores the patches.
  • A single process runs the try server.
  • The network connection between the try server and a slave must stay alive for the whole duration of the build.
  • The source control servers must be able to support the load of hundreds of slaves checking out simultaneously.

Security Considerations

Only committers can request a try job, and committers can already commit anything, so the try server does not grant them any new capability.

Testing plan

Since this is primarily developer support infrastructure, the level of unit testing is relatively low. Still, presubmit checks verify the basic functionality of the server, e.g. that it starts without crashing.

Historical notes

Previously, the patches were sent directly to the master via a simple HTTP post. It was quick to implement and worked fine; the try job started right away, without any form of polling. It was also greatly insecure. Using svn was simply a quick workaround to add authentication to the patches, since only users with proper subversion credentials can commit to the try job subversion repository. It also allowed external contributors to trigger try jobs: the reason svn was used to store the diffs is that external contributors could use the same credentials they already used for their commits, vastly simplifying management. Note that neither Google AppEngine nor Google Cloud Storage existed or was publicly available at that time. However, storing thousands of patch files in a single directory showed some limitations in svnserve, causing it to slow down to a crawl during a commit, i.e. a try job request.

Improvements

The first major improvement is to have the slave fetch the patch directly from rietveld. This way, the try server doesn't have to copy the diff from the master to the slave, significantly reducing overall I/O. Also, in the git case, the git process on the slave can merge the diff correctly onto the checkout, removing much of the fuzziness that occurs with a diff-based dataflow.
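
A minimal sketch of what the slave-side step could look like, assuming Rietveld serves the raw patch at a /download/issue<issue>_<patchset>.diff URL (an assumption here, as is the helper itself) and that git is available on the slave:

  # Sketch only: fetch the patch straight from the code review server and let
  # git merge it onto the checkout instead of shipping it through the master.
  import subprocess
  import urllib2

  def apply_patch_from_rietveld(server_url, issue, patchset, checkout_dir):
    url = '%s/download/issue%d_%d.diff' % (server_url, issue, patchset)
    diff = urllib2.urlopen(url).read()
    # --3way lets git fall back to a three-way merge using the blob ids named
    # in the diff, recovering context that a plain patch(1) call would lose.
    proc = subprocess.Popen(['git', 'apply', '--3way'],
                            cwd=checkout_dir, stdin=subprocess.PIPE)
    proc.communicate(diff)
    if proc.returncode:
      raise RuntimeError('Could not apply issue %d patchset %d' %
                         (issue, patchset))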

Using this method implies a 1:1 mapping between a try job and a patchset on a code review server, whether gerrit or rietveld. While many people claim this would be somewhat annoying, it's not as bad as it may seem, especially if support for "out-of-review" try job requests, i.e. the current method, is kept. If so, arbitrary try jobs, like ones involving multi-repo changes, can still be tried manually. Let me clarify: I think it is totally sane to have two different ways to trigger a try job, one integrated in the normal code review flow, the other for totally experimental changes or CLs that span multiple repos. Still, if I were to do it, I would not host these “out-of-band” try jobs on the SCM; I’d use AppEngine or Cloud Storage. AppEngine has the advantage that it can be event based, whereas Cloud Storage would require polling, as is currently done.

Implementing this technique makes it possible to assert the authenticity of the patch that was tried, enabling the commit queue to reuse the test results if they are not too old.