Swarm Design

Objective

"Build and test at scale". This is focusing on the test part.

Drastically reduce whole test cycle time by scaling test sharding across multiple slaves seamlessly.

It does so by integrating Swarm within the Try Server and eventually the continuous integration masters.

Background

The Chromium waterfall currently uses completely manual test sharding. A "builder" slave compiles and creates a .zip of the build output. Then "testers" download the zip, checkout the sources, unpack the zip inside the source checkout and run a few tests.

Each new "tester" configuration is created to run a subset of the tests so overall most of the meta-shards takes roughly the same amount of time. All this configuration is done manually and is error-prone.

For the Try Server, there is currently no test sharding at all since it'd be relatively complicated to setup inside buildbot.

So overall, while we can continue throwing more faster hardware at the problem, the fundamental issue remains; as tests gets larger and slower, the end-to-end test latency will continue to increase, slowing down developer productivity.

This is a natural extension of the Chromium Try Server (initiated and written by maruel@ in 2008) that scaled up through the years and the Commit Queue (initiated and written by maruel@ in 2011).

Before the Try Server, team members were not testing on other platforms than the one they were developing on, causing constant breakage. This helped getting at 50 commits/day.

Before the Commit Queue, the overhead of manually triggered proper tests on all the important configuration was becoming increasingly cumbersome. This could be automated and was done. This helped sustain 100 commits/day.

But these are not sufficient to scale the team velocity at over 150 commits per day. Big design flaws remain in the way the team is working. In particular, to scale the Chromium team productivity, significant changes in the infrastructure need to happen. In particular, the latency of the testing across platforms need to be drastically reduced.  That requires getting the test result in O(1) time, independent of:
  1. The number of platforms to test on.
  2. The number of test executables.
  3. The number of test cases.
  4. The duration of each test cases, especially in the worse case.
  5. The size of the data required to run the test.
  6. The size of the checkout.
To achieve this, sharding a test must be a constant cost. This is what the swarm integration is about.

Overview

Using Swarm works around Buildbot's limitations and permits sharding automatically and in an unlimited way. For example, it permits sharding the test cases on a large smoke test across multiple slaves to reduce the latency of running it. Buildbot on the other hand requires manual configuration to shard the tests and is not very efficient at large scale.

By reusing the Isolated testing effort, we're going to be able to shard efficiently the swarm slaves. By integrating swarm infrastructure inside buildbot, we'll work around the manual sharding that buildbot requires.

To recapitulate the isolated testing design doc, the isolate.py script is used to archive all the run time dependencies of a unit tests on the "builder" to the Isolate Content-Addressed Datastore hosted on AppEngine. Since the content store is content-addressed by the SHA-1 of the content, only new contents are archived. Then only the SHA-1 of the manifest describing the whole dependency is sent to the Swam Slaves, with an index of the shards that it needs to run. That is, 40 bytes for the hash plus 2 integers is all that is required to know what OS is needed and what files are needed to run a shard of test cases along a minimal python script of less than 1000 lines!

You can find an overview presentation slides at https://docs.google.com/a/google.com/presentation/d/18DS0Za8s9O9hCei2I2KTHUPXFV39HfAe5hbZRiUzNG8/view. Sorry, Googlers-only.

Infrastructure

The infrastructure reuses much of the current Chromium open source infrastructure. The different parts are:
  • One buildbot master controlling the builds, for example Chromium Continuous Integration masters or the Chromium Try Server. Each runs on a physical server.
  • Multiple buildbot slaves. They are connected to a buildbot master. The slave has the responsibility to run the "update" (sync sources) and "compile" steps. They have the special configuration "GYP_DEFINES=test_isolation_mode=hashtable" that make them archive the Isolate hashtable.
    • The buildbot slave acts as a swarm client, it sends to the swarm server the manifest, the command to run and the endpoint where to retrieve the necessary dependencies.
  • The isolate content-addressed datastore, it can be either on NFS/Samba or served off HTTP.
  • A Swarm Server running on AppEngine, which connects to a swarm slave provider, which provides VMs to run a test shard.
  • Multiple Swarm Slaves in the range of thousands VMs. They connect to the Swarm server and run test shards as they are available to run. They do not need to have connectivity to the Swarm client, only the Swarm Server.

Detailed Design

The workflow goes as follow;
  1. Have the buildbot "builder", e.g. the machines that compile the sources for Continuous Integration or the Try Server to archive the dependencies for foo_unittest_run into a hashtable with a json manifest that indicates the hash<->filepath relationship for each test to be run. This is the isolated testing infrastructure.
  2. Have Swarm based testers efficiently distribute the test shards across an army of slaves. Each of the slave grab the manifest, only map the necessary files from the hashtable and run a shard of the test using the isolated testing infrastructure. They keep a local cache for rarely modified files, like test data to reduce overall lower I/O. In practice, we see >90% cache hit rate.
  3. Have the buildbot "testers" simply defer to Swarm to run the tests and report the stdout back.
So there is really 2 layers of control involved. The first being buildbot master which controls the overall "build", which includes syncing the sources, compiling, requesting the test to be run on Swarm and asking it to report success or failure. The second layer is the Swarm server itself which "micro-distribute" test shards. Each test shard is actually a subset of the test cases for a single unit test executable. All the unit tests are run concurrently. So for example for a Try Job that requests base_unittests, net_unittests, unit_tests and browser_tests to be run, they are all run simultaneously on different slaves, and slow tests, like browser_tests, are further sharded across multiple slaves, all simultaneously.

The whole project is written in python.

Project information

Caveats

The isolated testing infrastructure moves around a large number of bits. This will likely put a lot of pressure on the network I/O at the edge. Depending on how bad it is in practice, we will measure as implementation continues, we'll decide on the three implementations:
  1. Do nothing, keep the Isolate Content-Addressed Datastore on AppEngine.
  2. Keep the Isolate Content-Addressed Datastore on AppEngine but put a Squid proxy inside the DMZ to reduce inbound traffic.
  3. As the worst case, give up on AppEngine and keep a local server inside the DMZ. It has a few downsides;
    1. It's now impossible to upload data from outside the DMZ. Use case: a developer does a local build, isolate.py hashtable's it and do a Swarm request to build it.
    2. It's now impossible to have Swarm slaves outside the DMZ. Use case: Google Compute.

Latency

This project is primarily aimed at reducing the overall latency from "ask for green light signal for a CL" to getting the signal. The CL can be "not committed yet" or "just committed", the former being the Try Server, the later the Continuous Integration servers. The latency is reduced by enabling a higher of parallel shard execution and removing the constant costs of syncing the sources and zipping the test executables, both which are extremely slow, in the orders of minutes.

Other latencies includes;
  1. Time to archive the dependencies in the hashtable.
  2. Time to trigger a Swarm run.
  3. Time for the slaves to react to a Swarm run request.
  4. Time for the slaves to fetch the dependencies, map them in a temporary directory.
  5. Time for the slaves to cleanup the temporary directory and report back stdout/stderr to the Swarm master.
  6. Time for the Swam master to react and return the information to the Swarm client.

Scalability

Python based AppEngine servers are not super scalable. We enable threadsafe mode on the python 2.7 system to improve its performance. Developing it in Golang was considered but there's little enthusiasm on the team to learn another programming language.

The possibility to efficiently shard. For example for test cases taking several tens of seconds becoming laggards to get the test results. This is worked around by over-sharding and using run_test_case.py, which is a script that runs each test case independently, one shard per CPU.

Redundancy and Reliability

There are multiple single points of failures
  1. The isolate content-addressed datastore which is hosted on AppEngine.
  2. The Swarm master, which is also hosted on AppEngine.
  3. The buildbot masters, which are single-threaded processes written in python.
There is currently no redundancy for the buildbot infrastructure, if a VM dies, it is simply replaced right away by a sysadmin. The swarm slaves are intrinsically redundant. The hashtable data store isn't redundant or reliable, it can be rebuilt from sources if needed. If it fails, it will block the infrastructure.

Security Consideration

Since the whole infrastructure is visible from the internet, like this design doc, proper DACL need to be used. Both the Swarm master and the Isolate datastore require valid GAIA accounts. The credential verification is completely managed by AppEngine.

Testing Plan

N/A; everything is tested "on prod" right at the moment. Though much of the code being used is thoroughly unit tested.
Comments