Isolated Testing Infrastructure

Objective

"Build and test at scale". This is focusing on the test part. Drastically reduce whole test cycle time by scaling test sharding across multiple slaves seamlessly. It does so by integrating Swarming within the Try Server and eventually the Continuous Integration masters.

This page is about the Chromium specific Swarming infrastructure. For the general Swarming Design, see https://code.google.com/p/swarming/wiki/SwarmingDesign.

Background

  • The Chromium waterfall currently uses completely manual test sharding. A "builder" slave compiles and creates a .zip of the build output. "Testers" then download the zip, check out the sources, unpack the zip inside the source checkout and run a few tests.
  • Each new "tester" configuration is created to run a subset of the tests so that overall most of the meta-shards take roughly the same amount of time. All this configuration is done manually and is error-prone.
  • For the Try Server, there is currently no test sharding at all, since it would be relatively complicated to set up inside Buildbot.
  • So overall, while we can keep throwing more and faster hardware at the problem, the fundamental issue remains: as tests get larger and slower, the end-to-end test latency will keep increasing, slowing down developer productivity.
  • This is a natural extension of the Chromium Try Server (initiated and written by maruel@ in 2008) that scaled up through the years and the Commit Queue (initiated and written by maruel@ in 2011).
  • Before the Try Server, team members were not testing on platforms other than the one they were developing on, causing constant breakage. The Try Server helped the team reach 50 commits/day.
  • Before the Commit Queue, the overhead of manually triggering the proper tests on all the important configurations was becoming increasingly cumbersome. Automating this helped sustain 100 commits/day.

But these are not sufficient to scale team velocity past 150 commits per day; big design flaws remain in the way the team works. To scale the Chromium team's productivity, significant infrastructure changes need to happen: in particular, the latency of testing across platforms needs to be drastically reduced. That requires getting the test result in O(1) time, independent of:

  1. Number of platforms to test on.
  2. Number of test executables.
  3. Number of test cases.
  4. Duration of each test case, especially in the worst case.
  5. Size of the data required to run the test.
  6. Size of the checkout.

To achieve this, sharding a test must have a constant cost. This is what the Swarming integration is about.
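As a rough model (an idealized sketch, not a measured formula), with enough Swarming slaves the end-to-end latency becomes

    T_total ≈ T_archive + T_trigger + max(T_shard_1, ..., T_shard_n) + T_collect

rather than the serial sum over all shards; the remaining terms are constant overheads that do not grow with the number of platforms, test executables or test cases, which is what "O(1)" means above.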

Overview

Using Swarming works around Buildbot's limitations and permits automatic, effectively unlimited sharding. For example, the test cases of a large smoke test can be sharded across multiple slaves to reduce the latency of running it. Buildbot, on the other hand, requires manual configuration to shard tests and is not very efficient at large scale.


By reusing the Isolated testing effort, we can shard tests efficiently across the Swarming slaves. By integrating the Swarming infrastructure inside Buildbot, we work around the manual sharding that Buildbot otherwise requires.

To recapitulate the Isolated design doc, isolateserver.py is used on the "builder" to archive all the runtime dependencies of a unit test to the Isolate Server. Since the store is content-addressed by the SHA-1 of the content, only new content is uploaded. Then only the SHA-1 of the manifest describing the whole dependency tree is sent to the Swarming slaves, along with the index of the shard each needs to run. That is, 40 bytes for the hash plus two integers is all that is required to know which OS is needed and which files are needed to run a shard of test cases with run_isolated.py.
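As an illustration, a .isolate file is a GYP-style Python literal listing the command and the runtime dependencies of one test executable. The exact key names have changed between versions of the format, so treat this as a sketch rather than the canonical schema:

    # base_unittests.isolate -- illustrative only; key names vary between
    # versions of the .isolate format.
    {
      'conditions': [
        ['OS=="linux"', {
          'variables': {
            # Command run on the Swarming slave, relative to the mapped-in root.
            'command': ['./base_unittests'],
            # Files and directories the test needs at runtime; directories are
            # included recursively.
            'files': [
              'base_unittests',
              'test_data/',
            ],
          },
        }],
      ],
    }

isolateserver.py hashes each listed file, uploads only the blobs the server does not already have, and produces a .isolated manifest whose SHA-1 is the only value that needs to be passed around afterwards.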

How the infrastructure works

  1. Try Server builder (linux_rel, mac_rel, win_rel) archives the builds on https://isolateserver.appspot.com.
  2. The builder triggers a build on the builder swarm_triggered.
  3. swarm_triggered does only two things: trigger Swarming tasks and collect the results (see the sketch below).
The Commit Queue uses Swarming indirectly via the Try Server.
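A rough sketch of what swarm_triggered does per Try Job, using the swarming.py client; the server URL and the flag names here are assumptions that depend on the deployment and the client version, but the trigger-then-collect shape is the point:

    #!/usr/bin/env python
    # Hypothetical sketch of the swarm_triggered logic: trigger every test
    # first so they all queue simultaneously, then block on the results.
    import subprocess

    SWARMING = 'https://chromium-swarm.appspot.com'  # assumed server URL
    ISOLATE_SERVER = 'https://isolateserver.appspot.com'

    def run_swarmed(tests):
      """tests maps a task name to the SHA-1 of its .isolated manifest."""
      for name, isolated_hash in tests.items():
        subprocess.check_call([
            'swarming.py', 'trigger',
            '--swarming', SWARMING,
            '--isolate-server', ISOLATE_SERVER,
            '--task-name', name,
            isolated_hash,
        ])
      # Collecting after triggering everything lets the slow tasks overlap
      # instead of running back to back.
      failed = []
      for name in tests:
        if subprocess.call(['swarming.py', 'collect', '--swarming', SWARMING, name]):
          failed.append(name)
      return failed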

A Try Builder is a column on the Try Server waterfall and describes a premade configuration.
For each of linux_rel, mac_rel and win_rel, tests are slowly moved over to the .isolate format and run on the swarm_triggered Try Builder.
  • linux_rel, mac_rel, win_rel
    • Archive the tests during the compile step. They archive only if a foo_swarm test filter is specified, e.g. git cl try -b linux_rel:browser_tests_swarm.
    • Trigger swarm_triggered, which continues from there.
  • swarm_triggered
    • Is a thin interface to Swarming. swarming.py takes each .isolated task and shards it across multiple Swarming slaves simultaneously.
    • The main difference between the normal bots and the Swarming slaves is that the Swarming slaves do not have a source checkout.
    • Swarming slaves are independent of Try Slaves; they are much weaker and smaller machines.

Summary

  Try Builder                  | Compiles | Checks out the sources             | Multiple test execution
  linux_rel, mac_rel, win_rel  | Yes      | Yes                                | Serially
  swarm_triggered              | No       | No (gets everything from isolated) | Parallel (on Swarming slaves)

So there are really two layers of control involved. The first is the Buildbot master, which controls the overall "build": syncing the sources, compiling, requesting the tests to be run on Swarming and reporting success or failure. The second layer is the Swarming server itself, which "micro-distributes" test shards. Each test shard is a subset of the test cases of a single unit test executable. All the unit tests are run concurrently. So for a Try Job that requests base_unittests, net_unittests, unit_tests and browser_tests, all of them run simultaneously on different Swarming slaves, and slow tests, like browser_tests, are further sharded across multiple slaves, all simultaneously.
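The per-shard splitting relies on the test executables themselves: gtest binaries honor the GTEST_TOTAL_SHARDS and GTEST_SHARD_INDEX environment variables, so a Swarming slave only needs two integers to run its subset of the test cases. A minimal local sketch of the same mechanism:

    #!/usr/bin/env python
    # Run a gtest executable split into N shards, all in parallel, locally.
    # Swarming does the same thing, except each shard runs on a different slave.
    import os
    import subprocess
    import sys

    def run_sharded(executable, total_shards):
      procs = []
      for index in range(total_shards):
        env = os.environ.copy()
        # Standard gtest sharding protocol: the binary runs only the test
        # cases assigned to (index, total_shards) and skips the rest.
        env['GTEST_TOTAL_SHARDS'] = str(total_shards)
        env['GTEST_SHARD_INDEX'] = str(index)
        procs.append(subprocess.Popen([executable], env=env))
      return max(p.wait() for p in procs)

    if __name__ == '__main__':
      sys.exit(run_sharded('./browser_tests', 4))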

Here are the diagrams:

What it looks like for a user

[Figure: Try Job swarm_triggered]

How the Try Server uses Swarming

[Figure: Chromium Try Server Swarming Infrastructure]

What using Swarming directly looks like

[Figure: Using Swarming directly]

Project information

  • This project is an integral part of the Chromium Continuous Integration infrastructure and the Chromium Try Server.
  • While this project will greatly improve Chromium Commit Queue performance, there is no direct relationship: the performance improvement, while we're aiming for it, is purely a side-effect of the reduced Try Server testing latency.
  • Active project members: maruel@, csharp@, vadimsh@.
  • Code: https://code.google.com/p/swarming/.

AppEngine Servers

Canary Setup

Roadmap

The general approach we use is to do TS->CQ->CI to progressively increase the load and stability requirements. It boils down to:
  1. Roll out to Try Server for manual job. [done]
    1. Improve stability, ensure there are enough slaves, rinse and repeat.
  2. Roll out to CQ i.e. the CQ use Swarming jobs on the TS. [done]
    1. Improve stability, ensure there are enough slaves, rinse and repeat.
  3. Add non-gatekeeper bots on the CI.
    1. Improve stability, ensure there are enough slaves, rinse and repeat.
  4. Add as gatekeeper bots on the CI in addition to the current ones.
Once this has baked for a while, we'll reconsider the next moves. We first need to build a track record of stability, and to catch scaling issues we need to scale progressively. We made the whole infrastructure fairly resilient to partial availability, which is all too common at Google scale.

In parallel:
  1. Adding more tests and more configurations. The .isolate format itself is evolving to better handle the N-dimensional configuration matrix that needs to be reduced.
    1. For example, Debug, Chromium on ChromiumOS, ASAN, Component builds, etc.
  2. Improving correctness by enforcing a read-only tree; more on this later.
  3. Adding more slave types. Chris is doing structural changes so we can add more varied configurations. As an example XP vs Win7 or "GPU" on-demand Swarming slave.
  4. Fixing ACLs so users can trigger jobs themselves. Vadim is (soon) working on this. <- this will be awesome.
  5. Making Swarmed tests compatible with the flakiness dashboard. This will likely be a blocker on further deployment.
  6. Continuously improving monitoring and reliability.
The transition path involves making tests always run in isolated mode. The tests still run on the same machine but only have access to the files listed in the .isolate file. This means that tests run this way can be switched over to run on Swarming without any additional work by the developers.
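Conceptually, isolated mode means the test runs against only the mapped-in dependency list, so a missing file fails locally instead of failing only once the test moves to a Swarming slave. A sketch of the idea (not the actual run_isolated.py implementation):

    #!/usr/bin/env python
    # Conceptual sketch of "isolated mode": copy only the declared files into
    # a temporary directory and run the test there, so undeclared dependencies
    # simply do not exist for the test.
    import os
    import shutil
    import subprocess
    import tempfile

    def run_isolated_locally(src_root, files, command):
      tmp = tempfile.mkdtemp(prefix='isolated-')
      try:
        for rel_path in files:
          dst = os.path.join(tmp, rel_path)
          dst_dir = os.path.dirname(dst)
          if dst_dir and not os.path.isdir(dst_dir):
            os.makedirs(dst_dir)
          shutil.copy(os.path.join(src_root, rel_path), dst)
        return subprocess.call(command, cwd=tmp)
      finally:
        shutil.rmtree(tmp)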

Juicy Stats

Caveats

The Isolated testing infrastructure moves around a large number of bits, which will likely put a lot of pressure on network I/O at the edge. Depending on how bad it is in practice, which we will measure as implementation continues, we'll decide between three options:
  1. Do nothing, keep the Isolate Server on AppEngine.
  2. Keep the Isolate Server on AppEngine but put a Squid proxy inside the DMZ to reduce inbound traffic.
  3. As the worst case, give up on AppEngine and keep a local server inside the DMZ. It has a few downsides.

Latency

This project is primarily aimed at reducing the overall latency from "ask for a green light signal for a CL" to getting that signal. The CL can be "not committed yet" or "just committed", the former being the Try Server case, the latter the Continuous Integration servers. The latency is reduced by enabling a higher degree of parallel shard execution and by removing the constant costs of syncing the sources and zipping the test executables, both of which are extremely slow, on the order of minutes.
Other latencies include:
  1. Time to archive the dependencies to the Isolate Server.
  2. Time to trigger a Swarming run.
  3. Time for the slaves to react to a Swarming run request.
  4. Time for the slaves to fetch the dependencies and map them into a temporary directory.
  5. Time for the slaves to cleanup the temporary directory and report back stdout/stderr to the Swarming master.
  6. Time for the Swarming master to react and return the information to the Swarming client running on swarm_triggered.

Scalability

Python-based AppEngine servers are not super scalable. We enable threadsafe mode on the Python 2.7 runtime to improve performance. Rewriting the server-side code in Go is being considered.
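Enabling threadsafe mode is a one-line change in the AppEngine app.yaml; with it, a single Python 2.7 instance serves requests concurrently instead of one at a time. The relevant lines only (other entries omitted):

    runtime: python27
    api_version: 1
    threadsafe: true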

Redundancy and Reliability

There are multiple single points of failure:
  1. The Isolate Server which is hosted on AppEngine.
  2. The Swarming master, which is also hosted on AppEngine.
  3. The Buildbot masters, which are single-threaded Python processes.
There is currently no redundancy for the Buildbot infrastructure; if a VM dies, it is simply replaced right away by a sysadmin. The Swarming slaves are intrinsically redundant. The Isolate Server data store isn't redundant or particularly reliable, but its content can be rebuilt from the sources if needed; if it fails, it blocks the infrastructure until it recovers.

Security Consideration

Since the whole infrastructure is visible from the internet, like this design doc, proper DACLs need to be used. Both the Swarming master and the Isolate Server require valid Google accounts; credential verification is managed entirely by AppEngine.

Testing Plan

All the code (Swarming master, Isolate Server and swarming_client) is tested on the canary before being rolled out to prod. See the Canary Setup above.

FAQ

Why not a faulting file system like FUSE?

Faulting file systems are inherently slow: every time a file is missing, the whole process hangs while the FUSE adapter downloads the file synchronously, then the process resumes. Multiply that by the roughly 8000 files browser_tests lists. With a pre-loaded content-addressed file system, all the files can be safely cached locally and downloaded simultaneously. The savings and speed improvement are enormous.
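The pre-loaded approach works because the manifest names every needed blob up front: each file is keyed by the SHA-1 of its content, so blobs already in the local cache are skipped and the rest can be downloaded in parallel. An illustrative sketch (not the actual run_isolated.py code; the URL layout is an assumption):

    #!/usr/bin/env python
    # Prefetch every blob referenced by a content-addressed manifest
    # {relative path: sha1}, reusing a local cache keyed by hash.
    import os
    import urllib2
    from multiprocessing.pool import ThreadPool

    CACHE_DIR = os.path.expanduser('~/.isolated-cache')
    STORE_URL = 'https://isolateserver.appspot.com/content/retrieve/'  # assumed URL layout

    def fetch(sha1):
      path = os.path.join(CACHE_DIR, sha1)
      if not os.path.exists(path):
        data = urllib2.urlopen(STORE_URL + sha1).read()
        with open(path, 'wb') as f:
          f.write(data)
      return path

    def prefetch(manifest):
      if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
      # All downloads proceed simultaneously instead of one page fault at a time.
      ThreadPool(16).map(fetch, set(manifest.values()))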