The GPU bots' recipe


Introduction


As mentioned on the GPU testing page, the GPU bots use a new framework from Chrome's infrastructure team called recipes. The previous infrastructure sent commands from the buildbot master to the various build machines. In comparison, recipes delegate most of the responsibility of deciding how to compile the code, run tests, etc. to the machine doing the build. Compared to the legacy buildbot scripts, recipes vastly simplify the tasks of modifying the bots' configuration, adding new steps, and locally testing changes to the bots. They virtually eliminate waterfall restarts when making changes to the bots.

This page describes the GPU recipe, how it's configured on the various bots, and how to modify and test it locally.

High-Level Description

The GPU recipe is run on almost all of the bots on the chromium.gpu waterfall (all but the Android bot, as of this writing), the GPU bots on the chromium.webkit waterfall, the GPU bots on the tryserver.chromium waterfall, and the GPU bots on the chromium.gpu.fyi waterfall. All of the bots on these waterfalls are split into builders and testers. The builders compile the code, upload the build results, and trigger the testers. The testers download the builds and run the tests.

On the GPU bots, the binaries are sent from builders to testers using isolates. An isolate contains the binary as well as any dependent libraries or data files. The GPU testers do not check out the Chromium workspace; they receive all of the binaries, test harnesses and data files in the isolates coming from the builders. The high-level point is that any new tests added to the GPU bots must be made to work with isolates. The Release builders produce the static library build; the Debug builders produce the component build, to speed up linking. Isolates work with both flavors.

The try servers run the same recipe used on the other waterfalls. There are two primary differences. First, the try servers download and apply a patch to the source tree. Second, when running the pixel tests, they expect to download a reference image from cloud storage, rather than potentially uploading one to cloud storage. The reason for this behavior is that a bad patch may cause a try server to produce a bad image, so the try servers' results cannot be trusted. Because the try servers rely on the other waterfalls to produce their reference images, there must be at least one bot with the same GPU and operating system configuration on the main waterfall (e.g. chromium.gpu) for each such configuration on the try server waterfall.

Code Organization

The GPU recipe lives in the tools workspace under tools/build/. Here is a .gclient for fetching the sources (only developers working at Google can fetch the internal sources; I don't know whether the recipe will run without them):

solutions = [
  { "name"        : "build",
    "url"         : "https://chromium.googlesource.com/chromium/tools/build.git",
    "deps_file"   : ".DEPS.git",
    "managed"     : True,
    "custom_deps" : {
    },
    "safesync_url": "",
  },
  { "name"        : "build_internal",
    "url"         : "https://chrome-internal.googlesource.com/chrome/tools/build.git",
    "deps_file"   : ".DEPS.git",
    "managed"     : True,
    "custom_deps" : {
    },
    "safesync_url": "",
  },
]
cache_dir = None

The GPU recipes themselves live in tools/build/scripts/slave/recipes/gpu/. There are three recipes: build_and_upload.py, download_and_test.py, and build_and_test.py. All of the GPU bots currently use either the build_and_upload or download_and_test recipes. build_and_test was previously used on some bots, but currently is used only for local testing.

The recipes themselves are short; the bulk of the logic is factored into modules. Here is, for example, the entire code for the build_and_upload recipe:

DEPS = [
  'buildbot',
  'gpu',
  'platform',
  'properties',
]
    
def GenSteps(api):
  api.gpu.setup()
  yield api.buildbot.prep()
  yield api.gpu.checkout_steps()
  yield api.gpu.compile_steps()
  yield api.gpu.upload_steps()
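
The other recipes are equally small. As a rough sketch, download_and_test's GenSteps has the same shape, ending in test steps rather than compile and upload steps (the step names below are illustrative, except test_steps(), which is discussed later on this page; consult the actual recipe for the real sequence):

def GenSteps(api):
  api.gpu.setup()
  yield api.buildbot.prep()
  # Illustrative step names: fetch the isolates uploaded by the parent
  # builder, then run the tests they contain.
  yield api.gpu.download_steps()
  yield api.gpu.test_steps()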

The GPU recipe module lives in tools/build/scripts/slave/recipe_modules/gpu/ alongside the other recipe modules. api.py contains most of the logic for the GPU bots. The more significant logic includes:
  • Whether to use git or not (git is not yet used on the bots, but is practically required for local testing)
  • Whether to use "Blink mode" (fetching top of tree Blink; used on the Blink waterfall as well as try jobs applying to Blink)
  • Whether the recipe is running on a try server, and therefore needs to download and apply a patch
  • Whether the builder/tester pair is using isolates
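
To make this concrete, here is a hypothetical, heavily simplified sketch of how this kind of property-driven branching looks in a recipe module (the attribute and property names are illustrative; the real api.py is considerably larger):

from slave import recipe_api

class GPUApi(recipe_api.RecipeApi):
  def setup(self):
    # Derive configuration from the build properties passed to the recipe.
    self._use_git = self.m.properties.get('use_git', False)
    self._is_tryserver = self.m.properties.get(
        'mastername', '').startswith('tryserver.')
    self._build_config = self.m.properties.get('build_config', 'Release')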

Running the Recipe Locally

Testing recipes locally is easier than testing buildbot script changes. It's not necessary to run buildbot locally and trigger builds by hand. run_recipe.py executes the recipe in the same way it is run on the bot. Command line arguments easily change the behavior of the recipe.

When running the "builder" recipes (build_and_upload, build_and_test), a separate checkout of the entire Chromium source tree is made into the tools/build workspace. This takes a fair amount of time on the first run. It is recommended to use a git checkout for local testing, even though this configuration is not yet used on the bots. More information on how this is configured appears below.

As the GPU recipes have evolved, the number of command line arguments required to execute them properly has increased. Normally buildbot supplies these arguments, but "fake" values that are good enough can be used when testing changes to the recipes locally.

Unfortunately, multiple dependencies prevent anyone except Google employees from running the GPU recipe effectively. If you are a non-Google Chromium contributor and wish to make contributions and run the recipe locally, please file a bug at crbug.com/new with the label Cr-Internals-GPU-Testing.

Goma

The Chromium project's distributed build system, Goma, is currently required to run the GPU recipe, which unfortunately limits its direct execution to Google employees. Visit go/ma for setup instructions. Goma must be installed into tools/build/goma/, so that tools/build/goma/goma_ctl.sh exists on disk. When running the recipe locally, other instances of the compiler proxy should be stopped.

In short:
  1. cd tools/build/
  2. mkdir goma
  3. cd goma
  4. <fetch the goma_ctl.sh from the instructions above>
  5. chmod a+x goma_ctl.sh
  6. ./goma_ctl.sh update
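
To stop any other running instance of the compiler proxy before invoking the recipe, goma_ctl.sh provides a stop subcommand:

./goma_ctl.sh stop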

Credentials for Various Servers

As of this writing, the GPU recipe requires credentials for two services: the isolate server and Google Cloud Storage.

Isolate Server Credentials

Release builds via the GPU recipe automatically upload their results to the isolate server, so you must first authenticate to it. From a Chromium checkout, run:

./src/tools/swarming_client/auth.py login --service=https://isolateserver.appspot.com

This will open a web browser to complete the authentication flow. An @google.com email address is required in order to authenticate properly.

To test your authentication, find a hash for a recent isolate. For example, go to a recent build on linux_gpu_triggered_tests, go to the setup_build step, search for the "swarm_hashes" property, and take a random hash from one of the targets like content_gl_tests. Then run the following:

./src/tools/swarming_client/isolateserver.py download -f [hash] delete_me --isolate-server https://isolateserver.appspot.com

If authentication succeeded, this will silently download a file called "delete_me" into the current working directory. If it failed, the script will report multiple authentication errors. In this case, use the following command to log out and then try again:

./src/tools/swarming_client/auth.py logout --service=https://isolateserver.appspot.com

Cloud Storage Credentials

Authentication to Google Cloud Storage is needed for a couple of reasons: uploading pixel test results to the cloud, and potentially uploading and downloading builds as well, at least in Debug mode. Use the copy of gsutil in depot_tools/third_party/gsutil/gsutil, and follow the Google Cloud Storage instructions to authenticate. You must use your @google.com email address and be a member of the Chrome GPU team in order to receive read-write access to the appropriate cloud storage buckets. Roughly:
  1. Run gsutil config
  2. Copy/paste the URL into your browser
  3. Log in with your @google.com account
  4. Allow the app to access the information it requests
  5. Copy-paste the resulting key back into your Terminal
  6. Press "enter" when prompted for a project-id (i.e., leave it empty)
At this point you should be able to write to the cloud storage bucket.
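
As a quick, hedged check that your credentials are working (the bucket name comes from the URLs elsewhere on this page; listing requires only read access):

gsutil ls gs://chromium-gpu-archive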

Navigate to https://storage.cloud.google.com/?arg=chromium-gpu-archive to view the contents of the cloud storage bucket.

Running the recipes

As described above under Code Organization, the GPU recipes live in tools/build/scripts/slave/recipes/gpu, and all of the GPU bots on all of the waterfalls -- including the Chromium and WebKit GPU bots, the FYI bots, and even the GPU try servers -- run either the build_and_upload or download_and_test recipe. build_and_test is at this point mainly used for local testing, though since the introduction of isolates, download_and_test is much easier to use for that purpose.

The build_and_test and build_and_upload recipes

Once all authentication is complete, the recipes can be run. Here is an example invocation of the build_and_test recipe:

./scripts/tools/run_recipe.py gpu/build_and_test use_mirror=False [use_git=True] revision=253330 build_config=Release buildername="Linux Release (myname)" buildnumber=1503 slavename=mynamelinux mastername=chromium.gpu.myname > recipe_output.txt 2>&1

This will run the recipe and put all of its output into "recipe_output.txt" in the current working directory. You can watch its progress in another terminal by running 'tail -f recipe_output.txt'. It is strongly recommended to capture the recipe's entire output when running it locally so that you can easily search back for unexpected failures.

Throughout, it is recommended to replace "myname" with your login, so that if you do write results to cloud storage, it is easy to identify who wrote them.

Some notes on the command line arguments:
  • use_mirror=False: whether to use the mirrors in the Chromium golo. This should be set to False for local testing. (It's set to true on the real bots.)
  • use_git=True: optional; use git during the checkout. This flag is often necessary during local testing in order for the checkout to complete in a reasonable amount of time. However, some code paths (like uploading results to the flakiness dashboard) may not have been fully tested in git mode. The bots do not use git as of this writing.
  • revision: the svn revision or git hash to use. When running the recipe locally you should usually set this to the top of tree revision, or close to it. If you set use_git=True then you must supply a git hash for this value.
  • build_config: set to Release or Debug. Release will automatically build and upload isolates. Debug will upload a build to cloud storage.
  • buildername, slavename, mastername: choose synthetic values for these which are similar in form to those above, for consistency. Note that for the Debug flavor of the recipe, builds will be uploaded to https://cloud.google.com/console/storage/chromium-gpu-archive/ under [mastername]/[buildername]/full-build-[os]_[revision].zip.
  • buildnumber: used to name some cloud storage upload results. Choose a random value.
The build_and_test and build_and_upload recipes require the same basic command line arguments.

There are some other arguments which may be useful for local testing. skip_checkout can be used to skip the whole-workspace "gclient sync" operation, which usually triggers large rebuilds. skip_compile can be used to skip the compile, reusing the last run's binaries. See the build_and_test recipe for details; an example invocation using these flags follows.
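
For example, to iterate on later steps of the recipe without re-syncing or re-compiling (a hedged example, reusing the synthetic property values from above):

./scripts/tools/run_recipe.py gpu/build_and_test use_mirror=False use_git=True skip_checkout=True skip_compile=True revision=253330 build_config=Release buildername="Linux Release (myname)" buildnumber=1503 slavename=mynamelinux mastername=chromium.gpu.myname > recipe_output.txt 2>&1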

The download_and_test recipe

The download_and_test recipe requires additional arguments, because ordinarily the testers are triggered by the parent builder machines.

Example invocation on Linux:

./scripts/tools/run_recipe.py gpu/download_and_test revision=275027 parent_got_revision=275027 parent_got_webkit_revision=175512 parent_got_swarming_client_revision=ae8085b09e6162b4ec869e430d7d09c16b32b433 build_config=Release buildername='Linux Release (myname)' buildnumber=1785 slavename=mynamelinux mastername=chromium.gpu.myname swarm_hashes='{"angle_unittests":"93f03d694fe85f909127fff9fd56819ef6cb4c17","content_gl_tests":"94c7b3e327928bf7bce54c65c5ce8e787f2fad98","gl_tests":"fae574e343a6fe0f3840610404bbb505ab76f6ad","gles2_conform_test":"9298eb0ebed27957cf96c13a44a9cf40977e0310","tab_capture_end2end_tests":"ac697bc3d3a5a88bc797c1388071b74784f7497e","telemetry_gpu_test":"d8bf337de1bf900cf5c1ead3f8452096da946988"}' master_class_name_for_testing=ChromiumGPUMYNAME > recipe-output.txt 2>&1

Example invocation on Windows:

D:\src\depot_tools\python276_bin\python.exe scripts\tools\run_recipe.py gpu/download_and_test revision=275179 parent_got_revision=275179 parent_got_webkit_revision=175568 parent_got_swarming_client_revision=ae8085b09e6162b4ec869e430d7d09c16b32b433 build_config=Release buildername="Win Release (myname)" buildnumber=1785 slavename=mynamewin mastername=chromium.gpu.myname swarm_hashes="{'angle_unittests':'7be0445b4863e7f8a75d0cec92d9cf44ca40ab87','content_gl_tests':'8741fc43b973e3dfa91f095267e33d2b42878a9f','gl_tests':'c1d1093564231547db598450cbbca6cfc76e57dc','gles2_conform_test':'380fab3a5bc8a766b81d918ee3aadab7dcdc5ac6','tab_capture_end2end_tests':'8e0589b0d0001c18f1735964ce3973a666b58323','telemetry_gpu_test':'d6c053f0ad79999b4eea461bf3e417ae3f9776a3'}" master_class_name_for_testing=ChromiumGPUMYNAME > recipe-output.txt

Replace "myname" everywhere with your username to disambiguate any results which might be uploaded to the flakiness dashboard inadvertently.

Notes on the command line arguments:
  • parent_got_revision, parent_got_webkit_revision: these are ordinarily transmitted from the builder to the tester, and are required in order to identify the run on the flakiness dashboard and other places. For parent_got_revision, pick a revision close to Chromium's top of tree revision. For parent_got_webkit_revision, pick one close to Blink's top of tree revision (note that this waterfall displays both Chromium and Blink commits). If you're running with isolates, the exact numbers are not important.
  • parent_got_swarming_client_revision: tells the recipe which version of swarming_client should be used. Use the value in "swarming_revision" in src/DEPS.
  • build_config: Debug or Release. Debug will try to download a build of that revision from cloud storage. Release will run the isolates identified in swarm_hashes. It is much easier and faster to run via isolates in Release mode.
  • slavename, mastername: see above.
  • master_class_name_for_testing: the flakiness dashboard identifies the waterfall by the so-called buildbot "class name", which for the chromium.gpu waterfall is ChromiumGPU, for example. If you want to test the code path which uploads results to the flakiness dashboard, supply a value here.
  • swarm_hashes: these identify the isolates which were uploaded to the isolate server during the build step. It's required to supply hashes for all of the tests currently yielded by test_steps() in tools/build/scripts/slave/recipe_modules/gpu/api.py. For your given OS, you can find a list of recently-valid hashes by looking at one of the Release builders like Linux Release (NVIDIA), going to a recent successful build, going to the stdio for the "setup_build" step, and copying the swarm_hashes property. Note that you will have to reformat it with single quotes around the braces, double quotes within, and no "u" prefixes on the strings.
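
If reformatting the swarm_hashes by hand proves error-prone, a tiny helper script like the following (hypothetical, for illustration; Python 2, matching the depot_tools interpreter of this era) produces the Linux-style quoting shown above from the copied JSON:

import json
import sys

# Paste the swarm_hashes JSON from the setup_build step on stdin.
# json.dumps emits double-quoted strings with no u'' prefixes, so wrapping
# the result in single quotes gives the form used in the invocations above.
hashes = json.load(sys.stdin)
print "swarm_hashes='%s'" % json.dumps(hashes, separators=(',', ':'))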

Testing your own isolates with the download_and_test recipe

The easiest way to see how tests are invoked on the bots is to build isolates out of your own Chromium workspace, upload them to the isolate server, and then run the download_and_test recipe, passing the isolates' hashes in the swarm_hashes property. To do this:
  1. Set the test_isolation_mode, test_isolation_outdir, and archive_gpu_tests GYP_DEFINES. The easiest way to do this is to create a file called chromium.gyp_env at the top level of your Chromium workspace (alongside src/) with the following contents:

    {'GYP_DEFINES': 'test_isolation_mode=archive test_isolation_outdir=https://isolateserver.appspot.com/ archive_gpu_tests=1' }
  2. Run gclient runhooks
  3. Build the desired isolate targets (see the example command after this list). See src/chrome/chrome_tests.gypi and look for targets ending in "_run". If you only want to test one isolate, you can build just its target and supply its hash in the swarm_hashes, picking up the rest of the hashes from a recent build on the same OS from one of the GPU testers.
  4. The build output will contain something like:
    5d4a142d0fb6f51d8814c16e74ac622c179c8f23  telemetry_gpu_test.isolated

    The hash is the number on the left.
  5. To force an isolate to be rebuilt, delete the file src/out/Release/[isolate name].isolated and rebuild it. This is necessary if you changed the binary or test files and building the _run target reports "no work to do".
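
For example, assuming a ninja-based Release build (the target name here is just one of the _run targets mentioned above; any of them works the same way):

cd src
ninja -C out/Release telemetry_gpu_test_run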

Modifying the Recipe (Including Adding New Steps)

Retraining Recipes' Expectations

As described in the documentation for recipes, the primary way recipes are tested is to record the output of what commands they would have executed. Any change to the recipe which affects the steps in any way, whether in which steps are executed or in the command line arguments passed to any command, requires the recipe's expectations to be retrained. You will discover this if you attempt to git cl upload a change to the recipe and the presubmit checks fail with output like:

FAIL: gpu/build_and_upload.win_release_gclient_revert_failure (/Work/kbr/tools/build/scripts/slave/recipes/gpu/build_and_upload.expected/win_release_gclient_revert_failure.json)
----------------------------------------------------------------------
*** expected
--- current
***************
*** 32,39 ****
--- 32,107 ----
            'src@204787',
            '--output-json',
            '/path/to/tmp/json'],
    'name': 'gclient sync',
+   '~followup_annotations': ['@@@STEP_LOG_LINE@json.output@{@@@',
+                             '@@@STEP_LOG_LINE@json.output@  "solutions": {@@@',
To retrain the recipes' expectations, run:

./build/scripts/slave/unittests/recipe_simulation_test.py train

As of this writing, this must be done on Linux only; otherwise, extraneous differences in paths will show up in the expectations.

Then carefully examine the changed files. Make sure the differences in the commands are what you expect. This is the primary line of defense against breaking the bots.

Code Coverage

Recipes require 100% code coverage. It is not allowed to add a conditional to a recipe without tests that exercise both branches. For this reason, if you add a conditional, a new recipe module, or a new API to an existing module, it is very likely that you will need to either add a new test for it, or modify an existing one.

Study the GenTests() methods in the build_and_test, build_and_upload, and download_and_test recipes. These should give an idea of the form of the tests, and the situations where new tests are needed in order to provide 100% code coverage. See for example:
  • the top-of-tree ANGLE test in build_and_upload
  • the test of Blink issues on the trybot in build_and_test
  • the use of the master_class_name_for_testing property in download_and_test
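
For orientation, a new test usually takes roughly the following shape (a hedged sketch; the property names follow the examples earlier on this page, but the exact test-api helpers may differ from the real recipes):

def GenTests(api):
  yield (
    api.test('linux_release_tryserver') +
    api.properties(build_config='Release',
                   mastername='tryserver.chromium') +
    api.platform.name('linux')
  )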

Try Jobs for Recipes (or lack thereof)

As of this writing, it is unfortunately not possible to send try jobs of the GPU recipe for actual execution on the GPU try servers. This means that significant changes to the recipe must be handled very carefully. Always file a bug about changes to the GPU recipe, and point the BUG= line to it in associated CLs. Doing so will yield a clear timeline of all commits and reverts associated with a change to the recipe. Be prepared to use drover or "git revert" to roll back changes to the recipe which introduce breakage on the bots. Always provide at least a brief description of the reason for the revert in the CL, and provide more detail in the bug report, including excerpts of logs. (The links to logs expire after a not very long time.) Do not let bots stay red for an extended period of time while issues with the recipe are being fixed.

Adding New Steps to the Recipe

It's straightforward to add new steps to the recipe. Follow the patterns in tools/build/scripts/slave/recipe_modules/gpu/api.py for either a new build step or a new test step.

All new tests running on the try servers and main waterfall bots (chromium.gpu, chromium.webkit) must be open-source. Please see the Chromium testing guidelines for details on this policy. If it's simply impossible to open-source the test, it may be possible to run it on the chromium.gpu.fyi waterfall, but a better approach would be to create an open-source version of the test.

All new tests must be runnable via isolates. If you are adding a new binary (unlikely), you need to add a new .isolate file in src/chrome/, and a new _run target to src/chrome/chrome_tests.gypi. Then add your isolate's name to the list in tools/build/scripts/slave/recipe_modules/gpu/common.py. If you're adding a new Telemetry-based test (the likely, and hoped-for, case), your new test or data files will probably already be covered by either telemetry.isolate or telemetry_gpu_test.isolate. Adjust the isolates as necessary, and create a new one only if absolutely necessary.
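
As a hedged sketch of the gyp side of this (the target and file names are hypothetical; when actually doing this, mirror an existing _run target in src/chrome/chrome_tests.gypi rather than copying this verbatim):

    {
      # Hypothetical isolated-test target; pairs my_new_gpu_test with
      # its .isolate file so that building my_new_gpu_test_run archives it.
      'target_name': 'my_new_gpu_test_run',
      'type': 'none',
      'dependencies': [
        'my_new_gpu_test',
      ],
      'includes': [
        '../build/isolate.gypi',
      ],
      'sources': [
        'my_new_gpu_test.isolate',
      ],
    },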

Build and run your isolate locally before attempting to add it to the GPU recipe. See the subsection above entitled "Testing your own isolates with the download_and_test recipe" for instructions on setting up the needed GYP_DEFINES to build and upload your isolate to the isolate server. To run it locally, run src/tools/swarming_client/run_isolated.py with the appropriate arguments. For simple isolates (i.e., non-Telemetry based ones):

./src/tools/swarming_client/run_isolated.py -H [hash] -I https://isolateserver.appspot.com

The Telemetry-based GPU tests currently use the same isolate for all the tests. In this case the invocation looks like (for example):

./src/tools/swarming_client/run_isolated.py -H [hash] -I https://isolateserver.appspot.com -- webgl_conformance --browser=release --show-stdout [additional telemetry arguments]

If you are adding a new build step, run the build_and_upload recipe locally to make sure it works.

If you are adding a new test step, it is recommended to first build its associated isolate out of your (separate) Chromium workspace and upload that to the isolate server. Then run the download_and_test step locally, passing the hash of your local build's isolate in the swarm_hashes dictionary. Copy the rest of the hashes from a recent build on one of the Release GPU bots running the same OS as your local machine.

Because currently it isn't possible to send try jobs of the recipe itself (see the section above), if you are adding a new test step, it is strongly recommended to:
  1. Check in the skeleton of the test to the Chromium workspace, with all of the tests within commented out.
  2. Commit the change to the GPU recipe adding the new step. This should trivially be green on all of the GPU bots.
  3. Send a try job to the GPU try servers which uncomments the tests. This will at least provide coverage on a subset of the GPUs on the main waterfall.
  4. If the try job looks good, commit it.
Note that the GPU try servers should be part of the commit queue soon, so soon it shouldn't be necessary to send these try jobs manually.

When you commit your change to the recipe:
  1. Make sure the tree is green first. Don't commit changes to the recipe if there is redness on the tree.
  2. Watch the tree after your commit. If any of the bots turn red because of your commit, revert, diagnose what happened, fix it, and re-land. (See also the next paragraph.)
Note also that changes to the recipe might be seen for the first time on the testers rather than the builders. If you add a new binary, you might find that the testers fail during the first execution of that recipe change, unable to find the isolate for the new binary. You can work around this by committing your recipe changes in two stages: the first adds the compilation of the new binary, and the second adds its execution; wait for your changes to propagate through the waterfalls in between. Or, if this is the issue, just wait for a second build and see whether the problem clears up. If not, revert.

Again, there is currently no support for try jobs of the recipe itself. Be careful when making changes!