Also, I am working on a separate change to store the expected images on the server, so that they don't have to be pulled down locally into the tree.
First, let's think about how one writes tests. Typically, there are two approaches.
The first, and most popular these days, is to write a self-contained test that checks its own output and simply announces "pass" or "fail". This is in fact the recommended way to write tests in WebKit, and is how xUnit-style tests are usually written.
The second is to separate the test from the output, and to use a driver that checks the output against an expected result (or baseline) to determine whether the test passed or failed. This is how run_webkit_tests works.
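To make the contrast concrete, here is a minimal sketch of the two styles. The function names and the string being tested are made up for illustration; this is not how run_webkit_tests is actually implemented.

```python
def self_checking_test():
    """Approach 1: the test knows the right answer and reports the verdict itself."""
    actual = "hello".upper()
    return "pass" if actual == "HELLO" else "fail"


def produce_output():
    """Approach 2: the test only produces output and has no opinion on correctness."""
    return "hello".upper() + "\n"


def run_against_baseline(test_fn, expected_text):
    """Hypothetical driver: compare the test's output with a stored baseline."""
    return "pass" if test_fn() == expected_text else "fail"
```

In the second style the verdict lives entirely in the driver and the baseline, which is why a single test can be paired with more than one expected result.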
Most people prefer the first approach because there are fewer files to maintain, and the purpose and correctness of the test are more obvious. However, in some cases (e.g., pixel tests in the renderer), this simply isn't possible (or, at least, practical).
Both approaches, however, have drawbacks, both in the normal case and in the "we expect this test to fail" case.
In the normal case, a problem arises with the first approach if there are actually multiple "correct" answers. One example is a JavaScript test whose expected output differs between V8 and JSC (e.g., if you were testing implementation-dependent features like stack traces). There are three workarounds for this, all of which are weak. The first is to rewrite the implementation to conform. Generally this is a good idea, but sometimes it might not be desirable or even possible. The second is to rewrite the test to accept either output. This is somewhat fragile, and requires the test to know about every possible implementation. The third is to not run the test, and to instead copy it into a new test and then modify the copy so that its output is correct for that implementation. Both of the latter approaches lead to confusion (which version is "correct", and why?) and maintenance issues (if something changes, do I have to modify both versions? What should the "correct" output be?).
The second approach, obviously, is just the second workaround to the first approach, codified into different files. The second approach is perhaps preferable where multiple correct results are the norm, rather than the exception, which is why we use it to compare PNGs across multiple platforms.
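As a rough sketch of what "multiple correct results" looks like under the second approach, a driver might look for a platform-specific baseline first and fall back to a generic one. The dictionary layout and the find_baseline name are assumptions for illustration, not the real lookup logic.

```python
def find_baseline(baselines, test_name, platform):
    """Return the platform-specific baseline if one exists, else the generic one.

    `baselines` is a hypothetical dict keyed by (platform, test_name),
    with platform None standing for the generic baseline.
    """
    for key in ((platform, test_name), (None, test_name)):
        if key in baselines:
            return baselines[key]
    return None  # no baseline recorded for this test at all
```

For example, with baselines for `(None, "a.html")` and `("mac", "a.html")`, a "mac" run picks its own baseline while a "win" run falls back to the generic one.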
Now, in the case where tests fail, one can actually view the failure as a different kind of "correct" - i.e., we know the output is wrong, but it's an "expected" diff, and in most cases, we want to know if the diff changes from what we expect. Perhaps we actually fixed the bug? Perhaps we introduced a new bug? In fact, one could argue that platform-specific baselines are "expected" wrong baselines.
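A minimal sketch of that idea, assuming the known-wrong output has been checked in as a baseline (the function name and return values are hypothetical):

```python
def check_against_known_failure(actual, failing_baseline):
    """Compare output with a baseline that is known to be wrong."""
    if actual == failing_baseline:
        return "unchanged"  # still failing in exactly the way we expect
    return "changed"        # either the bug was fixed or a new one appeared
```

Note that a "changed" result only says the diff moved. It cannot say by itself whether things got better or worse.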
Tracking "expected diffs" introduces its own woes - what if the diff output is not deterministic? Or, more importantly, how do you distinguish an "expected wrong diff" from an "expected right diff"?
Lastly, one could argue that we should spend more time fixing the bugs that cause the diffs, and less time tracking diffs :) Unfortunately, it's a lot faster to baseline expected diffs than it is to fix them :(