A few weeks ago, I was trying to run Linux Debug mochitests in an unfamiliar environment and that got me to thinking about how well tests run on different computers. How much does the run-time environment – the hardware, the OS, system applications, UI, etc. – affect the reliability of tests?
At that time, Linux 64 Debug plain, non-e10s mochitests on treeherder – M(1) .. M(5) – were running well: Nearly all jobs were green. The most frequent intermittent failure was dom/html/test/test_fullscreen-api-race.html, but even that test failed only about 1 time in 10. I wondered, are those tests as reliable in other environments? Do intermittent failures reproduce with the same frequency on other computers?
Experiment: Borrow a test slave, run tests over VNC
I borrowed an aws test slave – see https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_a_slave – and used VNC to access the slave and run tests. I downloaded builds and test packages from mozilla-central and invoked run_tests.py with the same arguments used for the automated tests shown on treeherder. To save time, I restricted my tests to mochitest-1, but I repeated mochitest-1 10 times. All tests passed all 10 times. Additional runs produced intermittent failures, like test_fullscreen-api-race, with approximately the same frequency reported by Orange Factor for recent builds. tl;dr Treeherder results, including intermittent failures, for mochitests can be reliably reproduced on borrowed slaves accessed with VNC.
Experiment: Run tests on my laptop
Next I tried running tests on my laptop, a ThinkPad w540 running Ubuntu 14. I downloaded the same builds and test packages from mozilla-central and invoked run_tests.py with the same arguments used for the automated tests shown on treeherder. This time I noticed different results immediately: several tests in mochitest-1 failed consistently. I investigated and tracked down some failures to environmental causes: essential components like pulseaudio or gstreamer not installed or not configured correctly. Once I corrected those issues, I still had a few permanent test failures (like dom/base/test/test_applet_alternate_content.html, which has no bugs on file) and very frequent intermittents (like dom/base/test/test_bug704320_policyset.html, which is decidedly low-frequency in Orange Factor). I also could not reproduce the most frequent mochitest-1 intermittents I found on Orange Factor and reproduced earlier on the borrowed slave. An intermittent failure like test_fullscreen-api-race, which I could generally reproduce at least once in 10 to 20 runs on a borrowed slave, I could not reproduce at all in over 100 runs on my laptop. (That’s 100 runs of the entire mochitest-1 job. I also tried running specific tests or specific directories of tests up to 1000 times, but I still could not reproduce the most common intermittent failures seen on treeherder.) tl;dr Intermittent failures seen on treeherder are frequently impossible to reproduce on my laptop; some failures seen on my laptop have never been reported before.
Experiment: Run tests on a Digital Ocean instance
Digital Ocean offers virtual servers in the cloud, similar to AWS EC2. Digital Ocean is of interest because rr can be used on Digital Ocean but not on aws. I repeated my test runs, again with the same methodology, on a Digital Ocean instance set up earlier this year for Orange Hunter.
My experience on Digital Ocean was very similar to that on my own laptop. Most tests pass, but there are some failures seen on Digital Ocean that are not seen on treeherder and not seen on my laptop, and intermittent failures which occur with some frequency on treeherder could not be reproduced on Digital Ocean.
tl;dr Intermittent failures seen on treeherder are frequently impossible to reproduce on Digital Ocean; some failures seen on Digital Ocean have never been reported before; failures on Digital Ocean are also different (or of different frequency) from those seen on my laptop.
I found it relatively easy to run Linux Debug mochitests in various environments in a manner similar to the test jobs we see on treeherder. Test results were similar to treeherder, in that most tests passed. That’s all good, and expected.
However, test results often differed in small but significant ways across environments and I could not reproduce most frequent intermittent failures seen on treeherder and tracked in Orange Factor. This is rather discouraging and the cause of the concern mentioned in my last post: While rr appears to be an excellent tool for recording and replaying intermittent test failures and seems to have minimal impact on the chances of reproducing an intermittent failure, rr cannot be run on the aws instances used to run Firefox tests in continuous integration, and it seems difficult to reproduce many intermittent test failures in different environments. (I don’t have a good sense of why this is: Timing differences, hardware, OS, system configuration?)
If rr could be run on aws, all would be grand: We could record test runs in aws with excellent chances of reproducing and recording intermittent test failures and could make those recordings available to developers interested in debugging the failures. But I don’t think that’s possible.
We had hoped that we could run tests in another environment (Digital Ocean) and observe the same failures seen on aws and reported in treeherder, but that doesn’t seem to be the case.
Another possibility is bug 1226676: We hope to start running Linux tests in a docker container soon. Once that’s working, if rr can be run in the container, perhaps intermittent failures will behave the same way and can be reproduced and recorded.