, ,

Tests running on our new-ish Android 4.3 Opt emulator platform have recently been plagued by intermittent timeouts and I have been having a closer look at some of them (like bug 919246 and bug 1154505) .

A few of these tests normally run “quickly”. Think of a test that runs to completion in under 10 seconds most of the time but times out after 300+ seconds intermittently. In a case like this, it seems likely that there is an intermittent hang and the test needs debugging to determine the underlying cause.

But most of the recent Android 4.3 Opt test timeouts seem to be affecting what I classify as “long-running” tests. Think of a test that normally runs to completion in 250 to 299 seconds, but intermittently times out after 300 seconds. It seems likely that normal variations in test duration are intermittently pushing past the timeout threshold; if we can tolerate a longer time-out, or make the test run faster in general, we can probably eliminate the intermittent test failure.

We have a lot of options for dealing with long-running tests that sometimes timeout.

Option: Simplify or optimize the test

Long-running tests are usually doing a lot of work. A lot of assertions can be run in 300 seconds, even on a slow platform! Do we need to test all of those cases, or could some be eliminated? Is there some setup or tear down code being run repeatedly that could be run just once, or even just less often?

We usually don’t worry about optimizing tests but sometimes a little effort can help a test run a lot more efficiently, saving test time, money (think aws costs), and aggravation like intermittent time-outs.

Option: Split the test into 2 or more smaller tests

Some tests can be split into 2 or more smaller tests with minimal effort. Instead of testing 100 different cases in one test, we may be able to test 50 in each. There may be some loss of efficiency: Maybe some setup code will need to be run twice, and copied and pasted to the second test. But now each half runs faster, reducing the chance of a timeout. And when one test fails, the cause is – at least slightly – more isolated.

Option: Request a longer timeout for the test

Mochitests can call SimpleTest.requestLongerTimeout(2) to double the length of the timeout applied to the test. We currently have about 100 mochitests that use this feature.

For xpcshell tests, the same thing can be accomplished with a manifest annotation:

requesttimeoutfactor = 2

That’s a really simple “fix” and an effective way of declaring that a test is known to be long-running.

On the other hand, it is avoiding the problem and potentially covering up an issue that could be solved more effectively by splitting, optimizing, or simplifying. Also, long-running tests make our test job “chunking” less effective: It’s harder to split load evenly amongst jobs when some tests run 100 times longer than others.

Option: Skip the test on slow platforms

Sometimes it’s not worth the effort. Do we really need to run this test on Android as well as on all the desktop platforms? Do we get value from running this test on both Android 2.3 and Android 4.3? We may “disable our way to victory” too often, but this is a simple strategy, doesn’t affect other platforms and sometimes it feels like the right thing to do.

Option: Run on faster hardware

This usually isn’t practical, but in special circumstances it seems like the best way forward.

If you have a lot of timeouts from long-running tests on one platform and those tests don’t timeout on other platforms, it may be time to take a closer look at the platform.

Our Android arm emulator test platforms are infamous for slowness. In fairness, the emulator has a lot of work to do, Firefox is complex, our tests are often relentless (compared to human-driven browsing), and we normally run the emulator on the remarkably slow (and cheap!) m1.medium AWS instances.

If we are willing to pay for better cpu, memory, and I/O capabilities, we can easily speed up the emulator by running on a faster AWS instance type — but the cost must be justified.

I recently tried running Android 4.3 Debug mochitests on m1.medium and found that many tests timed out. Also, since all tests were taking longer, each test job (each “chunk”) needed 2 to 3 hours to complete — much longer than we can wait. Increasing chunks seemed impractical (we would need 50 or so) and we would still have all those individual timeouts to deal with. In this case, running the emulator on c3.xlarge instances for Debug mochitests made a big difference, allowing them to run in the same number of chunks as Opt on m1.medium and eliminating nearly all timeouts.

I’ve enjoyed investigating mochitest timeouts and found most of them to be easy to resolve. I’ll try to investigate more timeouts as I see them. Won’t you join me?