
Many of our frequent intermittent test failures are timeouts. There are a lot of ways that a test – or a test job – can time out. Some popular bug titles demonstrate the range of failure messages:

  • This test exceeded the timeout threshold. It should be rewritten or split up. If that’s not possible, use requestLongerTimeout(N), but only as a last resort.
  • Test timed out.
  • TimeoutException: Timed out after … seconds
  • application ran for longer than allowed maximum time
  • application timed out after … seconds with no output
  • Task timeout after 3600 seconds. Force killing container.

We have tried rewording some of these messages to clarify the cause of the timeout and possible remedies, but I still see lots of confusion in bugs. In some cases, a complete explanation is much more involved than we can hope to express in an error message. I think we should write up a wiki page or MDN article with detailed explanations of messages like these, and point to that page from the error messages in the test log.

One of the first things I do when I see a test failure due to timeout is look for a successful run of the same test on the same platform, and then compare the timing between the success and failure cases. If a test takes 4 seconds to run in the success case but times out after 45 seconds, perhaps there is an intermittent hang; but if the test takes 40 seconds to run successfully and intermittently times out after 45 seconds, it’s probably just a long-running test with normal variation in run time.
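That comparison can be sketched as a small heuristic. This is only an illustration: the function name `classify_timeout` and the 0.8 cut-off are assumptions made up for the example, not something any test harness provides, and a real judgment would look at more than one successful run.

```python
def classify_timeout(success_seconds: float, timeout_seconds: float,
                     near_ratio: float = 0.8) -> str:
    """Guess why a test timed out from the duration of a successful run.

    near_ratio is a hypothetical, arbitrary threshold: a successful run
    that uses at least this fraction of the timeout counts as "close".
    """
    if success_seconds <= 0 or timeout_seconds <= 0:
        raise ValueError("durations must be positive")
    if success_seconds / timeout_seconds >= near_ratio:
        # Normal variation in run time is enough to push it over the limit.
        return "long-running test"
    # The test normally finishes with plenty of room, so something stalled.
    return "possible intermittent hang"


print(classify_timeout(4, 45))   # possible intermittent hang
print(classify_timeout(40, 45))  # long-running test
```

The two calls use the numbers from the paragraph above: 4 seconds against a 45 second timeout, and 40 seconds against the same timeout.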

This suggests some nice-to-have tools:

  • push a new test to try, get a report of how long your test runs on each platform, perhaps with a warning if the run time approaches a known timeout, or perhaps some arbitrary threshold;
  • the same for the longest duration without output, to help avoid the “no output timeout”;
  • use custom code or a special test harness mode to identify existing long-running tests, for proactive follow-up to prevent timeouts in the future.
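As a rough sketch of what the first two reports might compute, assuming per-platform durations and output timestamps have already been extracted from the test logs (that extraction is not shown), something like the following could work. The function names, the input shapes, and the 0.8 warning ratio are all hypothetical.

```python
from typing import Dict, List


def near_timeout_warnings(durations: Dict[str, float], timeout: float,
                          warn_ratio: float = 0.8) -> List[str]:
    """List platforms whose run time is at or above warn_ratio * timeout.

    durations maps a platform name to the test's run time in seconds.
    warn_ratio is an arbitrary threshold chosen for this example.
    """
    return [
        f"{platform}: {seconds:.0f}s of {timeout:.0f}s allowed"
        for platform, seconds in sorted(durations.items())
        if seconds >= warn_ratio * timeout
    ]


def longest_silence(output_times: List[float], start: float,
                    end: float) -> float:
    """Longest stretch, in seconds, with no output between start and end.

    output_times holds the timestamps (in seconds) at which the test
    produced a line of output.
    """
    points = [start] + sorted(output_times) + [end]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


# Made-up numbers: only the android run is close to a 45 second timeout.
print(near_timeout_warnings({"linux": 12, "android": 40}, timeout=45))
# Output at 1s, 2s and 10s of a 12 second run: the longest quiet gap is 8s.
print(longest_silence([1, 2, 10], start=0, end=12))
```

Comparing the value from `longest_silence` against the harness's no-output limit would give the same kind of early warning for “no output timeout” failures.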