Test Ownership

When a bug for an intermittent test failure needs attention, who should be contacted? Who is responsible for fixing that bug? For as long as I have been at Mozilla, I have heard people ask variations of this question, and I have never heard a clear answer.

There are at least two problematic approaches that are sometimes suggested:

  1. The test author: Many test authors are no longer active contributors. Even if they are still active at Mozilla, they may not have modified the test or worked on the associated project for years. Also, making test authors responsible for their tests in perpetuity may dissuade many contributors from writing tests at all!
  2. The last person to modify the test: Many failing tests have been modified recently, so the last person to modify the test may be well-informed about the test and may be in the best position to fix it. But recent changes may be trivial and tangential to the test. And if the test hasn’t been modified recently, this option may revert to the test author, or someone else who isn’t actively working in the area or is no longer familiar with the code.

There are at least two seemingly viable approaches:

  1. “You broke it, you fix it”: The person who authored the changeset that initiated the intermittent test failure must fix the intermittent test failure, or back out their change.
  2. The module owner for the module associated with the test is responsible for the test and must find someone to fix the intermittent test failure, or disable the test.

Let’s have a closer look at these options.

The “you broke it, you fix it” model is appealing because it is a continuation of a principle we accept whenever we check in code: If your change immediately breaks tests or is otherwise obviously faulty, you expect to have your change backed out unless you can provide an immediate fix. If your change causes an intermittent failure, why should it be treated differently? The sheriffs might not immediately associate the intermittent failure with your change, but with time, most frequent intermittent failures can be traced back to the associated changeset, by repeating the test on a range of changesets. Once this relationship between changeset and failure is determined, the changeset needs to be fixed or backed out.

A problem with “you broke it, you fix it” is that it is sometimes difficult and/or time-consuming to find the changeset that started the intermittent. The less frequent the intermittent, the more tests need to be backfilled and repeated before a statistically significant number of test passes can be accepted as evidence that the test is passing reliably. That takes time, test resources, etc.
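To put rough numbers on that (this is my own back-of-the-envelope arithmetic, not something our tools compute): if a test fails intermittently with probability p, the chance that N retriggers all pass even though the failure is already present is (1-p)^N. For a 2% intermittent, that means roughly 150 green retriggers before a run of passes is reasonably convincing:

# chance of N consecutive passes despite a 2% intermittent: 0.98^N
# want 0.98^N < 0.05  =>  N > log(0.05) / log(0.98) ~= 148
python -c "import math; print(math.log(0.05) / math.log(0.98))"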

Sometimes, even when that changeset is identified, it’s hard to see a connection between the change and the failing test. Was the test always faulty, but just happened to pass until a patch modified the timing or memory layout or something like that? That’s a possibility that always comes to mind when the connection between changeset and failing test is less than obvious.

Finally, if the changeset author is not invested in the test, or not familiar with the importance of the test, they may be more inclined to simply skip the test or mark it as failing.

The “module owner” approach is appealing because it reinforces the Mozilla module owner system: Tests are just code, and the code belongs to a module with a responsible owner. Practically, ‘mach file-info bugzilla-component <test-path>’ can quickly determine the bugzilla component, and nearly all bugzilla components now have triage owners (who are hopefully approved by the module owner and knowledgeable about the module).
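For example (the test path below is made up, and the exact output layout may differ slightly, but this is the general idea):

./mach file-info bugzilla-component dom/base/test/test_example.html
# prints something like:
#   Core :: DOM
#     dom/base/test/test_example.html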

Module and triage owners ought to be more familiar with the failing test and the features under test than others, especially people who normally work on other modules. They may have a greater interest in properly fixing a test than someone who has only come to the test because their changeset triggered an intermittent failure.

Also, intermittent failures are often indicative of faulty tests: A “good” test passes when the feature under test is working, and it fails when the feature is broken. An intermittently failing test suggests the test is not reliable, so the test’s module owner should be ultimately responsible for improving the test. (But sometimes the feature under test is unreliable, or is made unreliable by a fault in another feature or module.)

A risk I see with the module owner approach is that it potentially shifts responsibility away from those who are introducing problems: If my patch is good enough to avoid immediate backout, any intermittent test failures I cause in other people’s modules are no longer my concern.

 

As part of the Stockwell project, :jmaher and I have been using a hybrid approach to find developers to work on frequent intermittent test failure bugs. We regularly triage, using tools like OrangeFactor to identify the most troublesome intermittent failures, and then try to find someone to work on each of those bugs. I often use a procedure like this (the commands I typically reach for are sketched after the list):

  1.  Does hg history show the test was modified just before it started failing? Ping the author of the patch that updated the test.
  2.  Can I retrigger the test a reasonable number of times to track down the changeset associated with the start of the failures? Ping the changeset author.
  3.  Does hg history indicate significant recent changes to the test by one person? Ask that person if they will look at the test, since they are familiar with it.
  4.  If all else fails, ping the triage owner.
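Here is a minimal sketch of the checks behind steps 1 through 3, assuming a local mozilla-central clone (the test path and revision are placeholders I made up):

# steps 1 and 3: who has modified this test recently, and when?
hg log --follow --limit 10 --template '{date|shortdate} {author|person}: {desc|firstline}\n' \
    dom/example/tests/test_example.html

# step 2: retriggers and backfills happen in the Treeherder UI; once a suspect
# changeset is identified, its author is easy to find:
hg log -r SUSPECT_REVISION --template '{author}\n'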

This triage procedure has been a great learning experience for me, and I think it has helped move lots of bugs toward resolution sooner, reducing the number of intermittent failures we all need to deal with. But it doesn’t seem like a sustainable mode of operation. Retriggering to find the regression can be especially time-consuming and is sometimes not successful. We sometimes have 50 or more frequent intermittent failure bugs to deal with, we have limited time for triage, and while we are bisecting, the test keeps failing.

I’d much prefer a simple way of determining an owner for problematic intermittents…but I wonder if that’s realistic.  While I am frustrated by the times I’ve tracked down a regressing changeset only to find that the author feels they are not responsible, I have also been delighted to find changeset authors who seem to immediately see the problem with their patch. Test authors sometimes step up with genuine concern for “their” test. And triage owners sometimes know, for instance, that a feature is obsolete and the test should be disabled. So there seems to be some value in all these approaches to finding an owner for intermittent failures…and none of the options are perfect.

 

When a bug for an intermittent test failure needs attention, who should be contacted? Who is responsible for fixing that bug? Sorry, no clear answer here either! Do you have a better answer? Let me know!

Neglected Oranges

I wrote earlier about my initial experience with triaging frequent intermittent test failures. I was happy to find that most of the most-frequent test failures were under active investigation, but that also meant that finding important bugs in need of triage was a frustrating and time-consuming process.

Thankfully, :ekyle provided me with a script to identify “neglected oranges”: Frequent intermittent test failure bugs with no recent comments. The neglected oranges script provides search results not unlike the default search on Orange Factor, but filters out bugs with recent comments from non-robots. It also shows the bug age and how long it has been since the last comment:

[Screenshot: the neglected oranges report]

This has provided a treasure trove of bugs for triage.
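I don’t have :ekyle’s script to share here, but a rough sketch of the underlying idea, using the Bugzilla REST API, looks something like this (the bug number is made up, and the pattern used to recognize robot accounts is an assumption):

BUG=1234567   # a hypothetical candidate bug from the frequent-failure list
curl -s "https://bugzilla.mozilla.org/rest/bug/$BUG/comment" |
  jq -r --arg bug "$BUG" '
    .bugs[$bug].comments
    | map(select(.creator | test("orangefactor|intermittent-bug-filer") | not))
    | if length == 0 then "no human comments yet"
      else "last human comment: " + .[-1].creation_time end'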

So, now that I can find bugs for frequent intermittent failures that don’t have anyone actively working on them, can I instigate action? Does this type of triage lead to bug resolution and a reduction in Orange Factor (average number of failures per push)? Here’s one way of looking at it: the bugs I’ve recently triaged had been open, on average, for 65 days before my triage comment. Typically I tried to find someone familiar with the bug and pointed out that it was a frequently failing test; sometimes I offered some insight, or suggested some action (“this is a timeout in a long-running test; if it cannot be optimized or split up, requestLongerTimeout() should avoid the timeout”). On average, those bugs were resolved within 3 days of my triage comment. Wow!

I offer this evidence that triage of neglected oranges makes a difference, but also caution not to expect that much of a difference over time: I’ve chosen bugs that were open for months and with continued triage, we may quickly eliminate these long-neglected bugs (let’s hope!). I’ve also likely chosen “easy” bugs – bugs with an obvious, or at least apparent, resolution. There will also be intractable bugs, surely, and bugs without any apparent owner, or where interested parties cannot agree on a solution.

It is similarly difficult to draw conclusions from Orange Factor failure rates, but let’s look at those anyway, roughly for the time period I have been triaging:

[Graph: Orange Factor failure rate over the period I have been triaging]

That’s encouraging, isn’t it? I don’t know how much of that improvement was instigated by my triage comments, but I like to think I have contributed to the improvement, and that this type of action can continue to drive down failure rates. I’ll keep spending at least a few hours each week on neglected oranges, and see how that goes for the next couple of months. Can we bring Orange Factor under 10? Under 5?

Timeout Triage

Many of our frequent intermittent test failures are timeouts. There are a lot of ways that a test – or a test job – can time out. Some popular bug titles demonstrate the range of failure messages:

  • This test exceeded the timeout threshold. It should be rewritten or split up. If that’s not possible, use requestLongerTimeout(N), but only as a last resort.
  • Test timed out.
  • TEST-UNEXPECTED-TIMEOUT
  • TimeoutException: Timed out after … seconds
  • application ran for longer than allowed maximum time
  • application timed out after … seconds with no output
  • Task timeout after 3600 seconds. Force killing container.

We have tried re-wording some of these messages with the aim of clarifying the cause of the timeout and possible remedies, but I still see lots of confusion in bugs. In some cases, I think a complete explanation is much more involved than we can hope to express in an error message. I think we should write up a wiki page or MDN article with detailed explanations of messages like this, and point to that page from error messages in the test log.

One of the first things I do when I see a test failure due to timeout is look for a successful run of the same test on the same platform, and then compare the timing between the success and failure cases. If a test takes 4 seconds to run in the success case but times out after 45 seconds, perhaps there is an intermittent hang; but if the test takes 40 seconds to run successfully and intermittently times out after 45 seconds, it’s probably just a long running test with normal variation in run time.

This suggests some nice-to-have tools:

  • push a new test to try, get a report of how long your test runs on each platform, perhaps with a warning if run-time approaches known timeouts, or perhaps some arbitrary threshold (a rough sketch of this follows the list);
  • same for longest duration without output (avoid “no output timeout”);
  • use custom code or a special test harness mode to identify existing long-running tests, for proactive follow-up to prevent timeouts in the future.
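As a rough illustration of the first item, here is the kind of one-liner I have in mind, run against a mochitest log from a try push (the log format and the 40-second threshold are assumptions on my part):

# flag any test whose run time is within striking distance of a 45-second
# timeout, assuming log lines like "TEST-OK | <test path> | took 1234ms"
grep "| took " mochitest.log |
  awk -F"took " '{ ms = $2 + 0; if (ms > 40000) print "NEAR TIMEOUT: " $0 }'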

Triaging with Orange Factor

Recently, I have been trying to spend a little time each day looking over the most frequent intermittent test failures in search of neglected bugs. I use Orange Factor to identify the most frequent failures, then scan the associated bugs in bugzilla to see if there is someone actively working on the bug.

I have had some encouraging successes. For example, in bug 1307388, I found a frequent intermittent with no one assigned and no sign of activity. The test had started failing recently – a few days earlier – with no sign of failures before that. A quick check of the mercurial logs showed that the test had been modified the day that it started failing, and a needinfo of the patch author led to immediate action.

Bug 1244707 had been triaged several months earlier and assigned to the backlog, but the failure frequency had since increased dramatically. Pinging someone familiar with the test quickly led to discussion and resolution.

My experience in each of these cases was really rewarding: It took me just a few minutes to review the bug and bring it to the attention of someone who was interested and understood the failure.

Finding neglected bugs is more onerous. Orange Factor can be used to identify frequent test failures; the default view on https://brasstacks.mozilla.com/orangefactor/ provides a list, ordered by frequency, but most of those are not neglected: someone is already working on them and they just need time to investigate and land a fix. I think the sheriffs do a good job of finding owners for frequent intermittents, so it seems like 90% of the top intermittents have owners, who are usually actively working on resolving those issues. I don’t think there’s any way to see that activity on Orange Factor:

[Screenshot: Orange Factor default view]

So I end up opening lots of bugs each day before I find one that “needs help”. Broadly speaking, I’m looking for a search for bugs matching something like:

  •  intermittent test failure
  •  fails frequently (OrangeFactor Robot’s daily comment?)
  •  no recent (last 7 days?) human-generated (not OrangeFactor Robot) bug comments

OrangeFactor does a good job of identifying the frequent failures, but I don’t think it has any data on bug activity…and this notion of bug activity is hazy anyway. Ping me if you have a better intermittent orange triage procedure, or thoughts on how to do this more efficiently.

Update: I’ve been getting lots of ideas from folks on IRC for better triaging:

ryanvm

  • look to aurora/beta for bugs that have been around for longer
  • would be nice if a dashboard would show trends for a bug (now happening more frequently, etc) – like socorro
  • bugzilla data fed to presto, so marrying it to treeherder with redash may be possible (mdoglio may know more)

wlach

  • might be able to use redash for change detection/trends once treeherder’s db is hooked up to it

ekyle

  •  there’s an OrangeFactorv2 planned
  •  the bugzilla es cluster has all bug data in easy to query format

Skipping persistent intermittent failures

Our automated tests seem to fail a lot. Instead of a sea of green, a typical good push often looks more like:

[Screenshot: a typical mozilla-central push on Treeherder, with many oranges]

I’ve been thinking about ways that we can improve on that: Ways that we can reduce those pesky intermittent oranges.

Here’s one idea: Be more aggressive about disabling (skipping) tests that fail intermittently.

For today anyway, let’s put aside those tests that fail infrequently. If a test fails only rarely, there’s less to be gained by skipping it. It may also be harder to reproduce such failures, and harder to fix them and get them running again.

Instead, let’s concentrate (for now) on frequent, persistent test failures. There are lots of them:

[Screenshot: Orange Factor list of the most frequent intermittent failures for the week]

Notice that the most frequent intermittent failure for this one-week period is bug 1157948, which failed 721 times (well, it was reported/starred 721 times — it probably failed more than that!). Guess what happened the week before that? Yeah, another 700 or so oranges. And the week before that and … This is definitely a persistent, frequent intermittent failure.

I am actually intimately familiar with bug 1157948. I’ve worked hard to resolve it, and lots of other people have too, and I’m hopeful that a fix is landing for it right now. Still, it took over 3 months to fix this. What did we gain by running the affected tests for those 3 months? Was it worth the 10000+ failures that sheriffs and developers saw, read, diagnosed, and starred?

Bug 1157948 affected all taskcluster-initiated Android tests, so skipping the affected tests would have meant losing a lot of coverage. But it is not difficult to find other bugs with over 100 failures per week that affect just one test (like bug 1305601, just to point out an example). It would be easy to disable (skip-if annotate) this test while we work on it, and wouldn’t that be better? It won’t be fixed overnight, but it will continue to fail overnight — and there’s a cost to that.
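For reference, skipping a test like that is a one-line annotation in the test’s manifest, something like the following (the file name and condition here are invented; only the general form is real):

[test_example.html]
skip-if = toolkit == 'android' # Bug 1305601 - skipped while the frequent intermittent failure is investigated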

There’s a trade-off here for sure. A skipped test means less coverage. If another change causes a spontaneous fix to this test, we won’t notice the change if it is skipped. And we won’t notice a change in the frequency of failures. How important are these considerations, and are they important enough that we can live with seeing, reporting, and tracking all these test failures?

I’m not yet sure about the particulars of when and how to skip intermittent failures, but it feels like we would profit by being more aggressive about skipping troublesome tests, particularly those that fail frequently and persistently.

Firefox for Android Performance Measures – Q3 Check-up

Highlights:

  •  Recent outstanding improvements in APK size, memory use, and startup time, all due to :esawin’s efforts in bug 1291424.

APK Size

You can see the size of every build on treeherder using Perfherder.

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

[Graph: APK size over the quarter (Perfherder)]

As seen in the past, the APK size seems to gradually increase over time. But this quarter there is a pleasant surprise, with a recent very large improvement. That is :esawin’s change from bug 1291424. Nice!

 

Memory

We track some memory metrics using test_awsy_lite.

[Graph: test_awsy_lite memory measurements over the quarter]

Again, there is a tremendous improvement with bug 1291424. Thank you, :esawin!

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

[Graphs: tsvgx and tp4m results over the quarter]

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

Browser startup performance is measured on real phones (a variety of popular devices).

Here’s a quick summary for the local blank page test on various devices:

[Graph: Throbber Start / Throbber Stop, local blank page test, on various devices]

Again, there is an excellent performance improvement with bug 1291424. Yahoo!

See bug 953342 to track autophone throbber regressions (none this quarter).

Firefox for Android Performance Measures – Q2 Check-up

Highlights:

  • gradual increases in APK size and memory use
  • not much change in tsvgx or tp4m
  • autophone throbber data available in perfherder

APK Size

You can see the size of every build on treeherder using Perfherder.

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

 

[Graph: APK size over the quarter]

APK size generally grew, generally in small increments. Our APK is about 1.3 MB larger today than it was 3 months ago. The largest increase, of about 400 KB around May 4, was caused by and discussed in bug 1260208. The largest decrease, of about 200 KB around April 25, was caused by bug 1266102.

For the same period, libxul.so also generally grew gradually:

[Graph: libxul.so size over the quarter]

 

Memory

We track some memory metrics using test_awsy_lite.

[Graph: test_awsy_lite memory measurements over the quarter]

These memory measurements are fairly steady over the quarter, with a gradual increase over time.

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

[Graphs: tsvgx and tp4m results over the quarter]

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

Browser startup performance is measured on real phones (a variety of popular devices).

For the first time on this blog, I’ve pulled this graph from Perfherder, rather than phonedash. A wealth of throbber start/throbber stop data is now available in Perfherder. Here’s a quick summary for the local blank page test on various devices:

[Graph: Throbber Start / Throbber Stop, local blank page test, on various devices (Perfherder)]

See bug 953342 to track autophone throbber regressions.

 

Firefox for Android Performance Measures – Q1 Check-up

Highlights:

  • APK size reduction for downloadable fonts
  • now measuring memory via test_awsy_lite
  • tsvgx and tp4m moved to Autophone

APK Size

You can see the size of every build on treeherder using Perfherder.

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

[Graph: APK size over the quarter]

The dramatic decrease in February was caused by bug 1233799, which enabled the download content service and removed fonts from the APK.

For the same period, libxul.so generally increased in size:

[Graph: libxul.so size over the quarter]

The recent decrease in libxul was caused by bug 1259521, an upgrade of the Android NDK.

Memory

This quarter we began tracking some memory metrics, using test_awsy_lite.

[Graph: test_awsy_lite memory measurements over the quarter]

These memory measurements are generally steady over the quarter, with some small improvements.

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

In previous quarters, these tests were running on Pandaboards; beginning this quarter, these tests run on actual phones via Autophone.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

[Graphs: tsvgx and tp4m results over the quarter]

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

[Graph: Throbber Start on various devices (phonedash)]

[Graph: Throbber Stop on various devices (phonedash)]

There was a lot of work on Autophone this quarter, with new devices added and old devices retired or re-purposed. These graphs show devices running mozilla-central builds, of which none were in continuous use over the quarter.

Throbber Start/Stop test regressions are tracked by bug 953342; a recent regression in throbber start is under investigation in bug 1259479.

mozbench

mozbench has been retired. 😦

Long live arewefastyet.com! 🙂 I’ll check in on arewefastyet.com next quarter.

Reduce, reuse, recycle

As Firefox for Android drops support for ancient versions of Android, I find my collection of test phones becoming less and less relevant. For instance, I have a Galaxy S that works fine but only runs Android 2.2.1 (API 8), and I have a Galaxy Nexus that runs Android 4.0.1 (API 14). I cannot run current builds of Firefox for Android on either phone, and, perhaps because I rooted them or otherwise messed around with them in the distant past, neither phone will upgrade to a newer version of Android.

I have been letting these phones gather dust while I test on emulators, but I recently needed a real phone and managed to breathe new life into the Galaxy Nexus using an AOSP build. I wanted all the development bells and whistles and a root shell, so I made a full-eng build and updated the Galaxy Nexus to Android 4.3 (API 18) — good enough for Firefox for Android, at least for a while!

Basically, I followed the instructions at https://source.android.com/source/requirements.html, building on Ubuntu 14.04. For the Galaxy Nexus, that broke down to:

mkdir aosp
cd aosp
repo init -u https://android.googlesource.com/platform/manifest -b android-4.3_r1 # Galaxy Nexus
repo sync  # this can take several hours
# Download all binaries from the relevant section of 
#   https://developers.google.com/android/nexus/drivers .
# I used "Galaxy Nexus (GSM/HSPA+) binaries for Android 4.3 (JWR66Y)".
# Extract each (6x) downloaded archive, extracting into <aosp>.
# Execute each (6x) .sh and accept prompts, populating <aosp>/vendor.
source build/envsetup.sh
lunch full_maguro-eng
# use update-alternatives to select Java 6; I needed all 5 of these
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config javah
sudo update-alternatives --config javadoc
sudo update-alternatives --config javap
make -j4  # this can take a couple of hours

Once make completed, I had binaries in <aosp>/out/… I put the phone in bootloader mode (hold down Volume Up + Volume Down + Power to boot the Galaxy Nexus), connected it by USB, and executed “fastboot -w flashall”.

Actually, in my case, fastboot could not see the connected device unless I ran it as root. The root account didn’t have the right environment set up, so I needed to do something like:

sudo /bin/bash
source build/envsetup.sh
lunch full_maguro-eng
fastboot -w flashall
exit

If you are following along, don’t forget to undo your java update-alternatives when you are done!

It took some time to download and build, but the procedure was fairly straightforward and the results excellent: I feel like I have a new phone, perfectly clean and functional — and rooted!

(I have had no similar luck with the Galaxy S: AOSP binaries are only supplied for Nexus devices, and I see no AOSP instructions for the Galaxy S. Maybe it’s time to recycle this one.)

test_awsy_lite

Bug 1233220 added a new Android-only mochitest-chrome test called test_awsy_lite.html. Inspired by https://www.areweslimyet.com/mobile/, test_awsy_lite runs similar code and takes similar measurements to areweslimyet.com, but runs as a simple mochitest and reports results to Perfherder.

There are some interesting trade-offs to this approach to performance testing, compared to running a custom harness like areweslimyet.com or Talos.

+ Writing and adding a mochitest is very simple.

+ It is easy to report to Perfherder (see http://wrla.ch/blog/2015/11/perfherder-onward/).

+ Tests can be run locally to reproduce and debug test failures or irregularities.

+ There’s no special hardware to maintain. This is a big win compared to ad-hoc systems that might fail because someone kicks the phone hanging off the laptop that’s been tucked under their desk, or because of network changes, or failing hardware. areweslimyet.com/mobile was plagued by problems like this and hasn’t produced results in over a year.

? Your new mochitest is automatically run on every push…unless the test job is coalesced or optimized away by SETA.

? Results are tracked in Perfherder. I am a big fan of Perfherder and think it has a solid UI that works for a variety of data (APK sizes, build times, Talos results). I expect Perfherder will accommodate test_awsy_lite data too, but some comparisons may be less convenient to view in Perfherder compared to a custom UI, like areweslimyet.com.

– For Android, mochitests are run only on Android emulators, running on AWS. That may not be representative of performance on real phones — but I’m hoping memory use is similar on emulators.

– Tests cannot run for too long. Some Talos and other performance tests run many iterations or pause for long periods of time, resulting in run-times of 20 minutes or more. Generally, a mochitest should not run for that long and will probably cause some sort of timeout if it does.

For test_awsy_lite.html, I took a few short-cuts, worth noting:

  •  test_awsy_lite only reports “Resident memory” (RSS); other measurements like “Explicit memory” should be easy to add;
  •  test_awsy_lite loads fewer pages than areweslimyet.com/mobile, to keep run-time manageable; it runs in about 10 minutes, using about 6.5 minutes for page loads.

Results are in Perfherder. Add data for “android-2-3-armv7-api9” or “android-4-3-armv7-api15” and you will see various tests named “Resident Memory …”, each corresponding to a traditional areweslimyet.com measurement.
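For the curious, the numbers get to Perfherder via a PERFHERDER_DATA line that the test writes to the log and Treeherder ingests. Here is a hand-written sketch of what such a line can look like; the framework name, suite name, and value are illustrative, not copied from an actual test_awsy_lite run:

grep PERFHERDER_DATA mochitest-chrome.log
# PERFHERDER_DATA: {"framework": {"name": "awsy"}, "suites": [{"name": "Resident Memory TabsOpen", "value": 123456789}]}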

[Screenshot: Resident Memory results in Perfherder]