Neglected Oranges

I wrote earlier about my initial experience with triaging frequent intermittent test failures. I was happy to find that most of the most-frequent test failures were under active investigation, but that also meant that finding important bugs in need of triage was a frustrating and time-consuming process.

Thankfully, :ekyle provided me with a script to identify “neglected oranges”: Frequent intermittent test failure bugs with no recent comments. The neglected oranges script provides search results not unlike the default search on Orange Factor, but filters out bugs with recent comments from non-robots. It also shows the bug age and how long it has been since the last comment:
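
ekyle’s script is not reproduced here, but the core filter — keep only bugs with no recent non-robot comments — can be sketched against comment data from the Bugzilla REST API. This is just an illustration of the idea; the robot account name and the 7-day window are placeholders, not what the real script uses:

```python
import datetime

# Hypothetical robot account name; the real script knows the actual accounts.
ROBOT_ACCOUNTS = {"orangefactor@bots.tld"}
QUIET_DAYS = 7

def is_neglected(comments, now):
    """A bug is 'neglected' if no non-robot comment landed in the last QUIET_DAYS.

    `comments` is a list of dicts with 'creator' and 'time' keys, in the shape
    returned by Bugzilla's /rest/bug/<id>/comment endpoint.
    """
    cutoff = now - datetime.timedelta(days=QUIET_DAYS)
    for c in comments:
        when = datetime.datetime.strptime(c["time"], "%Y-%m-%dT%H:%M:%SZ")
        if c["creator"] not in ROBOT_ACCOUNTS and when > cutoff:
            return False  # a human commented recently; someone is on it
    return True
```

Run over the most frequent Orange Factor bugs, a filter like this separates the bugs that need attention from the ones already being worked.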

[Image: neglected oranges report]

This has provided a treasure trove of bugs for triage.

So, now that I can find bugs for frequent intermittent failures that don’t have anyone actively working on them, can I instigate action? Does this type of triage lead to bug resolution and a reduction in Orange Factor (average number of failures per push)? Here’s one way of looking at it: For the bugs I’ve recently triaged, I checked how long each bug was open before I commented on it; on average, those bugs were open for 65 days before my triage comment. Typically I tried to find someone familiar with the bug and pointed out that it was a frequently failing test; sometimes I offered some insight, or suggested some action (“this is a timeout in a long-running test; if it cannot be optimized or split up, requestLongerTimeout() should avoid the timeout”). On average, those bugs were resolved within 3 days of my triage comment. Wow!

I offer this evidence that triage of neglected oranges makes a difference, but also caution not to expect that much of a difference over time: I’ve chosen bugs that were open for months and with continued triage, we may quickly eliminate these long-neglected bugs (let’s hope!). I’ve also likely chosen “easy” bugs – bugs with an obvious, or at least apparent, resolution. There will also be intractable bugs, surely, and bugs without any apparent owner, or where interested parties cannot agree on a solution.

It is similarly difficult to draw conclusions from Orange Factor failure rates, but let’s look at those anyway, roughly for the time period I have been triaging:

[Image: Orange Factor failure rate graph, through October]

That’s encouraging, isn’t it? I don’t know how much of that improvement was instigated by my triage comments, but I like to think I have contributed to the improvement, and that this type of action can continue to drive down failure rates. I’ll keep spending at least a few hours each week on neglected oranges, and see how that goes for the next couple of months. Can we bring Orange Factor under 10? Under 5?

Timeout Triage

Many of our frequent intermittent test failures are timeouts. There are a lot of ways that a test – or a test job – can time out. Some popular bug titles demonstrate the range of failure messages:

  • This test exceeded the timeout threshold. It should be rewritten or split up. If that’s not possible, use requestLongerTimeout(N), but only as a last resort.
  • Test timed out.
  • TEST-UNEXPECTED-TIMEOUT
  • TimeoutException: Timed out after … seconds
  • application ran for longer than allowed maximum time
  • application timed out after … seconds with no output
  • Task timeout after 3600 seconds. Force killing container.

We have tried re-wording some of these messages with the aim of clarifying the cause of the timeout and possible remedies, but I still see lots of confusion in bugs. In some cases, I think a complete explanation is much more involved than we can hope to express in an error message. I think we should write up a wiki page or MDN article with detailed explanations of messages like this, and point to that page from error messages in the test log.

One of the first things I do when I see a test failure due to timeout is look for a successful run of the same test on the same platform, and then compare the timing between the success and failure cases. If a test takes 4 seconds to run in the success case but times out after 45 seconds, perhaps there is an intermittent hang; but if the test takes 40 seconds to run successfully and intermittently times out after 45 seconds, it’s probably just a long running test with normal variation in run time.
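
That comparison can be sketched in a few lines. The 25% margin below is an arbitrary assumption for illustration, not a threshold any harness actually uses:

```python
def classify_timeout(success_secs, timeout_secs, margin=0.25):
    """Rough triage heuristic: compare a test's typical successful run time
    against the timeout it intermittently hits.

    If the successful run time approaches the timeout, the test is probably
    just long-running with normal variation; if it is well under the timeout,
    the intermittent failure looks more like a hang.
    """
    if success_secs >= timeout_secs * (1 - margin):
        return "long-running test: consider requestLongerTimeout() or splitting it up"
    return "likely intermittent hang: investigate what the test is waiting on"
```

With the numbers from the example above, a 4-second success against a 45-second timeout suggests a hang, while a 40-second success suggests a long-running test.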

This suggests some nice-to-have tools:

  • push a new test to try, get a report of how long your test runs on each platform, perhaps with a warning if run-time approaches known time-outs, or perhaps some arbitrary threshold;
  • same for longest duration without output (avoid “no output timeout”);
  • use custom code or a special test harness mode to identify existing long-running tests, for proactive follow-up to prevent timeouts in the future.
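
The second idea — tracking the longest stretch without output — reduces to a simple computation over log-line timestamps; a minimal sketch, assuming the timestamps have already been parsed into seconds:

```python
def longest_silence(timestamps):
    """Return the longest gap, in seconds, between consecutive log lines.

    `timestamps` is a sorted list of log-line times in seconds. A gap that
    approaches the harness's no-output timeout is a warning sign, even if
    the job currently passes.
    """
    if len(timestamps) < 2:
        return 0.0
    return max(b - a for a, b in zip(timestamps, timestamps[1:]))
```

A try-based report could run this over each test job’s log and flag any job whose longest silence is within, say, 20% of the no-output timeout.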

Triaging with Orange Factor

Recently, I have been trying to spend a little time each day looking over the most frequent intermittent test failures in search of neglected bugs. I use Orange Factor to identify the most frequent failures, then scan the associated bugs in bugzilla to see if there is someone actively working on the bug.

I have had some encouraging successes. For example, in bug 1307388, I found a frequent intermittent with no one assigned and no sign of activity. The test had started failing recently – a few days earlier – with no sign of failures before that. A quick check of the mercurial logs showed that the test had been modified the day that it started failing, and a needinfo of the patch author led to immediate action.

In bug 1244707, the bug had been triaged several months ago and assigned to backlog, but the failure frequency had since increased dramatically. Pinging someone familiar with the test quickly led to discussion and resolution.

My experience in each of these cases was really rewarding: It took me just a few minutes to review the bug and bring it to the attention of someone who was interested and understood the failure.

Finding neglected bugs is more onerous. Orange Factor can be used to identify frequent test failures; the default view on https://brasstacks.mozilla.com/orangefactor/ provides a list, ordered by frequency, but most of those are not neglected — someone is already working on them and they just need time to investigate and land a fix. I think the sheriffs do a good job of finding owners for frequent intermittents, so it seems like 90% of the top intermittents have owners, and they are usually actively working on resolving those issues. I don’t think there’s any way to see that activity on Orange Factor:

[Image: Orange Factor default view]

So I end up opening lots of bugs each day before I find one that “needs help”. Broadly speaking, I’m looking for a search for bugs matching something like:

  •  intermittent test failure
  •  fails frequently (OrangeFactor Robot’s daily comment?)
  •  no recent (last 7 days?) human-generated (not OrangeFactor Robot) bug comments

OrangeFactor does a good job of identifying the frequent failures, but I don’t think it has any data on bug activity…and this notion of bug activity is hazy anyway. Ping me if you have a better intermittent orange triage procedure, or thoughts on how to do this more efficiently.

Update – I’ve been getting lots of ideas from folks on IRC for better triaging:

ryanvm

  • look to aurora/beta for bugs that have been around for longer
  • would be nice if a dashboard would show trends for a bug (now happening more frequently, etc) – like socorro
  • bugzilla data fed to presto, so marrying it to treeherder with redash may be possible (mdoglio may know more)

wlach

  • might be able to use redash for change detection/trends once treeherder’s db is hooked up to it

ekyle

  •  there’s an OrangeFactorv2 planned
  •  the bugzilla es cluster has all bug data in easy to query format

Skipping persistent intermittent failures

Our automated tests seem to fail a lot. Instead of a sea of green, a typical good push often looks more like:

[Image: typical mozilla-central push on treeherder]

I’ve been thinking about ways that we can improve on that: Ways that we can reduce those pesky intermittent oranges.

Here’s one idea: Be more aggressive about disabling (skipping) tests that fail intermittently.

For today anyway, let’s put aside those tests that fail infrequently. If a test fails only rarely, there’s less to be gained by skipping it. It may also be harder to reproduce such failures, and harder to fix them and get them running again.

Instead, let’s concentrate (for now) on frequent, persistent test failures. There are lots of them:

[Image: Orange Factor top intermittent failures]

Notice that the most frequent intermittent failure for this one-week period is bug 1157948, which failed 721 times (well, it was reported/starred 721 times — it probably failed more than that!). Guess what happened the week before that? Yeah, another 700 or so oranges. And the week before that and … This is definitely a persistent, frequent intermittent failure.

I am actually intimately familiar with bug 1157948. I’ve worked hard to resolve it, and lots of other people have too, and I’m hopeful that a fix is landing for it right now. Still, it took over 3 months to fix this. What did we gain by running the affected tests for those 3 months? Was it worth the 10000+ failures that sheriffs and developers saw, read, diagnosed, and starred?

Bug 1157948 affected all taskcluster-initiated Android tests, so skipping the affected tests would have meant losing a lot of coverage. But it is not difficult to find other bugs with over 100 failures per week that affect just one test (like bug 1305601, just to point out an example). It would be easy to disable (skip-if annotate) this test while we work on it, and wouldn’t that be better? It won’t be fixed overnight, but it will continue to fail overnight — and there’s a cost to that.
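
The mechanics of skipping are trivial: a one-line annotation in the test manifest. A hypothetical example (the test name, condition, and bug number are made up; only the mechanism is real):

```ini
[test_example.html]
skip-if = os == 'android'  # Bug 1234567 - frequent intermittent timeout; re-enable when fixed
```

The annotation documents why the test is skipped and leaves an obvious breadcrumb for re-enabling it.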

There’s a trade-off here for sure. A skipped test means less coverage. If another change causes a spontaneous fix to this test, we won’t notice the change if it is skipped. And we won’t notice a change in the frequency of failures. How important are these considerations, and are they important enough that we can live with seeing, reporting, and tracking all these test failures?

I’m not yet sure about the particulars of when and how to skip intermittent failures, but it feels like we would profit by being more aggressive about skipping troublesome tests, particularly those that fail frequently and persistently.

Firefox for Android Performance Measures – Q3 Check-up

Highlights:

  •  Recent outstanding improvements in APK size, memory use, and startup time, all due to :esawin’s efforts in bug 1291424.

APK Size

You can see the size of every build on treeherder using Perfherder.

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

[Image: APK size graph]

As seen in the past, the APK size seems to gradually increase over time. But this quarter there is a pleasant surprise, with a recent very large improvement. That is :esawin’s change from bug 1291424. Nice!

Memory

We track some memory metrics using test_awsy_lite.

[Image: memory metrics graph]

Again, there is a tremendous improvement with bug 1291424. Thank you :esawin!

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

[Image: tsvgx and tp4m graphs]

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

Browser startup performance is measured on real phones (a variety of popular devices).

Here’s a quick summary for the local blank page test on various devices:

[Image: throbber start/stop summary]

Again, there is an excellent performance improvement with bug 1291424. Yahoo!

See bug 953342 to track autophone throbber regressions (none this quarter).

Firefox for Android Performance Measures – Q2 Check-up

Highlights:

  • gradual increases in APK size and memory use
  • not much change in tsvgx or tp4m
  • autophone throbber data available in perfherder

APK Size

You can see the size of every build on treeherder using Perfherder.

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

[Image: APK size graph]

APK size grew, generally in small increments. Our APK is about 1.3 MB larger today than it was 3 months ago. The largest increase, of about 400 KB around May 4, was caused by and discussed in bug 1260208. The largest decrease, of about 200 KB around April 25, was caused by bug 1266102.

For the same period, libxul.so also generally grew gradually:

[Image: libxul.so size graph]

Memory

We track some memory metrics using test_awsy_lite.

[Image: memory metrics graph]

These memory measurements are fairly steady over the quarter, with a gradual increase over time.

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

[Image: tsvgx and tp4m graphs]

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

Browser startup performance is measured on real phones (a variety of popular devices).

For the first time on this blog, I’ve pulled this graph from Perfherder, rather than phonedash. A wealth of throbber start/throbber stop data is now available in Perfherder. Here’s a quick summary for the local blank page test on various devices:

[Image: throbber start/stop summary]

See bug 953342 to track autophone throbber regressions.

Firefox for Android Performance Measures – Q1 Check-up

Highlights:

  • APK size reduction for downloadable fonts
  • now measuring memory via test_awsy_lite
  • tsvgx and tp4m moved to Autophone

APK Size

You can see the size of every build on treeherder using Perfherder.

Here’s how the APK size changed over the quarter, for mozilla-central Android 4.0 API15+ opt builds:

[Image: APK size graph]

The dramatic decrease in February was caused by bug 1233799, which enabled the download content service and removed fonts from the APK.

For the same period, libxul.so generally increased in size:

[Image: libxul.so size graph]

The recent decrease in libxul was caused by bug 1259521, an upgrade of the Android NDK.

Memory

This quarter we began tracking some memory metrics, using test_awsy_lite.

[Image: memory metrics graph]

These memory measurements are generally steady over the quarter, with some small improvements.

Autophone-Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Autophone, on android-6-0-armv8-api15. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

In previous quarters, these tests were running on Pandaboards; beginning this quarter, these tests run on actual phones via Autophone.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

tp4m

Generic page load test. Lower values are better.

[Image: tsvgx and tp4m graphs]

No significant improvements or regressions noted for tsvgx or tp4m.

Autophone

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

[Image: throbber start graph]

[Image: throbber stop graph]

There was a lot of work on Autophone this quarter, with new devices added and old devices retired or re-purposed. These graphs show devices running mozilla-central builds, of which none were in continuous use over the quarter.

Throbber Start/Stop test regressions are tracked by bug 953342; a recent regression in throbber start is under investigation in bug 1259479.

mozbench

mozbench has been retired.😦

Long live arewefastyet.com!🙂 I’ll check in on arewefastyet.com next quarter.

Reduce, reuse, recycle

As Firefox for Android drops support for ancient versions of Android, I find my collection of test phones becoming less and less relevant. For instance, I have a Galaxy S that works fine but only runs Android 2.2.1 (API 8), and I have a Galaxy Nexus that runs Android 4.0.1 (API 14). I cannot run current builds of Firefox for Android on either phone, and, perhaps because I rooted them or otherwise messed around with them in the distant past, neither phone will upgrade to a newer version of Android.

I have been letting these phones gather dust while I test on emulators, but I recently needed a real phone and managed to breathe new life into the Galaxy Nexus using an AOSP build. I wanted all the development bells and whistles and a root shell, so I made a full-eng build and updated the Galaxy Nexus to Android 4.3 (API 18) — good enough for Firefox for Android, at least for a while!

Basically, I followed the instructions at https://source.android.com/source/requirements.html, building on Ubuntu 14.04. For the Galaxy Nexus, that broke down to:

mkdir aosp
cd aosp
repo init -u https://android.googlesource.com/platform/manifest -b android-4.3_r1 # Galaxy Nexus
repo sync  # this can take several hours
# Download all binaries from the relevant section of 
#   https://developers.google.com/android/nexus/drivers .
# I used "Galaxy Nexus (GSM/HSPA+) binaries for Android 4.3 (JWR66Y)".
# Extract each (6x) downloaded archive, extracting into <aosp>.
# Execute each (6x) .sh and accept prompts, populating <aosp>/vendor.
source build/envsetup.sh
lunch full_maguro-eng
# use update-alternatives to select Java 6; I needed all 5 of these
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config javah
sudo update-alternatives --config javadoc
sudo update-alternatives --config javap
make -j4  # this can take a couple of hours

Once make completed, I had binaries in <aosp>/out/… I put the phone in bootloader mode (hold down Volume Up + Volume Down + Power to boot the Galaxy Nexus), connected it by USB, and executed “fastboot -w flashall”.

Actually, in my case, fastboot could not see the connected device unless I ran it as root. In the root account, I didn’t have the right settings, so I needed to do something like:

sudo /bin/bash
source build/envsetup.sh
lunch full_maguro-eng
fastboot -w flashall
exit

If you are following along, don’t forget to undo your java update-alternatives when you are done!

It took some time to download and build, but the procedure was fairly straightforward and the results excellent: I feel like I have a new phone, perfectly clean and functional — and rooted!

(I have had no similar luck with the Galaxy S: AOSP binaries are only supplied for Nexus devices, and I see no AOSP instructions for the Galaxy S. Maybe it’s time to recycle this one.)

test_awsy_lite

Bug 1233220 added a new Android-only mochitest-chrome test called test_awsy_lite.html. Inspired by https://www.areweslimyet.com/mobile/, test_awsy_lite runs similar code and takes similar measurements to areweslimyet.com, but runs as a simple mochitest and reports results to Perfherder.

There are some interesting trade-offs to this approach to performance testing, compared to running a custom harness like areweslimyet.com or Talos.

+ Writing and adding a mochitest is very simple.

+ It is easy to report to Perfherder (see http://wrla.ch/blog/2015/11/perfherder-onward/).

+ Tests can be run locally to reproduce and debug test failures or irregularities.

+ There’s no special hardware to maintain. This is a big win compared to ad-hoc systems that might fail because someone kicks the phone hanging off the laptop that’s been tucked under their desk, or because of network changes, or failing hardware. areweslimyet.com/mobile was plagued by problems like this and hasn’t produced results in over a year.

? Your new mochitest is automatically run on every push…unless the test job is coalesced or optimized away by SETA.

? Results are tracked in Perfherder. I am a big fan of Perfherder and think it has a solid UI that works for a variety of data (APK sizes, build times, Talos results). I expect Perfherder will accommodate test_awsy_lite data too, but some comparisons may be less convenient to view in Perfherder compared to a custom UI, like areweslimyet.com.

– For Android, mochitests are run only on Android emulators, running on AWS. That may not be representative of performance on real phones — but I’m hoping memory use is similar on emulators.

– Tests cannot run for too long. Some Talos and other performance tests run many iterations or pause for long periods of time, resulting in run-times of 20 minutes or more. Generally, a mochitest should not run for that long and will probably cause some sort of timeout if it does.
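
The Perfherder reporting mentioned above boils down to logging a specially formatted PERFHERDER_DATA line that treeherder’s log parser picks up. A minimal sketch of the idea, with hypothetical framework, suite, and subtest names:

```python
import json

def report_to_perfherder(suite_name, values):
    """Emit a PERFHERDER_DATA log line for treeherder/Perfherder to ingest.

    `values` maps subtest names to numeric results (e.g. bytes of RSS).
    The framework and suite names here are placeholders for illustration.
    """
    data = {
        "framework": {"name": "awsy"},  # hypothetical framework name
        "suites": [{
            "name": suite_name,
            "subtests": [
                {"name": name, "value": value}
                for name, value in sorted(values.items())
            ],
        }],
    }
    line = "PERFHERDER_DATA: " + json.dumps(data)
    print(line)
    return line
```

Anything a test can measure and print in this shape shows up as a graphable series in Perfherder.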

For test_awsy_lite.html, I took a few short-cuts, worth noting:

  •  test_awsy_lite only reports “Resident memory” (RSS); other measurements like “Explicit memory” should be easy to add;
  •  test_awsy_lite loads fewer pages than areweslimyet.com/mobile, to keep run-time manageable; it runs in about 10 minutes, using about 6.5 minutes for page loads.

Results are in Perfherder. Add data for “android-2-3-armv7-api9” or “android-4-3-armv7-api15” and you will see various tests named “Resident Memory …”, each corresponding to a traditional areweslimyet.com measurement.

[Image: Perfherder view of test_awsy_lite results]

Firefox for Android Performance Measures – Q4 Check-up

Highlights:

  •  now measuring APK size
  •  tcheck2 (temporarily) retired
  •  tsvgx and tp4m improved – thanks :jchen!

APK Size

This quarter we began tracking the size of the Firefox for Android APK, and some of its components. You can see the size of every build on treeherder using Perfherder.

Here’s how the APK size changed over the last 2 months, for mozilla-central Android 4.0 opt builds:

[Image: APK size graph]

There are lots of increases and a few decreases here. The most significant decrease (almost half a megabyte) is on Nov 23, from mfinkle’s change for Bug 1223526. The most significant increase (~200K) is on Dec 20, from a Skia update, Bug 1082598.

It is worth noting that the sizes of libxul.so over the same period were almost always increasing:

[Image: libxul.so size graph]

Talos

This section tracks Perfherder graphs for mozilla-central builds of Firefox for Android, for Talos tests run on Android 4.0 Opt. The test names shown are those used on treeherder. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

We intend to retire the remaining Android Talos tests, migrating these tests to autophone in the very near future.

tcheck2

Measure of “checkerboarding” during simulation of real user interaction with page. Lower values are better.

This test is no longer running. It was noisy and needed to be rewritten for APZ. See discussion in bug 1213032 and bug 1230572.

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

[Image: tsvgx graph]

730 (start of period) – 110 (end of period)

A small regression at the end of November corresponded with the introduction of APZ; it was investigated in bug 1229118. An extraordinary improvement on Dec 25 was the result of jchen’s refactoring.

tp4m

Generic page load test. Lower values are better.

[Image: tp4m graph]

730 (start of period) – 680 (end of period)

Note the same regression and improvement as seen in tsvgx.

Autophone

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

[Image: throbber start graph]

[Image: throbber stop graph]

Eideticker

Android tests are no longer run on Eideticker.

mozbench

These graphs are taken from the mozbench dashboard at http://ouija.allizom.org/grafana/index.html#/dashboard/file/mozbench.json which includes some comparisons involving Firefox for Android. More info at https://wiki.mozilla.org/Auto-tools/Projects/Mozbench.

[Image: mozbench graph]

Sadly, the other mobile benchmarks have no data for most of November and December…I’m not sure why.