New try aliases “xpcshell” and “robocop”

We now have two new try aliases which will be of interest to some mobile developers.

Android 2.3 tests run xpcshell tests in 3 chunks, which can be specified in a try push:

try: … -u xpcshell-1,xpcshell-2,xpcshell-3

but since all other test platforms run xpcshell as a single chunk, it’s easy to forget about Android 2.3’s chunks and push something like:

try: -b o -p all -u xpcshell -t none

…and then wonder why xpcshell tests didn’t run for Android 2.3!

As of today, a new try alias recognizes “xpcshell” to mean “run all the xpcshell test chunks”.

Similarly, a new try alias recognizes “robocop” to mean “run all the robocop test chunks”.

An example: https://tbpl.mozilla.org/?tree=Try&rev=e52bcf945dcd

How convenient!

(Of course, “-u xpcshell-1”, “-u robocop-2, robocop-3”, etc. still work, and you should use them if you only need to run specific chunks.)
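The alias behaviour described above can be pictured as a simple expansion step. The sketch below is illustrative only and is not the actual try parser: the `CHUNKS` table and the `expand_aliases` helper are hypothetical, and the robocop chunk count is an assumption (the post only confirms three xpcshell chunks on Android 2.3).

```python
# Hypothetical chunk table; only the xpcshell count (3) comes from the post.
CHUNKS = {"xpcshell": 3, "robocop": 3}


def expand_aliases(unittests):
    """Expand bare suite names in a -u argument into their chunk names."""
    expanded = []
    for name in (part.strip() for part in unittests.split(",")):
        if name in CHUNKS:
            # A bare alias means "run every chunk of this suite".
            expanded.extend(f"{name}-{i}" for i in range(1, CHUNKS[name] + 1))
        elif name:
            # Explicit chunk names and other suites pass through unchanged.
            expanded.append(name)
    return expanded


print(expand_aliases("xpcshell"))
print(expand_aliases("robocop-2, robocop-3"))
```

Explicit chunk names are left alone, which matches the note that “-u xpcshell-1” and friends still work.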

Thanks to :Callek and :RyanVM for making this happen.

Firefox for Android Performance Measures – June check-up

My monthly review of Firefox for Android performance measurements. June highlights:

- Talos values tracked here switch to Android 4.0, rather than Android 2.2

- Talos regressions in tcheck2 and tsvgx

- small regression in time to throbber stop

- Eideticker still not reporting results.

Talos

This section tracks Perfomatic graphs from graphs.mozilla.org for mozilla-central builds of Native Fennec. In all of my previous posts, this section has tracked Talos for Android 2.2 Opt. This month, and going forward, I switch to Android 4.0 Opt, since the Android 2.2 Opt tests are being phased out. The test names shown are those used on tbpl. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tcanvasmark

This test is not currently run on Android 4.0.

tcheck2

Measure of “checkerboarding” during simulation of real user interaction with page. Lower values are better.

6 (start of period) – 12 (end of period)

Regression of June 17 – bug 1026742.

trobopan

Panning performance test. Value is square of frame delays (ms greater than 25 ms) encountered while panning. Lower values are better.

50000 (start of period) – 50000 (end of period)

There was a large temporary regression between June 12 and June 14 – bug 1026798.
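As a rough illustration of the trobopan metric, the sketch below sums the squares of frame delays that exceed 25 ms. This is one reading of the description, not the Talos implementation: whether the harness squares the whole delay or only the excess over 25 ms is not stated here, and the function name and sample delays are made up.

```python
THRESHOLD_MS = 25.0  # frame delays at or below this are ignored


def pan_score(frame_delays_ms):
    """Sum of squared frame delays, counting only delays above the threshold.

    Assumption: the whole delay is squared, not just the excess over 25 ms.
    """
    return sum(d * d for d in frame_delays_ms if d > THRESHOLD_MS)


# Made-up delays: only 30 ms and 100 ms exceed the threshold.
print(pan_score([16.0, 30.0, 100.0, 20.0]))
```

Because the delays are squared, a few long stalls dominate the score, which is consistent with the large values reported for this test.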

tprovider

Performance of history and bookmarks’ provider. Reports time (ms) to perform a group of database operations. Lower values are better.

520 (start of period) – 520 (end of period).

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

6100 (start of period) – 6300 (end of period).

Regression of June 16 – bug 1026551.

tp4m

Generic page load test. Lower values are better.

940 (start of period) – 940 (end of period).

ts_paint

Startup performance test. Lower values are better.

3600 (start of period) – 3600 (end of period).
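The start and end values above can be turned into relative changes with a few lines of arithmetic. The sketch below uses the June numbers from this section; the helper names are hypothetical, and the labels only reflect the direction of the change, not whether it is statistically significant.

```python
# (start, end) values for June, copied from the Talos section above.
# All of these tests are "lower is better".
JUNE = {
    "tcheck2": (6, 12),
    "trobopan": (50000, 50000),
    "tprovider": (520, 520),
    "tsvgx": (6100, 6300),
    "tp4m": (940, 940),
    "ts_paint": (3600, 3600),
}


def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100.0


def classify(start, end, lower_is_better=True):
    """Label a change as 'regression', 'improvement' or 'flat'."""
    if end == start:
        return "flat"
    got_worse = end > start if lower_is_better else end < start
    return "regression" if got_worse else "improvement"


for test, (start, end) in JUNE.items():
    print(f"{test:10s} {percent_change(start, end):+6.1f}%  {classify(start, end)}")
```

For a “higher is better” test such as tcanvasmark, pass `lower_is_better=False` so that a falling score is labelled a regression.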

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

“Time to throbber start” looks very flat for all devices, but “Time to throbber stop” has a slight upward trend, especially for nexus-s-2 — bug 1032249.

Eideticker

These graphs are taken from http://eideticker.mozilla.org. Eideticker is a performance harness that measures user perceived performance of web browsers by video capturing them in action and subsequently running image analysis on the raw result.

More info at: https://wiki.mozilla.org/Project_Eideticker

Eideticker results are still not available. We’ll check back at the end of July.

Firefox for Android Performance Measures – May check-up

My monthly review of Firefox for Android performance measurements. May highlights:

- slight regressions in tcanvasmark and trobopan

- small regression in time to throbber stop

- Eideticker still not reporting results.

Talos

This section tracks Perfomatic graphs from graphs.mozilla.org for mozilla-central builds of Native Fennec (Android 2.2 opt). The test names shown are those used on tbpl. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tcanvasmark

This test runs the third-party CanvasMark benchmark suite, which measures the browser’s ability to render a variety of canvas animations at a smooth framerate as the scenes grow more complex. Results are a score “based on the length of time the browser was able to maintain the test scene at greater than 30 FPS, multiplied by a weighting for the complexity of each test type”. Higher values are better.

6300 (start of period) – 5700 (end of period).

Regression of May 12 – bug 1009646.
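The scoring rule quoted in the tcanvasmark description reads like a weighted sum, which can be sketched as follows. This is only an interpretation of that one quoted sentence, not CanvasMark's actual code: the function name, the weights and the durations are all hypothetical.

```python
def canvasmark_style_score(results):
    """Sum of (ms sustained above 30 FPS) * (complexity weight) per test scene.

    `results` is a list of (ms_above_30fps, weight) pairs; both values are
    hypothetical inputs, not real CanvasMark data.
    """
    return sum(ms * weight for ms, weight in results)


# Made-up example: two scenes with different complexity weights.
print(canvasmark_style_score([(4000, 0.5), (2500, 1.0)]))
```

Under this reading, a build that holds 30 FPS for less time on any scene scores lower, which is why higher values are better.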

tcheck2

Measure of “checkerboarding” during simulation of real user interaction with page. Lower values are better.

9 (start of period) – 9 (end of period)

trobopan

Panning performance test. Value is square of frame delays (ms greater than 25 ms) encountered while panning. Lower values are better.

110000 (start of period) – 130000 (end of period)

This regression just happened today and has not triggered a Talos alert yet — I don’t have a bug number yet.

tprovider

Performance of history and bookmarks’ provider. Reports time (ms) to perform a group of database operations. Lower values are better.

425 (start of period) – 425 (end of period).

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

7300 (start of period) – 7300 (end of period).

tp4m

Generic page load test. Lower values are better.

750 (start of period) – 750 (end of period).

ts_paint

Startup performance test. Lower values are better.

3600 (start of period) – 3600 (end of period).

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

The improvement on May 2 was due to a change in the test setup (sut vs adb).

The small regression of May 11 is tracked in bug 1018463.

Eideticker

These graphs are taken from http://eideticker.mozilla.org. Eideticker is a performance harness that measures user perceived performance of web browsers by video capturing them in action and subsequently running image analysis on the raw result.

More info at: https://wiki.mozilla.org/Project_Eideticker

Eideticker results are still not available. We’ll check back at the end of June.

Firefox for Android Performance Measures – April check-up

My monthly review of Firefox for Android performance measurements. April highlights:

- No Talos regressions, no Throbber Start/Stop regressions.

- tcheck2 improvement.

- Eideticker still not reporting results.

Talos

This section tracks Perfomatic graphs from graphs.mozilla.org for mozilla-central builds of Native Fennec (Android 2.2 opt). The test names shown are those used on tbpl. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tcanvasmark

This test runs the third-party CanvasMark benchmark suite, which measures the browser’s ability to render a variety of canvas animations at a smooth framerate as the scenes grow more complex. Results are a score “based on the length of time the browser was able to maintain the test scene at greater than 30 FPS, multiplied by a weighting for the complexity of each test type”. Higher values are better.

6300 (start of period) – 6300 (end of period).

tcheck2

Measure of “checkerboarding” during simulation of real user interaction with page. Lower values are better.

24 (start of period) – 9 (end of period)

Note significant improvement and noise reduction.

trobopan

Panning performance test. Value is square of frame delays (ms greater than 25 ms) encountered while panning. Lower values are better.

110000 (start of period) – 110000 (end of period)

tprovider

Performance of history and bookmarks’ provider. Reports time (ms) to perform a group of database operations. Lower values are better.

425 (start of period) – 425 (end of period).

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

7300 (start of period) – 7300 (end of period).

ts_paint

Startup performance test. Lower values are better.

3600 (start of period) – 3600 (end of period).

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

No regressions noted this month.

Eideticker

These graphs are taken from http://eideticker.mozilla.org. Eideticker is a performance harness that measures user perceived performance of web browsers by video capturing them in action and subsequently running image analysis on the raw result.

More info at: https://wiki.mozilla.org/Project_Eideticker

Eideticker results are still not available. We’ll check back at the end of May.

Android 2.3 Opt tests on tbpl

Today, we started running some Android 2.3 Opt tests on tbpl.

“Android 2.3 Opt” tests run on emulators running Android 2.3. The emulator is simply the Android ARM emulator, taken from the Android SDK (version 18). The emulator runs a special build of Gingerbread (2.3.7), patched and built specifically to support our Android tests. The emulator is running on an AWS EC2 host. Android 2.3 Opt runs one emulator at a time on a host (unlike the Android x86 emulator tests, which run up to 4 emulators concurrently on one ix host).

Android 2.3 Opt tests generally run slower than tests run on devices. We have found that tests run faster on faster hosts; for instance, if we run the emulator on an AWS m3.large instance (more memory, more CPU), mochitests run in about 1/3 of the time they currently take on m1.medium instances.

Reftests – plain reftests, js reftests, and crashtests – run particularly slowly. In fact, they take so long that we cannot run them to completion with a reasonable number of test chunks. We are investigating further and also considering the simple solution: running on different hosts.

We have no plans to run Talos tests on Android 2.3 Opt; we think there is limited value in running performance tests on emulators.

Android 2.3 Opt tests are supported on try: “try: -b o -p android …”. You can also request that a slave be loaned to you for debugging more intense problems: https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_a_slave. In my experience, these methods – try and slave loans – are more effective at reproducing test results than running an emulator locally: the host seems to affect the emulator’s behavior in significant and unpredictable ways.

Once the Android 2.3 Opt tests are running reliably, we hope to stop the corresponding tests on Android 2.2 Opt, reducing the burden on our old and limited population of Tegra boards.

As with any new test platform, we had to disable some tests to get a clean run suitable for tbpl. These are tracked in bug 979921.

There are also a few unresolved issues causing infrequent problems in active tests. These are tracked in bug 967704.

Firefox for Android Performance Measures – March check-up

My monthly review of Firefox for Android performance measurements. March highlights:

- 3 throbber start/stop regressions

- Eideticker not reporting results for the last couple of weeks.

Talos

This section tracks Perfomatic graphs from graphs.mozilla.org for mozilla-central builds of Native Fennec (Android 2.2 opt). The test names shown are those used on tbpl. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tcanvasmark

This test runs the third-party CanvasMark benchmark suite, which measures the browser’s ability to render a variety of canvas animations at a smooth framerate as the scenes grow more complex. Results are a score “based on the length of time the browser was able to maintain the test scene at greater than 30 FPS, multiplied by a weighting for the complexity of each test type”. Higher values are better.

7200 (start of period) – 6300 (end of period).

Regression of March 5 – bug 980423 (disable skia-gl).

tcheck2

Measure of “checkerboarding” during simulation of real user interaction with page. Lower values are better.

24 (start of period) – 24 (end of period)

trobopan

Panning performance test. Value is square of frame delays (ms greater than 25 ms) encountered while panning. Lower values are better.

110000 (start of period) – 110000 (end of period)

tprovider

Performance of history and bookmarks’ provider. Reports time (ms) to perform a group of database operations. Lower values are better.

375 (start of period) – 425 (end of period).

Regression of March 29 – bug 990101. (test modified)

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

7600 (start of period) – 7300 (end of period).

tp4m

Generic page load test. Lower values are better.

710 (start of period) – 750 (end of period).

No specific regression identified.

ts_paint

Startup performance test. Lower values are better.

3600 (start of period) – 3600 (end of period).

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

3 regressions were reported this month: bug 980757, bug 982864, bug 986416.

:bc continued his work on noise reduction in March. Changes in the test setup have likely affected the phonedash graphs this month. We’ll check back at the end of April.

Eideticker

These graphs are taken from http://eideticker.mozilla.org. Eideticker is a performance harness that measures user perceived performance of web browsers by video capturing them in action and subsequently running image analysis on the raw result.

More info at: https://wiki.mozilla.org/Project_Eideticker

Eideticker results for the last couple of weeks are not available. We’ll check back at the end of April.

Android 4.0 Debug tests on tbpl

Today, we started running some Android 4.0 Debug mochitests and js-reftests on tbpl.

“Android 4.0 Debug” tests run on Pandaboards running Android 4.0, just like the existing “Android 4.0 Opt” tests which have been running for some time. Unlike the Opt tests, the Debug tests run debug builds, with more log messages than Opt and notably, assertions. The “complete logcats” can be very useful for these jobs — see Complete logcats for Android tests.

Other test suites – the remaining mochitest chunks, robocop, reftests, etc. – run on Android 4.0 Debug only on the Cedar tree at this time. They mostly work, but have failures that make them too unreliable to run on trunk trees. Would you like to see more Android 4.0 Debug tests running? A few test failures are all that is blocking us from running the remainder of our test suites. See Bug 940068 for the list of known failures.

Firefox for Android Performance Measures – February check-up

My monthly review of Firefox for Android performance measurements.

February highlights:

- Regressions in tcanvasmark, tcheck2, and tsvgx; improvement in ts_paint.

- Improvements in some eideticker startup measurements.

Talos

This section tracks Perfomatic graphs from graphs.mozilla.org for mozilla-central builds of Native Fennec (Android 2.2 opt). The test names shown are those used on tbpl. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tcanvasmark

This test runs the third-party CanvasMark benchmark suite, which measures the browser’s ability to render a variety of canvas animations at a smooth framerate as the scenes grow more complex. Results are a score “based on the length of time the browser was able to maintain the test scene at greater than 30 FPS, multiplied by a weighting for the complexity of each test type”. Higher values are better.

7800 (start of period) – 7200 (end of period).

Regression on Feb 19 – bug 978958.

tcheck2

Measure of “checkerboarding” during simulation of real user interaction with page. Lower values are better.

2.7 (start of period) – 24 (end of period)

Regression of Feb 25: bug 976563.

trobopan

Panning performance test. Value is square of frame delays (ms greater than 25 ms) encountered while panning. Lower values are better.

110000 (start of period) – 110000 (end of period)

tprovider

Performance of history and bookmarks’ provider. Reports time (ms) to perform a group of database operations. Lower values are better.

375 (start of period) – 375 (end of period).

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

7500 (start of period) – 7600 (end of period).

This test both improved and regressed slightly over the month, for a slight overall regression. Bug 978878.

tp4m

Generic page load test. Lower values are better.

700 (start of period) – 710 (end of period).

No specific regression identified.

ts_paint

Startup performance test. Lower values are better.

4300 (start of period) – 3600 (end of period).

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

“Time to throbber start” measures the time from process launch to the start of the throbber animation. Smaller values are better.

“Time to throbber stop” measures the time from process launch to the end of the throbber animation. Smaller values are better.

:bc has been working on reducing noise in these results, and the improvement is visible in the phonedash graphs. And there is more to come!

Eideticker

These graphs are taken from http://eideticker.mozilla.org. Eideticker is a performance harness that measures user perceived performance of web browsers by video capturing them in action and subsequently running image analysis on the raw result.

More info at: https://wiki.mozilla.org/Project_Eideticker

There is some improvement from last month’s startup measurements.

awsy

See https://www.areweslimyet.com/mobile/ for content and background information.

I did not notice any big changes this month.

Complete logcats for Android tests

“Logcats” – those Android logs you see when you execute “adb logcat” – are an essential part of debugging Firefox for Android. For a long time, we have included logcats in our Android test logs on tbpl: After a test run, we run logcat on the device, collect the output and dump it to the test log. Sometimes those logcats are very useful; other times, they are too little, too late. A typical problem is that a failure occurs early in a test run, but does not cause the test to fail immediately; by the time the test ends, the fixed-size logcat buffer has filled up and overwritten the earlier, important messages. How frustrating!
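The overwrite problem is easy to reproduce with any fixed-size buffer. The sketch below uses a Python `deque` as a stand-in for the device's log buffer; the capacity of 5 lines is purely illustrative (real logcat buffers are sized in bytes, not lines).

```python
from collections import deque

# A tiny stand-in for the device's fixed-size log buffer (real buffers are
# sized in bytes, not lines; the capacity of 5 lines is purely illustrative).
log_buffer = deque(maxlen=5)

log_buffer.append("early: important failure message")
for i in range(10):
    log_buffer.append(f"later: routine message {i}")

# Dumping the buffer only at the end of the run misses the early message.
print(any(line.startswith("early") for line in log_buffer))
print(list(log_buffer))
```

Only the most recent entries survive, so a dump taken at the end of a long run no longer contains the message that explains the failure.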

Now Android 2.3, Android 4.0 and Android 4.2 x86 test jobs offer “complete logcats”: logcat runs for the duration of the test job, and its output is collected continuously and written to a file. At the end of the test job, the file is uploaded to an AWS server, and a link is displayed in tbpl. Here’s a sample of a tbpl summary:

[Screenshot: tbpl job summary, including a “blobuploader” link]

Notice the (blobuploader) line? Open that link and you have a complete logcat showing what was happening on the device for the duration of the test job.
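The “run logcat for the whole job” idea can be sketched with a small context manager. This is not the harness's actual code: `capture_log` is a hypothetical helper, the default command assumes `adb` is on the PATH with a device attached, and the upload step is left out entirely.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def capture_log(path, command=("adb", "logcat", "-v", "time")):
    """Run `command` for the duration of the with-block, streaming its output to `path`."""
    with open(path, "wb") as log_file:
        # The child process writes straight to the file, so nothing is lost
        # even if the device's own log buffer wraps around during the run.
        proc = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
        try:
            yield proc
        finally:
            # Stop the logger once the test job is done; the file is then
            # complete and ready to be uploaded.
            proc.terminate()
            proc.wait()


# Typical use (run_tests is a placeholder for the real test job):
#
#     with capture_log("logcat.txt"):
#         run_tests()
```

Taking the command as a parameter keeps the sketch usable without a device, for example by substituting any long-running command that writes to stdout.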

We have not changed the “old” logcat features in test logs: We still run logcat at the end of most jobs and dump the output to the test log. That might be more convenient in some cases.

Are you wondering what “blobuploader” means? Curious about how the AWS upload works? That’s the “blobber” project at work. See http://atlee.ca/posts/blobber-is-live.html and https://air.mozilla.org/intern-presentation-tabara/.

Unfortunately, the Android 2.2 (Tegra) test jobs use an older infrastructure which makes it difficult to implement blobber and complete logcats. There are no logcats-via-blobber for Android 2.2; they are only available for Android 4.0 and the newer Android emulator tests.

Happy test debugging!

Firefox for Android Performance Measures – January check-up

My monthly review of Firefox for Android performance measurements.

January highlights:

- only minor Talos regressions

- Eideticker startup regressions

- inconsistent improvement in many awsy measures

Talos

This section tracks Perfomatic graphs from graphs.mozilla.org for mozilla-central builds of Native Fennec (Android 2.2 opt). The test names shown are those used on tbpl. See https://wiki.mozilla.org/Buildbot/Talos for background on Talos.

tcanvasmark

This test runs the third-party CanvasMark benchmark suite, which measures the browser’s ability to render a variety of canvas animations at a smooth framerate as the scenes grow more complex. Results are a score “based on the length of time the browser was able to maintain the test scene at greater than 30 FPS, multiplied by a weighting for the complexity of each test type”. Higher values are better.

7800 (start of period) – 7800 (end of period).

tcheck2

Measure of “checkerboarding” during simulation of real user interaction with page. Lower values are better.

2.5 (start of period) – 2.7 (end of period)

Jan 16 regression – bug 961869.

trobopan

Panning performance test. Value is square of frame delays (ms greater than 25 ms) encountered while panning. Lower values are better.

110000 (start of period) – 110000 (end of period)

tprovider

Performance of history and bookmarks’ provider. Reports time (ms) to perform a group of database operations. Lower values are better.

375 (start of period) – 375 (end of period).

tsvgx

An svg-only number that measures SVG rendering performance. About half of the tests are animations or iterations of rendering. This ASAP test (tsvgx) iterates in unlimited frame-rate mode thus reflecting the maximum rendering throughput of each test. The reported value is the page load time, or, for animations/iterations – overall duration the sequence/animation took to complete. Lower values are better.

7200 (start of period) – 7500 (end of period).

Regression of Jan 7 – bug 958129.

tp4m

Generic page load test. Lower values are better.

700 (start of period) – 700 (end of period).

ts_paint

Startup performance test. Lower values are better.

4300 (start of period) – 4300 (end of period).

Throbber Start / Throbber Stop

These graphs are taken from http://phonedash.mozilla.org.  Browser startup performance is measured on real phones (a variety of popular devices).

“Time to throbber start” measures the time from process launch to the start of the throbber animation. Smaller values are better.

[Graph: time to throbber start, all devices]

There is so much data here, it is hard to see what is happening – bug 967052. I filtered out many of the devices to get this:

[Graph: time to throbber start, filtered to fewer devices]

I think existing, long-running devices are showing no regressions, and some of the new devices are exhibiting a lot of noise — a problem that :bc is working to correct.

“Time to throbber stop” measures the time from process launch to the end of the throbber animation. Smaller values are better.

[Graph: time to throbber stop]

A similar story here, I think.

But there was a regression for some devices on Jan 24 – bug 964323.

Eideticker

These graphs are taken from http://eideticker.mozilla.org. Eideticker is a performance harness that measures user perceived performance of web browsers by video capturing them in action and subsequently running image analysis on the raw result.

More info at: https://wiki.mozilla.org/Project_Eideticker

Let’s look at our startup numbers this month:

[Eight Eideticker startup graphs]

Regressions noted in bugs 964307 and 966580.

awsy

See https://www.areweslimyet.com/mobile/ for content and background information.

There seems to be an improvement in several of the measurements, but it is inconsistent — it varies from one test run to the next. I wonder what that’s about.
