
Benchmark is overstating precision of the rankings #803

Closed
adamhaile opened this issue Sep 29, 2020 · 3 comments

Comments

@adamhaile
Contributor

adamhaile commented Sep 29, 2020

Creating an issue out of the discussion opened here: #772 (comment).

Currently, this benchmark ranks frameworks using the full precision of the test results. Since there is considerable run-to-run variation in test results (est. from experience ~2% in the geometric mean statistic) and since this run-to-run variation is greater than the actual differences between many frameworks, the rankings are highly unstable. An example would be Surplus, which jumped from 6th in the Chrome 83 rankings to 1st in the Chrome 84 rankings despite no changes to the framework or the implementation (or, as far as I know, relevant changes in the Chrome implementation).
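
For context, the statistic in question is roughly the geometric mean of the slowdown factors an implementation gets on the individual benchmarks. A minimal sketch of that statistic (names and numbers are illustrative, not the benchmark's actual code):

```ts
// Geometric mean of slowdown factors (result / fastest result per benchmark).
// Averaging in log space keeps ratios symmetric and avoids overflow.
function geometricMean(slowdownFactors: number[]): number {
  const logSum = slowdownFactors.reduce((sum, f) => sum + Math.log(f), 0);
  return Math.exp(logSum / slowdownFactors.length);
}

// A ~2% run-to-run wobble in these factors is enough to reorder frameworks
// whose geometric means differ by less than ~2%.
console.log(geometricMean([1.02, 1.10, 1.05])); // ≈ 1.056
```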

Consequently, this benchmark is claiming a capability -- full ordering of the frameworks -- which it does not in fact have the statistical power to provide. To put it simply, the individual rankings are a fiction.

A secondary issue is whether precision equals accuracy, i.e. whether the rankings represent something "true" or "real" about each framework. Because of the usefulness of this benchmark for implementers, there is a risk that implementers may over-fit their frameworks to these tests, resulting in implementations that score better here but are less useful or performant for web development in general. This risk is compounded by the fact that the test requirements inherently cannot be fully defined and rely on a considerable amount of discussion about what's "cheating" and what's "ok." This risk should incline us to be conservative about what differences are presented as meaningful in the rankings.

As a potential solution to the above issues, the proposal here is to:

  • Truncate test results to a single digit of precision before ranking frameworks. This is more conservative than the observed ~2% variation because of the second issue listed above regarding whether precision equals accuracy.
  • Frameworks scoring the same after truncation should be represented as "tied." Sub-ordering within ties should be on a non-meaningful quality, such as alphabetical or random order (see the sketch after this list).
  • Implementers should be able to turn on greater precision -- the proposal would be 3 digits -- for their own testing. This mode should be given a name -- proposal: "meaningless mode" -- to indicate that it should be taken with a grain of salt.
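
A minimal sketch of the proposed grouping, assuming the ranked value is the geometric-mean slowdown factor and that "a single digit of precision" means one decimal place (~10% buckets); names and numbers are illustrative:

```ts
interface Result {
  framework: string;
  slowdown: number; // e.g. geometric mean of slowdown factors
}

// Truncate to `digits` decimal places (1 digit ≈ 10% buckets), then rank by
// bucket. Frameworks in the same bucket share a rank ("tied"); within a tie,
// order alphabetically so the sub-order carries no meaning.
function rankWithTies(results: Result[], digits = 1) {
  const factor = 10 ** digits;
  const bucketed = results.map(r => ({
    framework: r.framework,
    bucket: Math.trunc(r.slowdown * factor) / factor,
  }));
  bucketed.sort((a, b) => a.bucket - b.bucket || a.framework.localeCompare(b.framework));

  let rank = 0;
  let prevBucket = Number.NaN;
  return bucketed.map((r, i) => {
    if (r.bucket !== prevBucket) {
      rank = i + 1; // standard "1-1-3" competition ranking
      prevBucket = r.bucket;
    }
    return { rank, ...r };
  });
}

// 1.04 and 1.06 land in the same bucket and tie at rank 1; 1.14 is ranked 3rd.
console.log(rankWithTies([
  { framework: 'b-framework', slowdown: 1.06 },
  { framework: 'a-framework', slowdown: 1.04 },
  { framework: 'c-framework', slowdown: 1.14 },
]));
```

The bucket width is a single parameter here (`digits`), which also covers the "is 10% too coarse" question below: smaller buckets give finer distinctions but flip more easily from run to run.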

Alternate Proposals

Leave rankings as they are and let people use the "Compare results against one framework" tool to tell which differences are statistically meaningful.

Counter-argument: While the compare tool is very useful and much more statistically sound, providing it does not change the fact that the ranking's primary presentation is a fiction. We should revise that presentation to something supportable.

Use a more sophisticated statistical method than single-digit truncation to group ties

Counter-argument: Because it's difficult to decide how meaningful these results are beyond the benchmark, we should keep our methods simple rather than (too) clever. But if someone has a good idea here, by all means, bring it up.
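
For concreteness, one such method could be to treat two frameworks as tied when the confidence intervals of their per-run scores overlap. This is purely an illustration of the kind of approach meant here, not something the benchmark implements (and note that "overlaps with" is not transitive, which is exactly the sort of extra cleverness the counter-argument warns about):

```ts
interface Runs {
  framework: string;
  scores: number[]; // per-run results for one metric
}

// 95% confidence interval of the mean, using a normal approximation.
function confidenceInterval(scores: number[], z = 1.96): [number, number] {
  const n = scores.length;
  const mean = scores.reduce((s, x) => s + x, 0) / n;
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  const halfWidth = z * Math.sqrt(variance / n);
  return [mean - halfWidth, mean + halfWidth];
}

// Two frameworks are treated as statistically indistinguishable ("tied")
// if their intervals overlap.
function tied(a: Runs, b: Runs): boolean {
  const [aLo, aHi] = confidenceInterval(a.scores);
  const [bLo, bHi] = confidenceInterval(b.scores);
  return aLo <= bHi && bLo <= aHi;
}
```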

10% is too large an interval, we should pick something smaller

Counter-argument: Maybe. Current rankings are equivalent to a bucket size of 0%. The smaller the bucket size, the greater the instability.

Thoughts?

@hville
Contributor

hville commented Sep 30, 2020

Any reason the default is not the median instead of the mean? That would stabilize things right away by discarding the odd freak runs. However, it would also likely benefit implementations with higher variance, since I guess the results are heavily skewed with a long tail of garbage collection bouts.
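
To illustrate with made-up numbers (a handful of stable runs plus one GC-heavy outlier), the mean moves noticeably with the outlier while the median barely does:

```ts
// Made-up run times (ms) for one benchmark: stable runs plus one GC-heavy outlier.
const runs = [102, 104, 103, 101, 105, 148];

const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
const median = (xs: number[]) => {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};

console.log(mean(runs));   // 110.5 (pulled up by the 148 ms outlier)
console.log(median(runs)); // 103.5 (close to a typical run)
```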

@krausest
Owner

krausest commented Oct 5, 2020

Here's my proposal for a better results table:
https://krausest.github.io/js-framework-benchmark/2020/new_result_table.html

I hope the "compare" link below the geometric average is prominent enough. Another change is that it defaults to using the median value. It also filters out all implementations with errors by default.

I thought about truncating the results, but I'm really concerned that some frameworks will flip between groups. Of course the authors of those frameworks will be very unhappy and will object that the grouping creates a wrong impression of precision.

I really hope the new compare link makes it easier to compare frameworks based on statistical data (let's just hope the statistics are calculated correctly...). A framework is considered faster than another if it's faster for at least one benchmark and not slower for the rest.
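
The rule in that last sentence is a pairwise dominance check. A minimal sketch of just that rule, assuming each per-benchmark comparison has already been classified (e.g. by a significance test) as faster, slower, or inconclusive:

```ts
type Outcome = 'faster' | 'slower' | 'inconclusive';

// A is considered faster than B if A is faster on at least one benchmark
// and not slower on any of the others. How each per-benchmark Outcome is
// decided (which statistical test, at what level) is out of scope here.
function isFaster(perBenchmark: Outcome[]): boolean {
  return perBenchmark.includes('faster') && !perBenchmark.includes('slower');
}

console.log(isFaster(['faster', 'inconclusive', 'inconclusive'])); // true
console.log(isFaster(['faster', 'slower', 'inconclusive']));       // false
```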

@krausest
Owner

@adamhaile I'm closing this issue and will go with the enhanced results table.
