
Benchmark is overstating precision of the rankings #803

Closed
adamhaile opened this issue Sep 29, 2020 · 3 comments

Comments

@adamhaile
Contributor

adamhaile commented Sep 29, 2020

Creating an issue out of the discussion opened here: #772 (comment).

Currently, this benchmark ranks frameworks using the full precision of the test results. Since there is considerable run-to-run variation in test results (est. from experience ~2% in the geometric mean statistic) and since this run-to-run variation is greater than the actual differences between many frameworks, the rankings are highly unstable. An example would be Surplus, which jumped from 6th in the Chrome 83 rankings to 1st in the Chrome 84 rankings despite no changes to the framework or the implementation (or, as far as I know, relevant changes in the Chrome implementation).
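
For context, the statistic in question is roughly the geometric mean of the slowdown factors an implementation gets on the individual benchmarks. A minimal sketch of that statistic (names and numbers are illustrative, not the benchmark's actual code):

```ts
// Geometric mean of slowdown factors (result / fastest result per benchmark).
// Averaging in log space keeps ratios symmetric and avoids overflow.
function geometricMean(slowdownFactors: number[]): number {
  const logSum = slowdownFactors.reduce((sum, f) => sum + Math.log(f), 0);
  return Math.exp(logSum / slowdownFactors.length);
}

// A ~2% run-to-run wobble in these factors is enough to reorder frameworks
// whose geometric means differ by less than ~2%.
console.log(geometricMean([1.02, 1.10, 1.05])); // ≈ 1.056
```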

Consequently, this benchmark is claiming a capability -- full ordering of the frameworks -- which it does not in fact have the statistical power to provide. To put it simply, the individual rankings are a fiction.

A secondary issue is whether precision equals accuracy, i.e. whether the rankings represent something "true" or "real" about each framework. Because of the usefulness of this benchmark for implementers, there is a risk that implementers may over-fit their frameworks to these tests, resulting in implementations that score better here but are less useful or performant for web development in general. This risk is compounded by the fact that the test requirements inherently cannot be fully defined and rely on a considerable amount of discussion about what's "cheating" and what's "ok." This risk should incline us to be conservative about what differences are presented as meaningful in the rankings.

As a potential solution to the above issues, the proposal here is to:

  • Truncate test results to a single digit of precision before ranking frameworks. This is more conservative than the observed ~2% variation because of the second issue listed above regarding whether precision equals accuracy.
  • Frameworks scoring the same after truncation should be represented as "tied." Sub-ordering within ties should be on a non-meaningful quality, such as alphabetical or random order (see the sketch after this list).
  • Implementers should be able to turn on greater precision -- the proposal would be 3 digits -- for their own testing. This mode should be given a name -- proposal: "meaningless mode" -- to indicate that it should be taken with a grain of salt.
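
A minimal sketch of the proposed grouping, assuming the ranked value is the geometric-mean slowdown factor and that "a single digit of precision" means one decimal place (~10% buckets); names and numbers are illustrative:

```ts
interface Result {
  framework: string;
  slowdown: number; // e.g. geometric mean of slowdown factors
}

// Truncate to `digits` decimal places (1 digit ≈ 10% buckets), then rank by
// bucket. Frameworks in the same bucket share a rank ("tied"); within a tie,
// order alphabetically so the sub-order carries no meaning.
function rankWithTies(results: Result[], digits = 1) {
  const factor = 10 ** digits;
  const bucketed = results.map(r => ({
    framework: r.framework,
    bucket: Math.trunc(r.slowdown * factor) / factor,
  }));
  bucketed.sort((a, b) => a.bucket - b.bucket || a.framework.localeCompare(b.framework));

  let rank = 0;
  let prevBucket = Number.NaN;
  return bucketed.map((r, i) => {
    if (r.bucket !== prevBucket) {
      rank = i + 1; // standard "1-1-3" competition ranking
      prevBucket = r.bucket;
    }
    return { rank, ...r };
  });
}

// 1.04 and 1.06 land in the same bucket and tie at rank 1; 1.14 is ranked 3rd.
console.log(rankWithTies([
  { framework: 'b-framework', slowdown: 1.06 },
  { framework: 'a-framework', slowdown: 1.04 },
  { framework: 'c-framework', slowdown: 1.14 },
]));
```

The bucket width is a single parameter here (`digits`), which also covers the "is 10% too coarse" question below: smaller buckets give finer distinctions but flip more easily from run to run.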

Alternate Proposals

Leave rankings as they are and let people use the "Compare results against one framework" tool to tell which differences are statistically meaningful.

Counter-argument: While the compare tool is very useful and much more statistically sound, providing it does not change the fact that the ranking's primary presentation is a fiction. We should revise that presentation to something supportable.

Use a more sophisticated statistical method than single-digit truncation to group ties

Counter-argument: Because it's difficult to decide how meaningful these results are beyond the benchmark, we should keep our methods simple rather than (too) clever. But if someone has a good idea here, by all means, bring it up.
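
For concreteness, one such method could be to treat two frameworks as tied when the confidence intervals of their per-run scores overlap. This is purely an illustration of the kind of approach meant here, not something the benchmark implements (and note that "overlaps with" is not transitive, which is exactly the sort of extra cleverness the counter-argument warns about):

```ts
interface Runs {
  framework: string;
  scores: number[]; // per-run results for one metric
}

// 95% confidence interval of the mean, using a normal approximation.
function confidenceInterval(scores: number[], z = 1.96): [number, number] {
  const n = scores.length;
  const mean = scores.reduce((s, x) => s + x, 0) / n;
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  const halfWidth = z * Math.sqrt(variance / n);
  return [mean - halfWidth, mean + halfWidth];
}

// Two frameworks are treated as statistically indistinguishable ("tied")
// if their intervals overlap.
function tied(a: Runs, b: Runs): boolean {
  const [aLo, aHi] = confidenceInterval(a.scores);
  const [bLo, bHi] = confidenceInterval(b.scores);
  return aLo <= bHi && bLo <= aHi;
}
```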

10% is too large an interval, we should pick something smaller

Counter-argument: Maybe. Current rankings are equivalent to a bucket size of 0%. The smaller the bucket size, the greater the instability.

Thoughts?

@hville
Contributor

hville commented Sep 30, 2020

Any reason the default is not the median instead of the mean? That would stabilize things right away by discarding the odd freak runs. However, it would also likely benefit implementations with higher variance, since I guess the results are heavily skewed with a long tail of garbage collection bouts.
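
To illustrate with made-up numbers (a handful of stable runs plus one GC-heavy outlier), the mean moves noticeably with the outlier while the median barely does:

```ts
// Made-up run times (ms) for one benchmark: stable runs plus one GC-heavy outlier.
const runs = [102, 104, 103, 101, 105, 148];

const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
const median = (xs: number[]) => {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};

console.log(mean(runs));   // 110.5 (pulled up by the 148 ms outlier)
console.log(median(runs)); // 103.5 (close to a typical run)
```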

@krausest
Owner

krausest commented Oct 5, 2020

Here's my proposal for a better results table:
https://krausest.github.io/js-framework-benchmark/2020/new_result_table.html

I hope the "compare" link below the geometric average is prominent enough. Another change is that it defaults to using the median value. It also filters out all implementations with errors by default.

I thought about truncating the results, but I'm really concerned that some frameworks will flip between groups. Of course the authors of those frameworks will be very unhappy and will object that the grouping creates a wrong impression of precision.

I really hope the new compare link makes it easier to compare frameworks based on statistical data (let's just hope the statistics are calculated correctly...). A framework is considered faster than another if it's faster for at least one benchmark and not slower for the rest.
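
The rule in that last sentence is a pairwise dominance check. A minimal sketch of just that rule, assuming each per-benchmark comparison has already been classified (e.g. by a significance test) as faster, slower, or inconclusive:

```ts
type Outcome = 'faster' | 'slower' | 'inconclusive';

// A is considered faster than B if A is faster on at least one benchmark
// and not slower on any of the others. How each per-benchmark Outcome is
// decided (which statistical test, at what level) is out of scope here.
function isFaster(perBenchmark: Outcome[]): boolean {
  return perBenchmark.includes('faster') && !perBenchmark.includes('slower');
}

console.log(isFaster(['faster', 'inconclusive', 'inconclusive'])); // true
console.log(isFaster(['faster', 'slower', 'inconclusive']));       // false
```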

@krausest
Owner

@adamhaile I'm closing this issue and will go with the enhanced results table.
