Benchmark is overstating precision of the rankings #803
Any reason the default is not the median?
Here's my proposal for a better results table: I hope the "compare" link below the geometric average is prominent enough. Another change is that it defaults to using the median value. It also filters out all implementations with errors by default. I thought about truncating the results, but I'm really concerned that some frameworks will flip between groups. Of course the authors of those frameworks will be very unhappy and will object that the grouping creates a wrong impression of precision. I really hope the new compare link makes it easier to compare frameworks based on statistical data (let's just hope the statistics are calculated correctly...). A framework is considered faster than another if it's faster for at least one benchmark and not slower for the rest.
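Expressed as code, that comparison rule might look roughly like the sketch below. The array layout and function names are assumptions, not the actual results-table code, and in a real table "faster"/"slower" on a single benchmark would be decided by a statistical test rather than a raw comparison.

```ts
// Minimal sketch of the pairwise rule: A is "faster" than B if it wins at
// least one benchmark and does not lose any. Assumes each framework's results
// are per-benchmark median durations (ms) in the same order.
type PerBenchmarkMedians = number[];

function isFasterThan(a: PerBenchmarkMedians, b: PerBenchmarkMedians): boolean {
  let fasterOnSomeBenchmark = false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] > b[i]) return false;          // slower on at least one benchmark
    if (a[i] < b[i]) fasterOnSomeBenchmark = true;
  }
  return fasterOnSomeBenchmark;             // faster somewhere, not slower anywhere
}
```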
@adamhaile I'm closing this issue and will go with the enhanced results table. |
Creating an Issue out of the discussion opened here: #772 (comment).
Currently, this benchmark ranks frameworks using the full precision of the test results. Since there is considerable run-to-run variation in those results (estimated from experience at roughly 2% in the geometric mean statistic), and since this variation is larger than the actual differences between many frameworks, the rankings are highly unstable. An example would be Surplus, which jumped from 6th in the Chrome 83 rankings to 1st in the Chrome 84 rankings despite no changes to the framework or the implementation (or, as far as I know, any relevant changes in Chrome itself).
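For reference, the ranking statistic under discussion is essentially a geometric mean of each framework's normalized benchmark results, so a ~2% run-to-run wobble in this single number is enough to reorder frameworks whose true differences are smaller than that. A minimal sketch, assuming the results are already normalized (this is not the benchmark's actual table code):

```ts
// Geometric mean of a framework's normalized benchmark results. A run-to-run
// shift of ~2% in this value can swap the ranks of frameworks whose real
// difference is below that noise level.
function geometricMean(normalizedResults: number[]): number {
  const logSum = normalizedResults.reduce((sum, x) => sum + Math.log(x), 0);
  return Math.exp(logSum / normalizedResults.length);
}
```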
Consequently, this benchmark is claiming a capability -- full ordering of the frameworks -- which it does not in fact have the statistical power to provide. To put it simply, the individual rankings are a fiction.
A secondary issue is whether precision equals accuracy, i.e. whether the rankings represent something "true" or "real" about each framework. Because this benchmark is useful to implementers, there is a risk that implementers may over-fit their frameworks to these tests, resulting in implementations that score better here but are less useful or performant for web development in general. This risk is compounded by the fact that the test requirements inherently cannot be fully defined, relying instead on a considerable amount of discussion about what's "cheating" and what's "ok." This risk should incline us to be conservative about which differences are presented as meaningful in the rankings.
As a potential solution to the above issues, the proposal here is to truncate the reported results to a single digit, grouping frameworks whose geometric means fall within the same roughly 10% bucket and presenting them as tied rather than strictly ordered.
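A hypothetical sketch of that grouping step follows; the function name and the exact truncation rule are assumptions derived from the "single-digit truncation" and "10%" wording below, not a definitive implementation.

```ts
// Truncate a normalized geometric-mean score to one digit after the decimal
// point, so e.g. 1.23 and 1.27 both land in the 1.2 bucket and are shown as
// tied. With scores near 1.0 this corresponds to roughly a 10% bucket size.
function truncateToBucket(score: number): number {
  return Math.floor(score * 10) / 10;
}
```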
Alternate Proposals
Leave rankings as they are and let people use the "Compare results against one framework" tool to tell which differences are statistically meaningful (a sketch of such a per-benchmark significance test follows this list).
Counter-argument: While a very useful tool and also one that is much more statistically sound, providing it does not change the fact that the ranking's primary presentation is a fiction. We should revise that presentation to something supportable.
Use a more sophisticated statistical method than single-digit truncation to group ties
Counter-argument: Because it's difficult to decide how meaningful these results are beyond the benchmark, we should keep our methods simple rather than (too) clever. But if someone has a good idea here, by all means bring it up.
10% is too large an interval, we should pick something smaller
Counter-argument: Maybe. Current rankings are equivalent to a bucket size of 0%. The smaller the bucket size, the greater the instability.
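As referenced in the first alternate proposal above, here is a rough sketch of what a per-benchmark significance test could look like. This is not the actual "Compare results against one framework" implementation; it uses Welch's t statistic with a crude |t| > 2 cutoff, where a real version would consult the t-distribution with Welch-Satterthwaite degrees of freedom.

```ts
// Rough Welch's t-test over the raw sample durations of one benchmark for two
// frameworks. |t| > 2 is a crude ~95% cutoff for moderate sample sizes.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

function significantlyDifferent(a: number[], b: number[]): boolean {
  const t =
    (mean(a) - mean(b)) /
    Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
  return Math.abs(t) > 2;
}
```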
Thoughts?