Shift in performance characteristic between 1.6.2 and master #848
cc/ @stjepang |
Ok this actually seems like great news. I'm pretty confident I know what's going on and am hopeful that we can get the best of both worlds! I'll make an experiment later today and make a branch of async-std so we can compare it to 1.6.2 and master. Briefly:
|
I totally agree :) again, this wasn't meant as a complaint at all. We are mostly on the left side (UMA + few cores) anyway in production, so the speed boost here is absolutely great! I just have the Threadripper to work/run benchmarks on and figured the change in behavior is something worth sharing since, if nothing else, it's interesting and perhaps not obvious. BTW, great work on the changes to master, this comes really really close to thread performance <3. As a side note: master is within 10-15% of what we used to get with ideally laid out threaded code :) (i.e. 3 threads 3 cores, each working over a |
@Licenser Could you perhaps try this branch? https://github.com./stjepang/async-std/tree/experiment |
Sorry for the late reply, the tests take a while :) I re-ran the 1.6.2 branch since I did some CPU fiddling yesterday evening and wanted to make sure it's a fair comparison. The general bump in throughput is due to that (the aforementioned CPU fiddling), but the delta between the 48 thread variants looks slightly lower.
1.6.2 (throughput in MB/s)
experiment (throughput in MB/s)
ratio 1.6.2 / candidate
perf: for completeness I added a perf recording of the 64 / 48 run
|
Thank you for such a detailed report! Two more questions:
|
No worries, I appreciate the help! smol and async-std are incredible :D
|
Do you know how to fix this error?
|
Sorry about that, you've got to tell tremor where its standard lib is. This should work:
|
Thank you, that worked! :) I haven't made much progress, however, because compilation is really really slow. Do you think it would be possible to make a benchmark that does roughly the same thing, but in a simpler program that is quicker to compile? |
I'll give it a try tomorrow morning. A simple deserialize on one side, run some script in the middle, serialize on the other might be good enough (famous last words). I'll see how that goes; it lacks a bit of fidelity but perhaps it models it well enough |
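For illustration, here is a minimal sketch of what such a simplified pipeline could look like, assuming async-std (1.8 or later, for the stable `channel` module) and serde_json. This is not the actual test code; the constants `EVENTS` and `PAYLOAD` are made up, it just shows the deserialize -> process -> serialize shape with each stage as its own task:

```rust
use async_std::channel::bounded;
use async_std::task;
use std::time::Instant;

const EVENTS: usize = 1_000_000;
const PAYLOAD: &str = r#"{"name":"event","value":1}"#;

fn main() {
    task::block_on(async {
        let (raw_tx, raw_rx) = bounded::<String>(64);
        let (val_tx, val_rx) = bounded::<serde_json::Value>(64);
        let (proc_tx, proc_rx) = bounded::<serde_json::Value>(64);

        // Stage 1: deserialize incoming JSON strings.
        task::spawn(async move {
            while let Ok(raw) = raw_rx.recv().await {
                let v: serde_json::Value = serde_json::from_str(&raw).unwrap();
                if val_tx.send(v).await.is_err() {
                    break;
                }
            }
        });

        // Stage 2: "process" - a trivial stand-in for the mini interpreter.
        task::spawn(async move {
            while let Ok(mut v) = val_rx.recv().await {
                let n = v["value"].as_i64().unwrap_or(0);
                v["value"] = serde_json::json!(n + 1);
                if proc_tx.send(v).await.is_err() {
                    break;
                }
            }
        });

        // Stage 3: serialize and count the output bytes.
        let sink = task::spawn(async move {
            let mut bytes = 0usize;
            while let Ok(v) = proc_rx.recv().await {
                bytes += serde_json::to_string(&v).unwrap().len();
            }
            bytes
        });

        // Feed events through the pipeline and report throughput.
        let start = Instant::now();
        for _ in 0..EVENTS {
            raw_tx.send(PAYLOAD.to_string()).await.unwrap();
        }
        drop(raw_tx);
        let bytes = sink.await;
        let secs = start.elapsed().as_secs_f64();
        println!("{:.2} MB/s", bytes as f64 / (1024.0 * 1024.0) / secs);
    });
}
```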
Heya I added a little (very simplified) test:
You can run it and, with a bit of text editing, it spits out some throughput measurements. It's not a true replication of the original benchmark: it leaves out the whole return channel shenanigans we're doing, as well as the mini interpreter we run against each event. The differences are not as big but I think we can observe the same pattern: in 1.6.2 low core count / shared cache is slower but high core count / no shared cache is faster compared to the experiment branch.
experiment
1.6.2
-- edit below: threads -- For the sake of it I translated the benchmark to threads as well (same repo, code is here: |
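For reference, a rough sketch of what the threaded translation could look like (again an illustration, not the repo code): the same three stages as OS threads connected by std::sync::mpsc channels, so each stage can be pinned to a core if desired:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Instant;

fn main() {
    let (raw_tx, raw_rx) = sync_channel::<String>(64);
    let (val_tx, val_rx) = sync_channel::<serde_json::Value>(64);
    let (proc_tx, proc_rx) = sync_channel::<serde_json::Value>(64);

    // Stage 1: deserialize.
    thread::spawn(move || {
        for raw in raw_rx {
            let v: serde_json::Value = serde_json::from_str(&raw).unwrap();
            if val_tx.send(v).is_err() {
                break;
            }
        }
    });

    // Stage 2: "process".
    thread::spawn(move || {
        for mut v in val_rx {
            let n = v["value"].as_i64().unwrap_or(0);
            v["value"] = serde_json::json!(n + 1);
            if proc_tx.send(v).is_err() {
                break;
            }
        }
    });

    // Stage 3: serialize and count output bytes.
    let sink = thread::spawn(move || {
        let mut bytes = 0usize;
        for v in proc_rx {
            bytes += serde_json::to_string(&v).unwrap().len();
        }
        bytes
    });

    let start = Instant::now();
    for _ in 0..1_000_000 {
        raw_tx.send(r#"{"name":"event","value":1}"#.to_string()).unwrap();
    }
    drop(raw_tx);
    let bytes = sink.join().unwrap();
    let secs = start.elapsed().as_secs_f64();
    println!("{:.2} MB/s", bytes as f64 / (1024.0 * 1024.0) / secs);
}
```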
Since it ended up not needing any dependencies on tremor-specific code I moved this out to its own repo (less cloning / .lock stuff): https://github.com./wayfair-tremor/rt-bench |
Some thoughts... Looking at the numbers again, it seems that 3 cores are always faster than 48 cores :) So perhaps being less efficient makes you faster at higher core counts (ironically). I don't have proof for this hypothesis, but it might be why v1.6.2 is winning at the 48c benchmarks. |
I've been thinking about that too, but I'm not super sure how to best approach it. I think to automate that, tasks would need to "understand" how they are related, i.e. if task-a and task-b communicate they probably want to share a NUMA executor; if they don't, they probably don't. It would be super cool to be able to describe those relations, but to be fair I'm not sure if the overhead of that would slow things down (due to extra logic etc.). The whole thing gets a bit more complicated with systems like Ryzen where NUMA isn't enough to explain the architecture. Even though a CCD exposes itself as a single NUMA node, the two CCXs don't share all the caches, so having a task move between CCXs inside a CCD would still cause a slowdown. I found no way to determine that other than measuring distance with crossbeam-channels (see: https://github.com./Licenser/trb), so if there are NUMA-aware executors, perhaps making the grouping of cores pluggable would be nice. That way it would be possible to write discoverers to create groupings. |
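To illustrate the "measure distance with channels" idea (a sketch under assumptions, not the actual trb code): pin two threads to two cores with the core_affinity crate and time a crossbeam-channel ping-pong between them; core pairs on the same CCX should show noticeably lower round-trip times than pairs on different CCXs/CCDs. The `pair_latency` helper and `ROUNDS` constant are made up for this example, and core numbering is platform specific:

```rust
use std::thread;
use std::time::Instant;

const ROUNDS: u32 = 100_000;

// Measure the average round-trip time (in ns) between two pinned cores.
fn pair_latency(a: core_affinity::CoreId, b: core_affinity::CoreId) -> f64 {
    // Zero-capacity (rendezvous) channels force a handoff on every message.
    let (ping_tx, ping_rx) = crossbeam_channel::bounded::<u32>(0);
    let (pong_tx, pong_rx) = crossbeam_channel::bounded::<u32>(0);

    // Echo thread pinned to core `b`.
    let echo = thread::spawn(move || {
        core_affinity::set_for_current(b);
        for msg in ping_rx {
            if pong_tx.send(msg).is_err() {
                break;
            }
        }
    });

    // Measuring thread (this one) pinned to core `a`.
    core_affinity::set_for_current(a);
    let start = Instant::now();
    for i in 0..ROUNDS {
        ping_tx.send(i).unwrap();
        pong_rx.recv().unwrap();
    }
    let per_round = start.elapsed().as_nanos() as f64 / ROUNDS as f64;
    drop(ping_tx);
    echo.join().unwrap();
    per_round
}

fn main() {
    let cores = core_affinity::get_core_ids().expect("could not query core ids");
    let first = cores[0];
    for &other in &cores[1..] {
        println!(
            "core {} <-> core {}: {:.0} ns/round-trip",
            first.id,
            other.id,
            pair_latency(first, other)
        );
    }
}
```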
I'm curious: when deploying this to production, do you use all available cores or just a small number of them (which I would expect to deliver the best performance)? Some interesting reading:
EDIT: fixed the second link |
The answer to that is tricky, so the answer on deployment will be a bit longer ;) tremor.rs (where we use smol/async-std) is open source, so we often have no control over the deployment. Internally at Wayfair, we currently deploy mostly on VMs with small core numbers (2c4t), so they will be on the same NUMA node. (That deployment strategy isn't ideal for everyone.) We recently moved from threads to async (some writeup on that: https://www.tremor.rs/blog/2020-08-06-to-async-or-not-to-async/ ) with the goal of introducing parallelism where possible, i.e. instead of having 3 threads on a system we might have 30 tasks. Naive thread implementations don't scale well; async and async-std/smol do a lot of the heavy lifting (yay, thank you! <3). For benchmarks, we run on both Intel and AMD systems to get a feeling for how the software behaves on each, given the characteristics of the platforms are different enough now that simply optimizing for one might bite you on the other. And the differences are quite interesting, especially since, unlike NUMA, they're not easily / automatically discoverable. I've read the 3rd article, can't access the 2nd (thanks google!) and the 1st one is new :) I'll go take a read. |
cc @mfelsche |
NOTE: Not a complaint, just an observation that might be useful to discuss / design / improve the scheduling model.
Hi, I ran our benchmarks with 1.6.2 and current master after #836 was merged, and I noticed a rather significant change in performance characteristics.
The following two tables show the throughput in MB/s of the benchmark at different numbers of concurrent tasks and different CPU setups.
The benchmarks are run on a Ryzen Threadripper 24 core (48 thread) CPU. It's worth noting that this CPU houses (simplified) 8 CCXs (3 cores / 6 threads each) that share their cache internally; otherwise they communicate over an internal bus (somewhat acting as different NUMA domains).
So in the 3 core test, the three cores are specifically picked to be on the same CCX, meaning they share their cache; in the 48 core test all cores (or threads) are used, meaning no caches are shared by default.
It looks like in 1.6.2 related tasks were rather likely to be scheduled on the same executor (or an executor on the same CCX), so the 3 steps in the benchmark, deserialize -> process -> serialize, were more often executed on the same CCX (leading to higher performance in the 48 thread case), while overall being less performant (in the 3 core case). master switches this around: it seems to more often than not schedule related tasks on new cores, leading to (sometimes much) lower performance in the 48 core case but much better performance on the limited core set with the shared cache.
We observed the same with threads (to a stronger degree): threads outperformed tasks on the 3 core setup, but unless pinned they would be scheduled around so much that the loss of cache locality would obliterate performance.
It seems that master is very friendly to UMA systems (or generally systems where all cores share cache) but less friendly to NUMA systems that don't have that luxury (counting the Ryzen platform here even though it's a bit of a special case).
1.6.2 (throughput in MB/s)
master (throughput in MB/s)