[SYCL][OPT] Fix reorder optimization for Q4_0 #13003
Conversation
I think one more TODO is to remove setting tensor->extra in llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, lines 340 to 344 at 2f74c35.
Thanks for the PR, we'll have a look! Please make sure to keep this PR in review until we have time to review it.
I agree, however the suggested solution of following the logic from ggml-cpu-aarch64 also sets the tensor extra (llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp, lines 6323 to 6328 at b9154ec). It's not clear to me how this could be avoided at this stage.
Yes, we will wait for your review. Yes, I have added slaren's suggestion: consider reordering the tensor when it is loaded from GGUF. It depends on all Q4_0 cases being supported (the first item).
Yes. I think this solution depends on all Q4_0 cases supporting the reorder.
Please test the PR with your Q4_0 LLMs.
The CPU backend uses extras for simplicity, but if there is no extra data that needs to be stored per-tensor, you can rely on the buffer type alone to determine if the tensor data is reordered.
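As a rough sketch of this suggestion (the reorder buffer-type getter below is a hypothetical name, not an existing ggml-sycl symbol), the check could be derived from the buffer type instead of a per-tensor extra:

```cpp
#include "ggml-backend.h"

// Hypothetical: a dedicated SYCL buffer type whose tensors are always stored
// in the reordered Q4_0 layout. Not an existing API.
ggml_backend_buffer_type_t ggml_backend_sycl_reorder_buffer_type(void);

// With such a buffer type, "is this tensor reordered?" no longer needs
// tensor->extra; the buffer type alone answers it.
static bool tensor_is_reordered(const struct ggml_tensor * t) {
    return t->buffer != NULL &&
           ggml_backend_buffer_get_type(t->buffer) == ggml_backend_sycl_reorder_buffer_type();
}
```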
Apart from the performance concerns @Rbiessy already mentioned, this resolves my concerns about running reorders for all relevant cases. (Link to the discussion for convenience: #5277 (reply in thread))
@Rbiessy
A: Yes.
A: The temporary buffer is released after the reorder finishes. Its size is the same as the current Q4 tensor (see the sketch after this comment).
A: TG is more important than PP in customer cases. For bigger LLMs (like llama2-7B) on dGPUs (Arc, BMG/PVC), TG is improved by 20%-70% by this feature. A feature that improves both at the same time is in fact very hard to achieve.
The first reorder PR led to wrong results for some Q4_0 LLMs. This PR fixes the issue and won't impact the results of any Q4_0 LLM. For normal users, this feature will be ignored if they don't read the guide.
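To make the temporary-buffer point concrete, here is a minimal sketch of the shape of such a reorder step; the kernel name and the exact in-place write-back are illustrative assumptions, not the actual ggml-sycl implementation:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical kernel that writes the reordered Q4_0 layout of src into dst.
void reorder_q4_0_kernel(sycl::queue & q, const void * src, void * dst, size_t nbytes);

// Sketch only: reorder a Q4_0 tensor in place through a scratch buffer that is
// the same size as the tensor and is freed as soon as the reorder finishes.
static void reorder_q4_0_in_place(sycl::queue & q, void * data, size_t nbytes) {
    void * tmp = sycl::malloc_device(nbytes, q);  // temporary buffer, same size as the tensor
    reorder_q4_0_kernel(q, data, tmp, nbytes);    // write the reordered layout into the scratch buffer
    q.memcpy(data, tmp, nbytes).wait();           // copy the reordered data back in place
    sycl::free(tmp, q);                           // released right after the reorder
}
```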
For normal users, this feature will be ignored if they don't read the guide. Users like out-of-the-box (OOB) features. I suggest enabling this feature by default.
I agree with this. The only reason we disabled the reorder by default was that it broke some user models. Even if we lose quite a bit of performance in prompt processing, from the user perspective it feels better when using llama-cli.
However, I don't think this implementation should be final. We are most likely worsening other metrics that we are not really measuring, like the time to first token, which also affects the user perspective; this relates to @Rbiessy's concerns.
@NeoZhangJianyu can you explain further what the issue was with the reordered tensor and the reorder OP not matching? A Q4_0 tensor being used in a different operator?
IMO both PP and TG are required for inference. Bad PP performance will result in a bad experience, especially with long-context LLMs. My suggestion is to disable the reorder optimization by default until we find a solution to fix PP. I agree with @Rbiessy here.
The numbers I have observed are not "bad" PP, but slightly worse than what we had. Less powerful systems are going to notice it more, but only for the first prompt that is processed and in benchmarks. It would, however, have an impact from starting the application to actually starting to run things, so if @Rbiessy and @qnixsynapse disagree and notice the performance impact I wouldn't push for the merge.
Are we sure the performance drop in PP is due to the first call to …? I agree the extra memory usage should be fine since this is just for the first run. I'd suggest that in this PR we either don't enable the Q4_0 reorder by default or we disable the reorder optimization for the mul_mat case.
In the previous solution, the Q4_0 tensors were reordered by going through all nodes in a model, and then mul_mat_reorder() (for example) was executed in the mul_mat() function based on a condition. Because mul_mat_reorder() can't support all src0 and src1 combinations, we can't reorder all Q4_0 tensors. The condition for reordering a tensor should be the same as the condition for executing mul_mat_reorder() in mul_mat(). In this PR, I removed the tensor reordering from the other function. Currently, mul_mat_reorder() is implemented in two legacy functions.
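To illustrate the structure (the predicate and helper names here are illustrative, not the actual ggml-sycl symbols): the reorder is now triggered at OP execution time by the very same condition that selects the reordered kernel, so the two can no longer diverge:

```cpp
// Sketch of the approach, not the real implementation: reorder lazily inside
// the mul_mat path, guarded by the exact condition that picks the reordered kernel.
static void sycl_mul_mat_q4_0(ggml_backend_sycl_context & ctx,
                              const ggml_tensor * src0, const ggml_tensor * src1,
                              ggml_tensor * dst) {
    // supports_reorder() is a hypothetical predicate covering only the src0/src1
    // combinations that the reordered kernel actually handles.
    if (supports_reorder(src0, src1, dst)) {
        reorder_q4_0_once(ctx, src0);            // hypothetical: no-op if already reordered
        mul_mat_reorder(ctx, src0, src1, dst);   // reordered Q4_0 kernel path
    } else {
        mul_mat_plain(ctx, src0, src1, dst);     // original kernel path
    }
}
```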
dequantize_mul_mat_vec() is the performance bottleneck in common LLMs, like llama2. With this PR, the wrong result of mul_mat() for Q4_0 is fixed. Otherwise, users will turn to other backends, since all optimizations are enabled by default there.
Thanks a lot for the explanation.
@NeoZhangJianyu Yes, I understand that, but you have not answered my question above, which is: how about we only use the reordered Q4_0 format for dequantize_mul_mat_vec, as this seems to always improve performance? I am not convinced that the reorder layout should be used for matrix multiplication (i.e. when it is used inside …).
The reorder method could work for other kernel functions, like dequantize_mul_mat_vec() in ggml_sycl_op_mul_mat_sycl(). As the test results show, it reduces PP, but that only happens once; the next PP run is not impacted.
I looked into my suggestion myself, measuring the impact of disabling the reorder optimization for matrix-matrix multiplications only, with the patch below:

```diff
diff --git i/ggml/src/ggml-sycl/ggml-sycl.cpp w/ggml/src/ggml-sycl/ggml-sycl.cpp
index 22927338b..2c7aac628 100644
--- i/ggml/src/ggml-sycl/ggml-sycl.cpp
+++ w/ggml/src/ggml-sycl/ggml-sycl.cpp
@@ -2909,7 +2909,7 @@ static void opt_for_reorder(ggml_backend_sycl_context * ctx, const ggml_tensor *
         ctx->opt_feature.reorder && //allow this device due to good perf, skip the devices with bad perf.
         dst->op == GGML_OP_MUL_MAT && //limit to some supported cases of Q4_0, to do for more cases.
         src0->type == GGML_TYPE_Q4_0 &&
-        src1->ne[2]==1 && src1->ne[3]==1) {
+        src1->ne[1] == 1 && src1->ne[2]==1 && src1->ne[3]==1) {
         ggml_tensor_extra_gpu* extra = (ggml_tensor_extra_gpu*)src0->extra;
         if (!extra) return; //only happen in CI/UT permute case.
```

This did not impact PP results. Also, for some reason, I am not able to reproduce the performance regression on PP that I mentioned in #13003 (comment) anymore. I use 100 iterations for this benchmark, but it does not seem to be enough. For reference, here are the results running on B580 again with the same changes, which should match the third table from #13003 (comment):
Anyway, I'm happy with these changes, thanks for the patch.
@Rbiessy What do you think about this code?
@NeoZhangJianyu in the models I have been running, I have not found cases where matrix-matrix multiplications are a bottleneck for TG, only matrix-vector multiplications. This is why I wanted us to try only enabling the reorder optimization for matrix-vector multiplication. I was worried the reorder optimization with matrix-matrix multiplications could somehow perform worse in some cases (and not just during the first iteration). As I said, this looks fine to me now, so feel free to merge the PR if you think it is ready.
Thank you all for your support!
Idea: change the rule for reordering Q4_0 tensors: move it from the initial graph_compute() to the execution of the OPs.
- Fix the issue where the reordered tensor and the reorder OP don't match, which led to wrong results in some LLMs.
- Tested with pythia-1.4b-Q4_0.gguf.
- Set the reorder optimization feature as the default, since the known issues are fixed.
- Remove an unused global variable.
- Fix the bug where tensors were not reordered in a second call to graph_compute() on the same context. It impacts the UT results: some UT cases can't test the reorder feature.
Todo: