
opencl: improve profiling #12442


Merged · 4 commits merged into ggml-org:master on Mar 18, 2025

Conversation

lhez
Contributor

@lhez lhez commented Mar 18, 2025

  • Wait for profiling events and collect profiling data only after model execution is done. This way, the displayed performance numbers are closer to the true performance.
  • Generate a Chrome trace in addition to the CSV (a rough sketch of both ideas follows below).
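
A rough sketch of how these two pieces typically fit together (illustrative C, not the PR's actual code; the struct and function names are assumptions): keep the cl_event returned for each enqueued kernel, and only after the whole graph has executed, wait on the events, read their start/end timestamps, and emit them as Chrome-trace "X" events (viewable in chrome://tracing or Perfetto) alongside the CSV.

  #include <CL/cl.h>
  #include <stdio.h>

  typedef struct {
      const char * name; // kernel name
      cl_event     evt;  // event returned by clEnqueueNDRangeKernel
  } ProfiledKernel;

  // Write one Chrome-trace complete ("X") event per profiled kernel.
  static void dump_chrome_trace(const ProfiledKernel * kernels, int n, const char * path) {
      FILE * f = fopen(path, "w");
      if (!f) return;
      fprintf(f, "[\n");
      for (int i = 0; i < n; i++) {
          cl_ulong t0 = 0, t1 = 0;
          clWaitForEvents(1, &kernels[i].evt); // profiling info is valid once the event completes
          clGetEventProfilingInfo(kernels[i].evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
          clGetEventProfilingInfo(kernels[i].evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
          // OpenCL reports nanoseconds; Chrome trace timestamps/durations are microseconds
          fprintf(f, "  {\"name\":\"%s\",\"ph\":\"X\",\"ts\":%llu,\"dur\":%llu,\"pid\":0,\"tid\":0}%s\n",
                  kernels[i].name,
                  (unsigned long long)(t0 / 1000),
                  (unsigned long long)((t1 - t0) / 1000),
                  i + 1 < n ? "," : "");
      }
      fprintf(f, "]\n");
      fclose(f);
  }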

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 18, 2025
@lhez lhez marked this pull request as ready for review March 18, 2025 04:50
@zhouwg
Contributor

zhouwg commented Mar 18, 2025

Sorry to bother you, but how can I mark a specific PR as ready for review? Thanks.
One more thing: could a Qualcomm expert help review a PR for a Qualcomm QNN backend for llama.cpp from an independent programmer (that's me; I have already bought a Snapdragon 8 Gen 3 Android phone for personal learning and development, and plan to buy a Snapdragon 8 Elite phone for further development later)? I know your team already has a senior technical expert who knows everything about ggml-qnn, and I hope I can get help from your colleague. Thanks very much.

Collaborator

@max-krasnyansky max-krasnyansky left a comment


Looks good, and it reminds me that I wanted to integrate the graph-profiler branch with the opencl and cpu backends.

@max-krasnyansky max-krasnyansky merged commit d84635b into ggml-org:master Mar 18, 2025
46 of 47 checks passed
@max-krasnyansky
Collaborator

Sorry to bother you, but how can I mark a specific PR as ready for review? Thanks.

If you start a draft PR there is a way to mark it as ready. The qnn-backend PR is not marked as a draft.

One more thing: could a Qualcomm expert help review a PR for a Qualcomm QNN backend for llama.cpp from an independent programmer (that's me; I have already bought a Snapdragon 8 Gen 3 Android phone for personal learning and development, and plan to buy a Snapdragon 8 Elite phone for further development later)? I know your team already has a senior technical expert who knows everything about ggml-qnn, and I hope I can get help from your colleague. Thanks very much.

I've been keeping an eye on it. In general, I'd say QNN is not the right solution here, but I'll take another look.

@zhouwg
Contributor

zhouwg commented Mar 19, 2025

Thanks for your valuable time and attention to ggml-qnn (a Qualcomm backend for llama.cpp via the Qualcomm QNN SDK) and my third formal ggml-qnn PR.

  1. is "QNN is not the right solution here" means we can't utilize the Hexagon NPU maximally through Qualcomm's QNN SDK in llama.cpp? even through the "second tech approach(mapping the entire ggml cgraph to a single QNN graph)"?

  2. that's why the GPU performance in that PR(or all similar PR) is so bad at the same time Intel's ggml-sycl is so good? accordingly, that's why Qualcomm provide ggml-opencl for Qualcomm's GPU backend(right solution for Qualcomm GPU backend in llama.cpp)?

  3. that's why the NPU performance in that PR(or all similar PR) is so bad?

  4. that's why Qualcomm didn't provide an official ggml-qnn PR through the general approach in Intel's ggml-sycl or the second tech approach(mapping the entire ggml cgraph to a single QNN graph) with QNN SDK currently?

  5. from this diagram,
    qualcomm-qnn-sdk

we can see the key-point of "how to utilize the Hexagon NPU maximally through QNN SDK" is that we should compose an ideal single QNN graph from an entire/complete ggml cgraph. as my personal understanding, that's exactly the core principle(Qualcomm's dedicated binary tool convert a specified LLM model to a prepared QNN graph) in QNN SampleApp or Genie stack.

I still have some technical questions:

  • why "QNN is not the right solution here"? is there any other AI software stack or SDK provided by Qualcomm for llama.cpp's Hexagon NPU backend?
  • or the first tech approach(general approach in Intel's ggml-sycl) and the second tech approach(mapping a complete/entire ggml cgraph to a single QNN graph through QNN SDK) in that PR are both incorrect?
  • or the second tech approach(mapping the entire/complete ggml cgraph to a single QNN graph through QNN SDK) is correct path but the practice in that PR is not exactly correct? I understand the shared-buffer or memory-pool also should be used in the second tech approach.

I, and the wider llama.cpp community, would appreciate your further guidance and help.

Thanks so much again.

@zhouwg
Contributor

zhouwg commented Mar 20, 2025

@max-krasnyansky, thanks so much for your valuable guidance and course correction.

I think I now understand the third technical approach to "utilizing the Hexagon NPU maximally": the Hexagon DSP SDK should be used. That is very similar to what your excellent engineering team did with ggml-opencl, and also similar to what I did with video decoding hardware acceleration many years ago (that was also a DSP chip).

My guess might not be correct, so I would greatly appreciate it if you could give me and the llama.cpp community a clear explanation, or at least a rough confirmation, of this third technical approach.

@zhouwg
Contributor

zhouwg commented Mar 21, 2025

Sorry to bother you, but how can I mark a specific PR as ready for review? Thanks.

If you start a draft PR there is a way to mark it as ready. The qnn-backend PR is not marked as a draft.

One more thing: could a Qualcomm expert help review a PR for a Qualcomm QNN backend for llama.cpp from an independent programmer (that's me; I have already bought a Snapdragon 8 Gen 3 Android phone for personal learning and development, and plan to buy a Snapdragon 8 Elite phone for further development later)? I know your team already has a senior technical expert who knows everything about ggml-qnn, and I hope I can get help from your colleague. Thanks very much.

I've been keeping an eye on it. In general, I'd say QNN is not the right solution here, but I'll take another look.

Yes, you are absolutely correct, thanks so much. I'll try to test it on a rooted 8 Gen 3 phone to get better NPU performance (closer to, or better than, the default ggml backend).

(screenshots attached)

[updated on 03/21/2025, 23:54] Offloading to the cDSP directly is indeed much faster than the QNN-CPU/QNN-GPU/QNN-NPU backends. GGML_OP_ADD now works fine on the cDSP and has been verified with llama-cli: the NPU performance is good.

ggml-hexagon-offload-to-cdsp-directly.mp4

I think this approach is quite similar to ggml-opencl: we write some "kernels" on the cDSP and then offload specific ggml ops to the cDSP directly. One complication with this approach is that building the whole source tree is not easy, because part of the code runs on the AP (Arm CPU) and part runs on the cDSP (NPU), whereas OpenCL provides a mature mechanism for managing and compiling kernels.
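
A rough sketch of the AP-side dispatch this approach implies (illustrative only, not code from that PR; ggmlhexagon_add_f32 is a hypothetical name standing in for whatever remote-call stub the Hexagon SDK's FastRPC tooling would generate, with the matching kernel built separately for the cDSP):

  #include <stdbool.h>
  #include <stddef.h>

  // Placeholder for the generated remote stub; returns non-zero when offload is unavailable.
  static int ggmlhexagon_add_f32(const float * src0, const float * src1, float * dst, size_t n) {
      (void) src0; (void) src1; (void) dst; (void) n;
      return -1; // in this sketch the cDSP path is never taken
  }

  // Fallback path that runs on the AP (CPU).
  static void cpu_add_f32(const float * src0, const float * src1, float * dst, size_t n) {
      for (size_t i = 0; i < n; i++) {
          dst[i] = src0[i] + src1[i];
      }
  }

  // Offload a supported op to the cDSP when enabled, otherwise fall back to the CPU.
  void dispatch_add_f32(const float * src0, const float * src1, float * dst, size_t n, bool use_cdsp) {
      if (use_cdsp && ggmlhexagon_add_f32(src0, src1, dst, n) == 0) {
          return; // the op ran on the cDSP
      }
      cpu_add_f32(src0, src1, dst, n);
  }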

I hope mulmat also works fine on the cDSP with good NPU performance tomorrow; then that PR can be reviewed formally.

[updated on 03/22/2025, 23:22] The source code has been submitted in that PR. There is an unknown issue with mulmat on the cDSP. Another issue in that PR: I haven't yet found a general way to compile/build the Hexagon kernel functions; for now a manual method works fine in my local dev environment.

@zhouwg
Contributor

zhouwg commented Mar 24, 2025

[updated on 03/24/2025, 15:32] mulmat performance between QNN-NPU and cDSP:

case-1:
mulmat with QNN-NPU: (screenshots attached)
mulmat with cDSP (naive algorithm): (screenshots attached)

case-2:
mulmat with QNN-NPU: (screenshots attached)
mulmat with cDSP (naive algorithm): (screenshots attached)

case-3:
mulmat with QNN-NPU: (screenshots attached)
mulmat with cDSP (algorithm from ggml-dsp, a tiny customized ggml on the cDSP; the compute result is not correct): (screenshots attached)
mulmat with cDSP (naive algorithm): (screenshots attached)

case-4:
mulmat with QNN-NPU: (screenshots attached)
mulmat with cDSP (naive algorithm): (screenshots attached)

@max-krasnyansky, I think this test case verifies that your statement "QNN is not the right solution here" is absolutely correct, assuming there is no misunderstanding or mistake in the experiment: we can clearly see that mulmat on the cDSP is much faster than mulmat through QNN-NPU.

Unfortunately, this hand-written mulmat implementation on the cDSP is very naive (without any optimization) and can only pass the various cases in my self-made command-line unit tests; it does not yet work with llama-cli.
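
To make the comparison concrete, this is roughly what a "naive" fp32 mulmat (plain triple loop, no blocking, no SIMD, no threading) looks like; it is generic illustration code assuming row-major matrices, not the kernel from that PR:

  // C[M x N] = A[M x K] * B[K x N], all row-major fp32, no optimization at all.
  void naive_mul_mat_f32(const float * A, const float * B, float * C,
                         int M, int N, int K) {
      for (int i = 0; i < M; i++) {
          for (int j = 0; j < N; j++) {
              float sum = 0.0f;
              for (int k = 0; k < K; k++) {
                  sum += A[i * K + k] * B[k * N + j];
              }
              C[i * N + j] = sum;
          }
      }
  }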

I think AI experts will have a chance to improve the mulmat algorithm on the cDSP if that PR is approved: this approach is exactly what you pointed out, where developers and AI experts can do a lot directly on the cDSP rather than building on top of the black box inside the QNN SDK.

I'd like to make a small contribution to Qualcomm even though I'm an independent programmer. Thanks so much!

@zhouwg
Contributor

zhouwg commented Mar 25, 2025

[updated on 03/25/2025, 21:10]
I have now successfully ported the original ggml_compute_forward_add (here ggml_compute_forward_add_f32; others can be added easily) and ggml_compute_forward_mul_mat to the cDSP.

case-1: small matrix (cDSP's performance is much, much faster than QNN-NPU)
./scripts/build-run-android.sh run_testop MUL_MAT 2 with inference_approach = 1 in scripts/ggml-qnn.cfg (through the cDSP directly): (screenshot attached)

./scripts/build-run-android.sh run_testop MUL_MAT 2 with inference_approach = 0 in scripts/ggml-qnn.cfg (through QNN): (screenshots attached)

case-2: big matrix

mulmat UT through QNN-NPU: (screenshots attached)

mulmat UT through cDSP directly: (screenshots attached)

@zhouwg
Contributor

zhouwg commented Mar 27, 2025

[updated on 03/27/2025, 17:49]
NPU performance between QNN and cDSP ------ GGML_OP_ADD

case-1 (a special case chosen to avoid the complex padding logic on the cDSP side):
matrix src0: fp32, 1024 x 1024
matrix src1: fp32, 1024 x 1024

with QNN-NPU: (screenshot attached)

with cDSP (using the algorithm from the original ggml): (screenshot attached)

with cDSP (using Hexagon SIMD instructions with a naive algorithm, without the original ggml algorithm): (screenshots attached)

with cDSP (using Hexagon SIMD instructions with the original ggml algorithm): (screenshot attached)

with cDSP (using Hexagon SIMD instructions with a multithreaded version of the original ggml algorithm):
TBD (HVX multithread programming is not easy at the moment, and GGML_OP_ADD on the cDSP has already seen a 10x performance gain)

[updated on 03/28/2025, 18:53] I'm sorry, the data in the test case above is not entirely accurate because there was a bug in my HVX SIMD intrinsic code (screenshot attached): qfloat32 is different from standard IEEE-754 float, so Q6_Vsf_equals_Vqf32 has to be used to convert qf32 results back to fp32.

Even with that fix, GGML_OP_ADD on the cDSP is still much faster than GGML_OP_ADD through QNN-NPU.
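
A minimal sketch of the fix described above (illustrative code; only Q6_Vsf_equals_Vqf32 is taken from the comment, while the header names, Q6_Vqf32_vadd_VsfVsf, and the 128-byte vector assumptions are my understanding of the Hexagon SDK): the sf+sf add produces a qf32 result, which must be converted back to IEEE fp32 before being stored.

  #include <hexagon_types.h>       // HVX_Vector (header names per the Hexagon SDK, as I understand it)
  #include <hvx_hexagon_protos.h>  // Q6_* HVX intrinsics

  // Element-wise fp32 add; assumes n is a multiple of 32 and 128-byte aligned buffers.
  void hvx_add_f32(const float * restrict src0, const float * restrict src1,
                   float * restrict dst, int n) {
      const int VLEN = (int)(128 / sizeof(float)); // 32 fp32 lanes per 128-byte HVX vector
      for (int i = 0; i < n; i += VLEN) {
          HVX_Vector va = *(const HVX_Vector *)(src0 + i);
          HVX_Vector vb = *(const HVX_Vector *)(src1 + i);
          HVX_Vector vsum_qf32 = Q6_Vqf32_vadd_VsfVsf(va, vb);       // result is in qf32 format, not IEEE fp32
          *(HVX_Vector *)(dst + i) = Q6_Vsf_equals_Vqf32(vsum_qf32); // convert qf32 -> fp32 before storing
      }
  }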

NPU performance between QNN and cDSP ------ GGML_OP_ADD (re-run after the fix)

case-1 (a special case chosen to avoid the complex padding logic on the cDSP side):
matrix src0: fp32, 1024 x 1024
matrix src1: fp32, 1024 x 1024

with QNN-NPU: (screenshot attached)

with cDSP (using the algorithm from the original ggml): (screenshot attached)

with cDSP (using Hexagon SIMD instructions with a naive algorithm, without the original ggml algorithm): (screenshot attached)

with cDSP (using Hexagon SIMD instructions with the original ggml algorithm): (screenshot attached)

HVX SIMD intrinsics are not easy and I'm still learning them (the principle is similar to Intel's x86 MMX/SSE, and I have some experience with the x86 ISA and MMX/SSE instructions). I think the performance of GGML_OP_ADD on the cDSP will be even better once Hexagon L2-cache prefetch and HVX multithreading are both utilized, possibly faster than the Kryo CPU. The same should apply to GGML_OP_MUL_MAT and other performance-sensitive ggml ops on the cDSP.
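
As a generic illustration of the row-splitting idea behind "HVX multithreading" (shown here with POSIX threads for portability; on the cDSP one would use the QuRT/worker-pool facilities of the Hexagon SDK instead, and each worker would run the HVX kernel on its slice):

  #include <pthread.h>
  #include <stddef.h>

  typedef struct {
      const float * src0;
      const float * src1;
      float       * dst;
      size_t        count; // number of elements in this slice
  } add_slice_t;

  static void * add_worker(void * arg) {
      add_slice_t * s = (add_slice_t *) arg;
      for (size_t i = 0; i < s->count; i++) {
          s->dst[i] = s->src0[i] + s->src1[i]; // scalar here; an HVX kernel in practice
      }
      return NULL;
  }

  // Split the element range into contiguous slices, one per worker thread.
  void add_f32_mt(const float * src0, const float * src1, float * dst, size_t n, int nthreads) {
      pthread_t   tid[8];
      add_slice_t slice[8];
      if (nthreads > 8) nthreads = 8;
      size_t chunk = n / nthreads;
      for (int t = 0; t < nthreads; t++) {
          size_t off = (size_t) t * chunk;
          slice[t] = (add_slice_t){ src0 + off, src1 + off, dst + off,
                                    t == nthreads - 1 ? n - off : chunk };
          pthread_create(&tid[t], NULL, add_worker, &slice[t]);
      }
      for (int t = 0; t < nthreads; t++) {
          pthread_join(tid[t], NULL);
      }
  }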

@zhouwg
Contributor

zhouwg commented Apr 5, 2025

1. Another, more obvious case:

GGML_OP_ADD performance comparison between QNN-NPU and cDSP during real LLM inference (test phones: a Snapdragon 8 Gen 3 Android phone and a Snapdragon 8 Elite (aka 8 Gen 4) Android phone; test model: qwen1_5-1_8b-chat-q4_0.gguf)

  git clone https://github.com/zhouwg/ggml-hexagon
  cd ggml-hexagon
  git checkout pr_to_upstream
  ./scripts/build-run-android.sh build

LLM inference through HWACCEL_CDSP (offload GGML_OP_ADD to the cDSP directly):

./scripts/build-run-android.sh run_llamacli

(screenshot attached)

LLM inference through HWACCEL_QNN (offload GGML_OP_ADD to QNN-NPU; set hwaccel_approach to 0 --- the hwaccel approach through QNN --- in scripts/ggml-hexagon.cfg and then run):

./scripts/build-run-android.sh run_llamacli

(screenshot attached)

  2. All quantized mulmat ops have worked fine on the cDSP side in my local dev environment since April 4, 2025; I'm now working on improving mulmat's performance with HVX SIMD instructions and HVX multithreading.
