SYCL: Rename oneMKL to oneMath #12192


Merged: 17 commits merged into ggml-org:master on Apr 1, 2025

Conversation

@Rbiessy (Collaborator) commented Mar 5, 2025

oneMKL Interface was moved to the uxlfoundation organization and renamed to oneMath. This PR updates the SYCL backend to use the new library.

So far the Intel backend has been using Intel oneMKL directly. After the renaming, Intel oneMKL and oneMath have different namespaces, which makes it much easier to ensure that all SYCL backends use oneMath. As a consequence, INTEL_CPU and INTEL_GPU targets have been introduced to make it possible to use the oneMath compile-time dispatcher. The INTEL target can still be used but will fall back to the runtime dispatcher, which can introduce a small overhead.
Unlike Intel oneMKL, oneMath is not available as a pre-built package. To keep it as easy to use as before, oneMath is fetched and compiled automatically if it is not provided via oneMath_DIR.

The new version of oneMath includes CMake improvements that make it possible to properly integrate oneMath with CMake rather than relying on environment variables such as LIBRARY_PATH or CPLUS_INCLUDE_DIR.

Using get_onemath_backend lets us avoid duplicating paths when using oneMath with different backends. The changes will also help us avoid compilation issues when using APIs that are only available in Intel oneMKL.
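For illustration, below is a minimal sketch of how such a helper can choose between the compile-time and runtime dispatchers. The macro names, the oneMath header, and the backend enum value are assumptions for the example, not necessarily the exact ones used in this PR:

```cpp
// Sketch only: assumes hypothetical GGML_SYCL_* macros and the oneMath
// backend_selector API; the actual helper in the PR may differ.
#include <sycl/sycl.hpp>
#include <oneapi/math.hpp>  // assumed oneMath umbrella header

inline auto get_onemath_backend(sycl::queue & queue) {
#if defined(GGML_SYCL_INTEL_GPU)
    // Compile-time dispatcher: the MKL GPU backend is resolved at build time,
    // avoiding the small overhead of runtime dispatch.
    return oneapi::math::backend_selector<oneapi::math::backend::mklgpu>{ queue };
#else
    // Runtime dispatcher: passing the plain queue lets oneMath pick a backend
    // (Intel, NVIDIA, AMD, ...) when the program runs.
    return queue;
#endif
}

// Either return type can be passed as the first argument of the BLAS calls,
// e.g. oneapi::math::blas::column_major::gemm(get_onemath_backend(q), ...).
```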

I believe the changes will also fix the issues mentioned in #10851 in a cleaner way.

@github-actions bot added labels on Mar 5, 2025: documentation (Improvements or additions to documentation), examples, ggml (changes relating to the ggml tensor library for machine learning), SYCL (GPU programming language, https://en.wikipedia.org/wiki/SYCL)
@Rbiessy requested a review from Alcpz on March 5, 2025, 10:52
@Rbiessy (Collaborator, Author) commented Mar 5, 2025

This change should not affect performance for Intel backends as long as the compile-time dispatcher is used. Some performance results running on PVC before my changes:

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | pp512 | 512.45 ± 1.26 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | tg128 | 4.90 ± 0.00 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 3411.37 ± 14.70 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 34.82 ± 0.18 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 3230.48 ± 21.10 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 21.61 ± 0.10 |

with these changes:

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | pp512 | 514.92 ± 3.08 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | tg128 | 4.89 ± 0.00 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 3399.59 ± 13.79 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 34.74 ± 0.16 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 3259.86 ± 10.15 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 21.53 ± 0.08 |

@s-Nick (Collaborator) commented Mar 6, 2025

The changes look good to me, but when I build from scratch there are a lot of warnings from oneMath.
Is there a way to hide them? I think it would provide a better user experience and would also help developers monitor only the warnings coming from llama.cpp.

@NeoZhangJianyu (Collaborator) left a comment

oneMath is a new and good product for multiple platforms.
It is well suited to supporting multiple GPUs.

Because the SYCL backend is used in some commercial products for Intel GPUs, we must provide a stable solution based on stable components.

So for Intel GPUs I think we should continue using the components provided by the official oneAPI toolkit: the compiler, oneMKL, and oneDNN.

If you want to use oneMath right now, please use a macro to define a new branch for the non-Intel GPU code path.
When that is done, please build and verify on a pure oneAPI environment for Intel GPUs.

Additionally, please consider whether the SYCL backend needs to support Intel CPUs.
I don't think it brings more benefit to Intel CPU users than the CPU backend.

Thank you!

beta_value, data_c, ldc);
#endif
}
template <class Ta, class Tb, class Tc, class Ts>
Collaborator:

I suggest wrapping the gemm() functions of oneMKL and oneMath in one unified API in dpct::helper, so that the ggml code stays clear and simple.

Collaborator Author:

I don't understand the suggestion. I am still planning to make the switch to always use oneMath, so I don't think it is relevant in that case?

Collaborator:

For the Intel path, I suggest using oneMKL, which comes from the official oneAPI package, instead of oneMath as a third-party dependency.

If we define a unified API, like dpct::gemm(dpct::transpose a_trans, xxxx, int ldc), this new function will hide the differences between oneMath and oneMKL.

That means ggml-sycl.cpp won't call oneMKL or oneMath directly; only dpct::helper would call oneMKL or oneMath.
So when we change the low-level code, ggml-sycl.cpp won't be impacted directly.
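For illustration, a minimal sketch of the kind of unified wrapper being suggested is shown below. The GGML_SYCL_INTEL macro, the header names, and the exact signature are hypothetical; the point is only that ggml-sycl.cpp would call dpct::gemm() while the helper decides whether Intel oneMKL or oneMath is used underneath:

```cpp
// Sketch only: hypothetical macro and headers; not the code in this PR.
#include <cstdint>
#include <sycl/sycl.hpp>
#if defined(GGML_SYCL_INTEL)
#include <oneapi/mkl.hpp>   // closed-source Intel oneMKL from the oneAPI toolkit
#else
#include <oneapi/math.hpp>  // open-source oneMath (assumed umbrella header)
#endif

namespace dpct {

#if defined(GGML_SYCL_INTEL)
using transpose = oneapi::mkl::transpose;   // Intel path: types come from oneMKL
#else
using transpose = oneapi::math::transpose;  // other vendors: types come from oneMath
#endif

// One entry point for ggml-sycl.cpp; the #if below is the only place that
// knows which BLAS library is linked.
template <class Ta, class Tb, class Tc, class Ts>
void gemm(sycl::queue & q, transpose a_trans, transpose b_trans, std::int64_t m, std::int64_t n,
          std::int64_t k, Ts alpha, const Ta * a, std::int64_t lda, const Tb * b, std::int64_t ldb,
          Ts beta, Tc * c, std::int64_t ldc) {
#if defined(GGML_SYCL_INTEL)
    oneapi::mkl::blas::column_major::gemm(q, a_trans, b_trans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
#else
    oneapi::math::blas::column_major::gemm(q, a_trans, b_trans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
#endif
}

} // namespace dpct
```

A call site would then look like dpct::gemm(q, dpct::transpose::nontrans, dpct::transpose::trans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc), independent of which library is linked.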

Collaborator Author:

I see. I still prefer the approach of using only oneMath, which already hides the differences between Intel oneMKL and other vendors' libraries.

@qnixsynapse (Collaborator) left a comment

Tested on an Intel A750 with intel-oneapi-base-toolkit 2025.0.1, works okay.

LGTM.

Edit: Also, we should remove the mention of Intel CPU from the docs as this backend only targets GPUs.

@NeoZhangJianyu (Collaborator) commented Mar 7, 2025

I tested the code in a pure oneAPI environment. It passed. It's good!

I don't oppose adding new packages for non-Intel GPUs. I understand your point.
I hope the SYCL backend for Intel GPUs keeps using stable official packages.

If you want to add oneMath and more packages, I suggest dividing the SYCL backend code into Intel GPU and non-Intel GPU paths.

At the code level, I see two alternatives:

  1. Design a new class that provides a unified API for the gemm() functions based on oneMKL, oneMath, or oneDNN (recommended).
    For Intel, call oneMKL or oneDNN as the base.
    For non-Intel, call oneMath.

    In CMakeLists.txt, I suggest designing two branches, Intel and non-Intel,
    so you only need to update the non-Intel branch.

  2. Divide the SYCL backend into two: a SYCL backend and a multi-vendor SYCL backend.

  • The SYCL backend supports Intel devices only.
  • The multi-vendor SYCL backend supports Intel/NV/AMD devices.
    These two backends would be maintained separately, with fewer conflicts.
    They could refer to each other's code and design.
    The code could be clean and compact. In the current code there are branches for Intel and non-Intel, and there will be more of them as more optimizations land in the future.

Thank you!

@NeoZhangJianyu (Collaborator):
Another concern is that CMake will download oneMath during the build.
That will impact the build speed.

Is it possible to move the download from CMake to a preparation step, like installing oneAPI?

Thank you!

@Rbiessy (Collaborator, Author) commented Mar 12, 2025

@NeoZhangJianyu @airMeng there are a few comments raising concerns about introducing oneMath for Intel devices.

Pasting the same reply from one of those comments: the SYCL backend is meant to be a generic backend that can support multiple hardware vendors, so we should avoid splitting into different paths as long as we can maintain performance. oneMath is designed to solve exactly this issue.

Another concern is that CMake will download oneMath during the build.

Using FetchContent already downloads oneMath at CMake configuration time, see https://cmake.org/cmake/help/latest/module/FetchContent.html. A user can also choose not to use FetchContent by building and installing oneMath separately and setting -DoneMath_DIR.

@Rbiessy (Collaborator, Author) commented Mar 13, 2025

@NeoZhangJianyu @airMeng I believe I have addressed all the comments. Do let me know if you have any remaining concerns with the approach that I haven't answered. If not, please do not keep the PR blocked, thanks!

@NeoZhangJianyu (Collaborator):

Thank you for your patience in answering the comments! :)

  1. I think oneMath is an immature product right now.
  2. I don't oppose using it for the non-Intel GPU case.
    If oneMath is only used for non-Intel GPUs, I agree to merge it.

Here is my reasoning:
For 1:
oneMath's latest release is 0.6, from Nov 7, 2024.
There are still warnings in the oneMath build log in CI.
oneMath doesn't cover all oneMKL APIs yet. That means some existing oneMKL APIs cannot be used through the oneMath API.

For 2:

  1. The only benefit I see is reducing the code branching between the Intel and non-Intel cases.

  2. Side effects:

a. More build time:
       oneMath needs to be built from source when used in llama.cpp.
       Including the download and the build, the build time increases from 21 s to 42 s on Linux.

b. More library files:
     On Windows, it creates 3 DLL files in build/bin:
         557056      onemath.dll
         121745920   onemath_blas_mklcpu.dll
         121745920   onemath_blas_mklgpu.dll

         Linking against the new DLL files also takes more time, but I didn't record it.

     On Linux, it creates 3 files in `build/lib`:
         2703600   libonemath_blas_mklcpu.so.0
         2703600   libonemath_blas_mklgpu.so.0
         914544    libonemath.so.0

  I don't know whether they need to be included in the llama.cpp release package.
  If yes, it means more disk cost.

c. It breaks the offline build style of the SYCL backend.
   Although oneMath is downloaded automatically during the build, it still adds a dependency.
   Some developers have unstable internet access to GitHub.
   The additional download increases the CI risk and the build time.

d. It adds new warnings from oneMath.

I think oneMath is a good product for reducing the multiple APIs across platforms.
But it's not good enough yet.
The SYCL backend is used by some commercial developers for Intel GPUs, not for non-Intel GPUs.
That's why I always highlight using the stable official packages (like the oneAPI toolkit).

For non-Intel GPUs, oneMath is OK. Maybe nobody cares about the side effects above.

That is all I want to say.

If you would still like to merge this PR, I can approve it.
But could I then create another PR that removes the side effects above for Intel GPUs by skipping oneMath for Intel GPUs?

By the way, when we created the SYCL backend, Intel GPUs were the only target device. That is still my focus now.
I contribute to llama.cpp as a personal contributor in my spare time.
It comes from my own interest, and I hope more Intel GPU users can enjoy llama.cpp.

Thank you!

@NeoZhangJianyu (Collaborator):

@zunigasllc why spam-join this review?
Is this PR so hot? :)

@qnixsynapse (Collaborator):

@NeoZhangJianyu please ignore, it's a spam account.

@Rbiessy (Collaborator, Author) commented Mar 18, 2025

@NeoZhangJianyu, thank you for the feedback.

Regarding your point 1: the CI only sees a few warnings in the tests on Linux, related to the random number generation. They are harmless and not enough to consider the library unstable. oneMath is only used for GEMM and batched GEMM. As far as I know, all the features provided by Intel oneMKL for these operations are also available in oneMath. If that's not the case, we can help integrate the missing features into oneMath.

Regarding the side effects in your point 2, the main benefit is indeed to reduce the code branching. I believe this helps SYCL developers in general. It would also help avoid issues such as this one, where a namespace that is used is only available in Intel oneMKL but not in oneMath.
a. & b. I agree there is a cost in build time and disk space (under 4 MB in total). Note that with the recent changes the library libonemath_blas_mklcpu.so should not be built anymore. With ccache enabled the extra compilation time will be barely measurable.
c. As documented here, users can still download, build, and install oneMath separately if that is important for them. Yes, it is one more dependency; that is the solution designed by Intel to provide generic SYCL packages. Note that we have also tried to push for oneMath to be available as a pre-built package in the discussions here. This is not a priority at the moment, but integrating oneMath into llama.cpp would be one more argument for it. It is a chicken-and-egg problem: we can't improve oneMath if it is not used anywhere, and we can't use it in projects if it is not "good enough". From our experience oneMath is indeed mature enough; it is already used in GROMACS.
d. I have fixed the warnings that were showing up. llama.cpp enables strict warnings globally; these have been disabled for oneMath. As mentioned above, the Intel CI does not show warnings with the default warning level for the library itself. See for instance this job from this GitHub CI run.

If you would still like to merge this PR, I can approve it.

Yes, it would really help us if you did not block such PRs, thanks. Your feedback is welcome though!

But could I then create another PR that removes the side effects above for Intel GPUs by skipping oneMath for Intel GPUs?

If customers face issues, I think we should first look into fixing them in oneMath. If that is not possible, we could consider reverting to using Intel oneMKL directly. Until then I would like to try using oneMath.

I hope this answers all of the remaining concerns.

@airMeng (Collaborator) commented Mar 19, 2025

How about the performance on BMG/LNL? The performance numbers you show are only for PVC.

@NeoZhangJianyu (Collaborator):
@Rbiessy
I still think Intel GPUs should not be impacted by this PR (oneMath).
This year the optimization gate will open, but Intel and CUDA/HIP can't walk the same path: it's impossible to make the SYCL backend achieve good performance on Intel, NV, and AMD GPUs with the same code.

The SYCL backend is most important to Intel users and ISVs on Intel GPUs.
We must make sure this path is always stable and fast.

I hope you will allow me to create another PR to skip oneMath for Intel GPUs.

@Rbiessy (Collaborator, Author) commented Mar 19, 2025

How about the performance on BMG/LNL? The performance numbers you show are only for PVC.

@airMeng I did not run them before because I was confident it would not show any measurable difference either. The results below are for a couple of small models. The differences are just noise.

master results on BMG

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 2074.35 ± 2.41 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 31.75 ± 0.16 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 1921.31 ± 335.13 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 10.55 ± 0.02 |

PR results on BMG

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 2075.38 ± 4.72 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 32.80 ± 0.18 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 1897.90 ± 277.29 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 10.63 ± 0.01 |

master results on LNL

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 534.03 ± 1.25 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 12.10 ± 0.85 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 548.45 ± 4.34 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 3.86 ± 0.01 |

PR results on LNL

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 535.93 ± 2.77 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 11.84 ± 1.00 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 574.42 ± 2.25 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 3.99 ± 0.19 |

@Rbiessy (Collaborator, Author) commented Mar 19, 2025

Intel and CUDA/HIP can't walk the same path: it's impossible to make the SYCL backend achieve good performance on Intel, NV, and AMD GPUs with the same code.

The SYCL backend is most important to Intel users and ISVs on Intel GPUs. We must make sure this path is always stable and fast.

@NeoZhangJianyu We agree on this. We are not trying to make all backends use the same path for every operation. For the common fp16 or fp32 GEMM it is possible because all vendors provide native libraries with similar APIs. We also want to make sure the performance is optimal on Intel devices. This change just helps us clean up some code.
From what we have seen so far, oneMath operations are not used that much, in particular when oneDNN is enabled, so I think you are making this a bigger problem than it really is. The biggest challenge for improving SYCL backend performance today is optimizing the quantized matmul variants, which rely on neither oneMath nor oneDNN.

I hope you will allow me to create another PR to skip oneMath for Intel GPUs.

As I said above, if there are important issues, first let us know what they are and we can discuss solutions then. We could find a way to allow using Intel oneMKL again if needed.

@NeoZhangJianyu (Collaborator) commented Mar 20, 2025

I have listed the side effects of this PR.

@airMeng (Collaborator) left a comment

Looks good on BMG/LNL, waiting for more feedback from the user side

@Rbiessy (Collaborator, Author) commented Mar 20, 2025

For context, @NeoZhangJianyu found that oneMath is a large library on Windows, around 100 MB. This is due to static linking; I created an issue about it here: uxlfoundation/oneMath#654

I agree this is a valid concern, so I'll revert to using Intel oneMKL directly for Intel devices.

@Rbiessy (Collaborator, Author) commented Mar 28, 2025

@NeoZhangJianyu I've switched back to using Intel oneMKL for Intel devices in 995aea3

Let me know if you have any concerns with this approach, thanks.

@@ -300,6 +280,8 @@ For AMD GPUs we should expect at least one SYCL-HIP device [`hip:gpu`]:

### II. Build llama.cpp

The SYCL backend depends on [oneMath](https://github.com./uxlfoundation/oneMath). By default it is automatically built along with the project. A specific build can be provided by setting the CMake flag `-DoneMath_DIR=/path/to/oneMath/install/lib/cmake/oneMath`.
@NeoZhangJianyu (Collaborator) commented Mar 31, 2025

This description is not mandatory for Intel GPUs.
I suggest moving it to the NV/AMD GPU chapters.

Collaborator Author:

Moved in 06fe2ca

@NeoZhangJianyu (Collaborator) left a comment

I have added several small comments.
There is also a common issue in all the changed CPP files: they still call oneapi::math for Intel GPUs.
Should they roll back to "mkl" for Intel GPUs?

target_compile_definitions(ggml-sycl PRIVATE GGML_SYCL_AMD)
else()
# Fallback to oneMath runtime dispatcher
target_link_libraries(ggml-sycl PRIVATE ONEMATH::onemath)
Collaborator:

This path is in fact for Intel.
Please update it to use oneMKL.

Collaborator Author:

auto data_a = get_memory<const Ta>(a);
auto data_b = get_memory<const Tb>(b);
auto data_c = get_memory<Tc>(c);
oneapi::math::blas::column_major::gemm(get_onemath_backend(q), a_trans, b_trans, m, n, k, alpha_value, data_a,
Collaborator:

Should it roll back to "mkl" for Intel GPUs?

@NeoZhangJianyu (Collaborator) left a comment

It's great work!
Thank you!

@NeoZhangJianyu merged commit 8293970 into ggml-org:master on Apr 1, 2025
48 checks passed
@Rbiessy (Collaborator, Author) commented Apr 1, 2025

Thank you :)

@Rbiessy Rbiessy deleted the romain/rename_onemath branch April 1, 2025 08:45