
k quant #2169


Open: wants to merge 12 commits into master

Conversation

jiafatom

Type of Change

feature

Description

A new quantization algorithm, k-quant, based on llama.cpp.
Ref: https://github.com./ggml-org/llama.cpp/blob/64eda5deb9859e87a020e56bab5d2f9ca956f1de/ggml/src/ggml-quants.c

Expected Behavior & Potential Risk

Better quantization accuracy.

How has this PR been tested?

Benchmarked on ONNX models.

Dependency Change?

cupy https://cupy.dev/, torch

Signed-off-by: David Fan <[email protected]>
@yiliu30 yiliu30 requested a review from Copilot April 12, 2025 02:42
@Copilot Copilot AI (Contributor) left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

@jiafatom (Author)

@yiliu30 I cannot access the details of the Code-Scan failure; probably some lint problem? Could you please share more info? Thanks!

@yiliu30 (Contributor) commented Apr 13, 2025

@jiafatom The log is here, can you access it? https://dev.azure.com/lpot-inc/neural-compressor/_build/results?buildId=39386&view=logs&j=c1e234ec-db76-5d40-e8f0-e1ad3ef905a3&t=83918737-5053-51d6-e407-c96cbd8cd604

I copied it here FYI.

 -----------------  Current pydocstyle cmd start --------------------------
 
pydocstyle --convention=google $line > /neural-compressor/.azure-pipelines/scripts/codeScan/pydocstyle/../scanLog/pydocstyle.log
 -----------------  Current pydocstyle cmd end --------------------------
 
 -----------------  Current log file output start --------------------------
/neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py:251 in public function `quant_tensor_k_quant_cpu`:
        D205: 1 blank line required between summary line and description (found 0)
/neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py:323 in public function `quant_tensor_k_quant_cuda`:
        D205: 1 blank line required between summary line and description (found 0)
 -----------------  Current log file output end --------------------------
 
Error!! Please Click on the artifact button to download and view DocStyle error details.
 

##[error]Bash exited with code '1'.

@jiafatom (Author)

I don't have permission to access the link, but I installed pydocstyle, checked the log locally, and fixed this. Thank you!

@jiafatom (Author)

@yiliu30 What is the error for UT-Basic? I cannot access the log, thanks

@yiliu30 (Contributor) commented Apr 14, 2025

@XuehaoSun @chensuyue The logs should be public, right?

@XuehaoSun (Contributor)


Yes

@XuehaoSun (Contributor)

> See here @yiliu30 @XuehaoSun I am not in the external contributor group. [image: intel_azure]

I just tried it and it was accessible without logging into any account or using any specific network environment; even mobile phones and iPads on a home network can access it normally. Can you try it without logging into any account?
https://dev.azure.com/lpot-inc/neural-compressor/_build/results?buildId=39393&view=logs&jobId=461a3442-9d84-5494-e465-40a939a41758&j=c6aa4c58-99e4-54e9-e3eb-cd322b75c938
I also checked the failed CI, and its failure is not related to this PR.

@jiafatom (Author) commented Apr 14, 2025

@XuehaoSun I tried several ways, but at some point it still asks me to log into an account; I cannot do a guest login, and then I get the permission error shown above. Now I see that when I log out of my account, I can see the logs.
I also want to discuss how we can move this PR forward, thanks!

@XuehaoSun (Contributor)

You can ignore the CI issues. Notify me to merge it once there are no new commits and the reviewers have approved.

@jiafatom (Author)

@yiliu30 could you please review the PR? Thank you!

@XuehaoSun XuehaoSun requested a review from yiliu30 April 15, 2025 07:47
@yiliu30 (Contributor) commented Apr 15, 2025

Hi @jiafatom, could you elaborate on the background in more detail? It would be great if you could provide accuracy and performance numbers for an end-to-end example.
@mengniwang95 is the owner of the ORT backend; I have requested her review.

@yiliu30 yiliu30 requested a review from mengniwang95 April 15, 2025 13:53
@jiafatom (Author) commented Apr 15, 2025

Here are some discussions from llama.cpp explaining the idea behind k-quant; I don't see any formal paper about it. We want to achieve accuracy results comparable with llama.cpp:
ggml-org/llama.cpp#5063
ggml-org/llama.cpp#1684
https://www.reddit.com/r/LocalLLaMA/comments/1dved4c/comment/lbnvb1j/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
I just implemented it in Python using cupy (GPU acceleration).
For performance, it is faster than RTN with GPU acceleration (the CPU version is definitely much slower).

We are collecting several quantization algorithms as candidates for our quantization script,
https://github.com./microsoft/onnxruntime/blob/5125c527bc24fb4a9e9bd5c343f9bc7413d10a16/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py
I don't think there is one algorithm that is best for all cases, but we want to offer users different algorithms for quantization.

@mengniwang95 (Contributor)

Hi @jiafatom, thank you for your contribution.
Have you used the function you contributed to test the accuracy of any model? I hope you can provide some real results for this function to make sure it works well.
Furthermore, it would be better to add this k-quant as a new algorithm rather than implement it inside the rtn_quantize function.

@mengniwang95 (Contributor)

Hi @jiafatom, the ONNX backend in this repo is the 2.x version and we only maintain it without developing new features. The 3.x ONNX backend quantization is in this repo: https://github.com./onnx/neural-compressor, and it is recommended to develop new features there. So, if you want to contribute this to ONNX quantization, could you raise a new PR in https://github.com./onnx/neural-compressor?

@jiafatom (Author) commented Apr 18, 2025

Hi @mengniwang95, I am confused. I still see PRs being merged into this repo, https://github.com./intel/neural-compressor/. A few questions:

- How do you ensure that a new PR in https://github.com./intel/neural-compressor/ also gets into https://github.com./onnx/neural-compressor?
- Are all the features in https://github.com./intel/neural-compressor/ already in https://github.com./onnx/neural-compressor?
- What are the ONNX backend 2.x and 3.x versions? onnx is now 1.17.0 and onnxruntime is 1.21.0, so where are 2.x and 3.x defined?
- If "ONNX backend in this repo is 2.x version and we only maintain it without developing new features", would it be better to state that explicitly in the repo README?
- I checked the code in onnx/neural-compressor, namely https://github.com./onnx/neural-compressor/blob/main/onnx_neural_compressor/algorithms/weight_only/rtn.py. Is this algorithm aligned with intel/neural-compressor?
- I also found https://github.com./onnx/neural-compressor/blob/main/onnx_neural_compressor/quantization/matmul_nbits_quantizer.py, which seems to come from an onnxruntime Python script. Why is that code copied there? If onnxruntime changes its code, how is this copy kept aligned?

cc @XuehaoSun

@mengniwang95 (Contributor)

Hi @jiafatom, PRs merged into this repo https://github.com./intel/neural-compressor/ are mainly about quantization on Gaudi.
All the ONNX features in https://github.com./intel/neural-compressor/ are already in https://github.com./onnx/neural-compressor except mixed precision, and https://github.com./onnx/neural-compressor has some new features.
2.x and 3.x refer to the version of neural-compressor; the latest released version is currently 3.3.1, which means 3.x. ONNX-related code in https://github.com./intel/neural-compressor/ stopped receiving new features after version 2.6.1; the new 3.x API is in https://github.com./onnx/neural-compressor.
https://github.com./onnx/neural-compressor/blob/main/onnx_neural_compressor/algorithms/weight_only/rtn.py is aligned with intel/neural-compressor.
https://github.com./onnx/neural-compressor is the 3.x version based on https://github.com./intel/neural-compressor, and we want to keep the same API as onnxruntime, so we copied the code there.

@jiafatom (Author)

@mengniwang95 Thanks for the explanation! I found that there is a lot of refactoring in the new repo, like the API change in rtn_quantize, so it will take time for me to align with it. Can we merge this PR into this repo so that the k-quant algorithm works in the old repo? Thanks! cc @XuehaoSun

factor = np.array([rrmin + rdelta * is_ + maxq - minq]).astype(data.dtype)[0]
mask = rmin != rmax
iscale_new[mask] = factor / (rmax[mask] - rmin[mask])
quant_data_new = np.clip(np.round(iscale_new * (data - rmin)), minq, maxq) # (nb, group_size)
Contributor left a comment on the code above:

I think maybe there's an issue with the algorithm. Since GGUF supports float zero points, rmin is subtracted in this line. However, in INC, only integer zero points are supported, so I think rmin should be replaced by the zero point (zp).
