
k quant #2169


Open: wants to merge 12 commits into master

Conversation

jiafatom

Type of Change

feature

Description

A new quantization algorithm, k-quant, based on llama.cpp.
Ref: https://github.com./ggml-org/llama.cpp/blob/64eda5deb9859e87a020e56bab5d2f9ca956f1de/ggml/src/ggml-quants.c

Expected Behavior & Potential Risk

Better quantization accuracy.

How has this PR been tested?

Benchmarked on ONNX models.

Dependency Change?

cupy https://cupy.dev/, torch

Signed-off-by: David Fan <[email protected]>
@yiliu30 yiliu30 requested a review from Copilot April 12, 2025 02:42
@Copilot Copilot AI (Contributor) left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

@jiafatom (Author)

@yiliu30 I cannot access the details of the Code-Scan failure; probably some lint problem? Could you please share more info? Thanks!

@yiliu30 (Contributor) commented Apr 13, 2025

@jiafatom The log is here, can you access it? https://dev.azure.com/lpot-inc/neural-compressor/_build/results?buildId=39386&view=logs&j=c1e234ec-db76-5d40-e8f0-e1ad3ef905a3&t=83918737-5053-51d6-e407-c96cbd8cd604

I copied it here FYI.

 -----------------  Current pydocstyle cmd start --------------------------
 
pydocstyle --convention=google $line > /neural-compressor/.azure-pipelines/scripts/codeScan/pydocstyle/../scanLog/pydocstyle.log
 -----------------  Current pydocstyle cmd end --------------------------
 
 -----------------  Current log file output start --------------------------
/neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py:251 in public function `quant_tensor_k_quant_cpu`:
        D205: 1 blank line required between summary line and description (found 0)
/neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py:323 in public function `quant_tensor_k_quant_cuda`:
        D205: 1 blank line required between summary line and description (found 0)
 -----------------  Current log file output end --------------------------
 
Error!! Please Click on the artifact button to download and view DocStyle error details.
 

##[error]Bash exited with code '1'.

@jiafatom (Author)

I don't have permission to access the link, but I installed pydocstyle, checked the log locally, and fixed this. Thank you!

@jiafatom (Author)

@yiliu30 What is the error for UT-Basic? I cannot access the log, thanks

@yiliu30 (Contributor) commented Apr 14, 2025

@XuehaoSun @chensuyue The logs should be public, right?

@XuehaoSun (Contributor)


Yes

@XuehaoSun (Contributor)

> See here @yiliu30 @XuehaoSun I am not in the external contributor group. [image: intel_azure]

I just tried it and it was accessible without logging into any account or using any specific network environment; even mobile phones and iPads on a home network can access it normally. Can you try it without logging into any account?
https://dev.azure.com/lpot-inc/neural-compressor/_build/results?buildId=39393&view=logs&jobId=461a3442-9d84-5494-e465-40a939a41758&j=c6aa4c58-99e4-54e9-e3eb-cd322b75c938
I also checked the failed CI, and its failure is not related to this PR.

@jiafatom (Author) commented Apr 14, 2025

@XuehaoSun I tried several ways, but at some point it still asks me to log into an account; I cannot do a guest login, and then I get the permission error shown above. Now I see that when I log out of my account, I can see the logs.
I also want to discuss how we can move this PR forward, thanks!

@XuehaoSun (Contributor)

You can ignore the CI issues. Notify me to merge it once there are no new commits and the reviewers have approved.

@jiafatom (Author)

@yiliu30 could you please review the PR? Thank you!

@XuehaoSun XuehaoSun requested a review from yiliu30 April 15, 2025 07:47
@yiliu30 (Contributor) commented Apr 15, 2025

Hi @jiafatom, could you elaborate on the background in more detail? It would be great if you could provide accuracy and performance numbers for an end-to-end example.
@mengniwang95 is the owner of the ORT backend; I have requested her review.

@yiliu30 yiliu30 requested a review from mengniwang95 April 15, 2025 13:53
@jiafatom (Author) commented Apr 15, 2025

Here are some discussions from llama.cpp explaining the idea behind k-quant; I don't see any formal paper about it. We want to achieve accuracy results comparable with llama.cpp:
ggml-org/llama.cpp#5063
ggml-org/llama.cpp#1684
https://www.reddit.com/r/LocalLLaMA/comments/1dved4c/comment/lbnvb1j/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
I just implemented it in Python using cupy (GPU acceleration).
For performance, it is faster than RTN with GPU acceleration (the CPU version is definitely much slower).

We are collecting several quantization algorithms as candidates for our quantization script,
https://github.com./microsoft/onnxruntime/blob/5125c527bc24fb4a9e9bd5c343f9bc7413d10a16/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py
I don't think there is one algorithm that is best for all cases, but we want to offer users different algorithms for quantization.

@mengniwang95 (Contributor)

Hi @jiafatom, thank you for your contribution.
Have you used the function you contributed to test the accuracy of any model? I hope you can provide some real results for this function to make sure it works well.
Furthermore, it would be better to add this k-quant as a new algorithm rather than implement it inside the rtn_quantize function.

@mengniwang95 (Contributor)

Hi @jiafatom, the ONNX backend in this repo is the 2.x version and we only maintain it without developing new features. The 3.x ONNX backend quantization is in this repo: https://github.com./onnx/neural-compressor, and it is recommended to develop new features there. So, if you want to contribute this to ONNX quantization, could you raise a new PR in https://github.com./onnx/neural-compressor?

@jiafatom (Author) commented Apr 18, 2025

Hi @mengniwang95, I am confused. I still see PRs being merged into this repo, https://github.com./intel/neural-compressor/. A few questions:

- How do you ensure that a new PR in https://github.com./intel/neural-compressor/ also gets into https://github.com./onnx/neural-compressor?
- Are all the features in https://github.com./intel/neural-compressor/ already in https://github.com./onnx/neural-compressor?
- What are the ONNX backend 2.x and 3.x versions? onnx is now 1.17.0 and onnxruntime is 1.21.0, so where are 2.x and 3.x defined?
- If "ONNX backend in this repo is 2.x version and we only maintain it without developing new features", would it be better to state that explicitly in the repo README?
- I checked the code in onnx/neural-compressor, namely https://github.com./onnx/neural-compressor/blob/main/onnx_neural_compressor/algorithms/weight_only/rtn.py. Is this algorithm aligned with intel/neural-compressor?
- I also found https://github.com./onnx/neural-compressor/blob/main/onnx_neural_compressor/quantization/matmul_nbits_quantizer.py, which seems to come from an onnxruntime Python script. Why is that code copied there? If onnxruntime changes its code, how is this copy kept aligned?

cc @XuehaoSun

@mengniwang95 (Contributor)

Hi @jiafatom, PRs merged into this repo https://github.com./intel/neural-compressor/ are mainly about quantization on Gaudi.
All the ONNX features in https://github.com./intel/neural-compressor/ are already in https://github.com./onnx/neural-compressor except mixed precision, and https://github.com./onnx/neural-compressor has some new features.
2.x and 3.x refer to the version of neural-compressor; the latest released version is currently 3.3.1, which means 3.x. ONNX-related code in https://github.com./intel/neural-compressor/ stopped receiving new features after version 2.6.1; the new 3.x API is in https://github.com./onnx/neural-compressor.
https://github.com./onnx/neural-compressor/blob/main/onnx_neural_compressor/algorithms/weight_only/rtn.py is aligned with intel/neural-compressor.
https://github.com./onnx/neural-compressor is the 3.x version based on https://github.com./intel/neural-compressor, and we want to keep the same API as onnxruntime, so we copied the code there.

@jiafatom (Author)

@mengniwang95 Thanks for the explanation! I found that there is a lot of refactoring in the new repo, like the API change in rtn_quantize, so it will take time for me to align with it. Can we merge this PR into this repo so that the k-quant algorithm works in the old repo? Thanks! cc @XuehaoSun

factor = np.array([rrmin + rdelta * is_ + maxq - minq]).astype(data.dtype)[0]
mask = rmin != rmax
iscale_new[mask] = factor / (rmax[mask] - rmin[mask])
quant_data_new = np.clip(np.round(iscale_new * (data - rmin)), minq, maxq) # (nb, group_size)
Contributor left a comment on the code above:

I think maybe there's an issue with the algorithm. Since GGUF supports float zero points, rmin is subtracted in this line. However, in INC, only integer zero points are supported, so I think rmin should be replaced by the zero point (zp).
