k quant #2169
base: master
Conversation
Signed-off-by: David Fan <[email protected]>
for more information, see https://pre-commit.ci
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
@yiliu30 I cannot access the details of the Code-Scan failure; probably some lint problem? Could you please share more info? Thanks!
@jiafatom The log is here, can you access it? https://dev.azure.com/lpot-inc/neural-compressor/_build/results?buildId=39386&view=logs&j=c1e234ec-db76-5d40-e8f0-e1ad3ef905a3&t=83918737-5053-51d6-e407-c96cbd8cd604 I copied it here FYI.
----------------- Current pydocstyle cmd start --------------------------
pydocstyle --convention=google $line > /neural-compressor/.azure-pipelines/scripts/codeScan/pydocstyle/../scanLog/pydocstyle.log
----------------- Current pydocstyle cmd end --------------------------
----------------- Current log file output start --------------------------
/neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py:251 in public function `quant_tensor_k_quant_cpu`:
D205: 1 blank line required between summary line and description (found 0)
/neural-compressor/neural_compressor/adaptor/ox_utils/weight_only.py:323 in public function `quant_tensor_k_quant_cuda`:
D205: 1 blank line required between summary line and description (found 0)
----------------- Current log file output end --------------------------
Error!! Please Click on the artifact button to download and view DocStyle error details.
##[error]Bash exited with code '1'.
I don't have permission to access the link, but I installed pydocstyle, checked the log, and fixed this. Thank you!
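(For reference, D205 only requires a blank line between the docstring summary line and the description that follows it. The function name below comes from the pydocstyle log; the signature and docstring text are purely illustrative, not the actual content of weight_only.py.)

def quant_tensor_k_quant_cpu(data):  # illustrative signature, not the real one
    """Quantize a tensor with the k-quant scheme (summary line).

    The description starts here, separated from the summary by a blank line,
    which is exactly what pydocstyle D205 checks for.
    """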
@yiliu30 What is the error for UT-Basic? I cannot access the log, thanks.
@XuehaoSun @chensuyue The logs should be public, right?
Yes.
I just tried it and it was accessible without logging into any account or any specific network environment; even mobile phones and iPads on the home network can access it normally. Could you try accessing it without logging into any account?
@XuehaoSun
You can ignore the CI issues. You can notify me to merge it when there are no new commits and the reviewers have approved.
@yiliu30 could you please review the PR? Thank you!
Hi @jiafatom, could you elaborate on the background in more detail? It would be great if you could provide accuracy and performance numbers for an end-to-end example.
Here are some discussions from llama.cpp explaining the idea of k-quant; I don't see any formal paper about it. We want to achieve accuracy comparable to llama.cpp. We are evaluating several quantization algorithms as candidates for our quantization script.
Hi @jiafatom, thank you for your contribution.
Hi @jiafatom, the ONNX backend in this repo is the 2.x version, and we only maintain it without developing new features. The 3.x ONNX backend quantization lives in https://github.com./onnx/neural-compressor, and new features should be developed there. So, if you want to contribute this to ONNX quantization, could you raise a new PR in https://github.com./onnx/neural-compressor?
Hi @mengniwang95, I am confused. I still see PRs merged into this repo https://github.com./intel/neural-compressor/
@mengniwang95 Thanks for the explanation! I found that there is a lot of refactoring in the new repo, like the API change in rtn_quantize. It will take time for me to align with the new repo. Can we merge this PR into this repo so that the k-quant algorithm works in the old repo? Thanks! cc @XuehaoSun
for more information, see https://pre-commit.ci
factor = np.array([rrmin + rdelta * is_ + maxq - minq]).astype(data.dtype)[0]  # numerator of the candidate inverse scale for search step is_
mask = rmin != rmax  # only update groups whose value range is nonzero
iscale_new[mask] = factor / (rmax[mask] - rmin[mask])
quant_data_new = np.clip(np.round(iscale_new * (data - rmin)), minq, maxq)  # (nb, group_size)
I think maybe there's an issue with the algorithm. Since GGUF supports float zero points, rmin is subtracted in this line. However, in INC, only integer zero points are supported, so I think rmin should be replaced by the zero point (zp).
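A rough sketch of what that substitution might look like (purely illustrative: the variable names follow the diff above, and the shapes, broadcasting, and clipping convention are assumptions, not the actual patch):

import numpy as np

# Illustrative only: replace the float minimum rmin with an integer zero point zp,
# so dequantization becomes (q - zp) / iscale instead of q / iscale + rmin.
zp_new = np.clip(np.round(-rmin * iscale_new), minq, maxq)                  # integer zero point per group
quant_data_new = np.clip(np.round(iscale_new * data) + zp_new, minq, maxq)  # quantize against zp
dequant = (quant_data_new - zp_new) / iscale_new                            # reconstruction used in the error term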
Type of Change
feature
Description
A new quantization algorithm, k-quant, based on llama.cpp (a brief sketch of the idea follows the reference link below).
Ref: https://github.com./ggml-org/llama.cpp/blob/64eda5deb9859e87a020e56bab5d2f9ca956f1de/ggml/src/ggml-quants.c
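For reviewers unfamiliar with the scheme, here is a minimal NumPy sketch of the general k-quant idea (the function name, parameters, and weighting below are illustrative assumptions, not the implementation in this PR): for each quantization group, several candidate scales are tried and the one minimizing a weighted reconstruction error is kept.

import numpy as np

def kquant_group_sketch(data, num_bits=4, nsteps=20, rrmin=-1.0, rdelta=0.1):
    # Toy per-group k-quant-style scale search; illustrative only.
    # data: 1-D array holding one quantization group.
    maxq, minq = (1 << num_bits) - 1, 0
    rmin, rmax = float(data.min()), float(data.max())
    if rmin == rmax:  # constant group: nothing to search
        return np.zeros_like(data, dtype=np.int32), 1.0, rmin
    weights = data * data  # weight larger-magnitude values more heavily
    best_q, best_scale, best_err = None, 1.0, np.inf
    for is_ in range(nsteps):
        # candidate inverse scale, perturbed around (maxq - minq) / (rmax - rmin)
        iscale = (rrmin + rdelta * is_ + maxq - minq) / (rmax - rmin)
        q = np.clip(np.round(iscale * (data - rmin)), minq, maxq)
        scale = 1.0 / iscale
        err = np.sum(weights * (q * scale + rmin - data) ** 2)  # weighted reconstruction error
        if err < best_err:
            best_q, best_scale, best_err = q.astype(np.int32), scale, err
    return best_q, best_scale, rmin  # dequantize as best_q * best_scale + rmin

The PR itself provides CPU and CUDA variants of this kind of search (quant_tensor_k_quant_cpu / quant_tensor_k_quant_cuda, per the pydocstyle log above).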
Expected Behavior & Potential Risk
Better quantization accuracy.
How has this PR been tested?
Benchmarked on ONNX models.
Dependency Change?
cupy (https://cupy.dev/), torch