Ascend NPUs seem to be a great alternative (to a Mac Studio or an EPYC build) for running quantized R1.
For example, the Atlas 300I Duo offers 140 TFLOPS of FP16 compute, 408 GB/s of memory bandwidth, and 96 GB of VRAM.
Two of these cards in a PC could run the quantized 671B R1 reasonably well, I would say (see the rough sizing sketch at the end of this post).
However, as shown in https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/CANN.md, the DeepSeek architecture is not supported yet, and low-bit quantization does not appear to be validated yet.
@hipudding Do you have plans to port low-bit quantized R1 to Ascend cards via the CANN backend?
That seems like a pretty valid use case to me...
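To put rough numbers on "reasonably well", here is a back-of-envelope sizing sketch, not a benchmark. It uses only the card specs quoted above plus the publicly reported figure that R1 activates roughly 37B of its 671B parameters per token, and it ignores KV cache, activations, and interconnect overhead, so the outputs are ceilings rather than predictions.

```cpp
// Back-of-envelope sizing sketch, not a benchmark. Card specs are the ones
// quoted above (96 GB VRAM, 408 GB/s per card); the ~37B active parameters
// per token for the 671B MoE is a publicly reported figure, treated here as
// approximate. KV cache, activations, and interconnect overhead are ignored.
#include <cstdio>

int main() {
    const double total_params  = 671e9;     // total weights in R1
    const double active_params = 37e9;      // weights touched per decoded token (MoE)
    const double vram_bytes    = 2 * 96e9;  // two Atlas 300I Duo cards
    const double bw_bytes_s    = 408e9;     // per-card bandwidth; with layer-split,
                                            // only one card streams weights at a time

    const double bits_options[] = {8.0, 4.5, 2.0};   // ~Q8_0, ~Q4_K, ~2-bit quants
    for (double bits : bits_options) {
        const double model_bytes = total_params  * bits / 8.0;
        const double token_bytes = active_params * bits / 8.0;
        std::printf("%.1f bpw: model ~%.0f GB (fits in 192 GB: %s), "
                    "decode ceiling ~%.0f tok/s (bandwidth-bound)\n",
                    bits, model_bytes / 1e9,
                    model_bytes <= vram_bytes ? "yes" : "no",
                    bw_bytes_s / token_bytes);
    }
    return 0;
}
```

Under these assumptions, fitting the whole model in 2 x 96 GB needs roughly 2-bit average quantization (Q4-class quants would need CPU offload), and single-stream decode is bounded by the 408 GB/s per-card bandwidth.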
-
@hipudding is there something we can help you with to make it happen?
-
Thank you for your interest in Ascend.
Currently, the best-supported format for running llama.cpp on Ascend is FP16, with partial support for Q8_0 and Q4_0 on certain devices.
However, based on actual testing, the execution efficiency of the quantized operators is not very high; in some cases it is even lower than FP16. In addition, hardware support for 4-bit or lower-bit quantization is not yet available on all devices.
If you want to enable quantized formats, I believe q8 (q8_0, q8_1, q8_k_m) and q4 (q4_0, q4_1, q4_k_m) are feasible. It would only require implementing the quantized versions of GGML_OP_GET_ROWS, GGML_OP_MUL_MAT, and GGML_OP_MUL_MAT_ID (see the sketch below for what such a kernel has to handle).
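For anyone curious what "implementing the quantized versions" of those ops involves, here is a minimal, self-contained sketch of Q4_0 dequantization. The block layout mirrors ggml's block_q4_0 (one FP16 scale plus 32 packed 4-bit weights), but the names (block_q4_0_sketch, get_row_q4_0) are purely illustrative; this is not the actual llama.cpp or CANN backend code. A first-cut port of GGML_OP_GET_ROWS or GGML_OP_MUL_MAT could dequantize rows like this into an FP32 buffer and reuse the existing FP16/FP32 paths, leaving native quantized kernels as a later optimization.

```cpp
// Illustrative sketch: mirrors ggml's Q4_0 block layout (32 weights per block,
// one FP16 scale + 16 bytes of packed 4-bit values). Names are hypothetical.
#include <cstdint>
#include <cstring>

constexpr int QK4_0 = 32;                  // weights per Q4_0 block

struct block_q4_0_sketch {
    uint16_t d;                            // scale, stored as IEEE fp16
    uint8_t  qs[QK4_0 / 2];                // 32 x 4-bit values, two per byte
};

// Minimal IEEE fp16 -> fp32 conversion (zero, subnormal, normal, inf/NaN).
float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;
    if (exp == 0) {
        if (mant == 0) {
            bits = sign;                                   // +/- zero
        } else {                                           // subnormal: renormalize
            int e = -1;
            do { mant <<= 1; e++; } while (!(mant & 0x400));
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | ((mant & 0x3FF) << 13);
        }
    } else if (exp == 0x1F) {
        bits = sign | 0x7F800000u | (mant << 13);          // inf / NaN
    } else {
        bits = sign | ((exp + 112) << 23) | (mant << 13);  // re-bias exponent 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Dequantize one block: weight j is the low nibble of qs[j], weight j+16 the
// high nibble, each offset by -8 and scaled by d (the same scheme ggml uses).
void dequantize_block_q4_0(const block_q4_0_sketch * b, float * out) {
    const float d = fp16_to_fp32(b->d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        out[j]             = float((b->qs[j] & 0x0F) - 8) * d;
        out[j + QK4_0 / 2] = float((b->qs[j] >>   4) - 8) * d;
    }
}

// What a naive GGML_OP_GET_ROWS fallback would do for a Q4_0 weight matrix:
// dequantize the requested row block by block into an FP32 buffer that the
// existing FP16/FP32 operators can then consume.
void get_row_q4_0(const block_q4_0_sketch * weights, int64_t n_cols,
                  int64_t row, float * out) {
    const int64_t blocks_per_row = n_cols / QK4_0;   // assumes n_cols % 32 == 0
    const block_q4_0_sketch * src = weights + row * blocks_per_row;
    for (int64_t i = 0; i < blocks_per_row; ++i) {
        dequantize_block_q4_0(&src[i], out + i * QK4_0);
    }
}
```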
We are also looking forward to seeing excellent inference performance on quantized models with Ascend in the future.