Can madlad400 gguf models from huggingface be used? #8300
-
I compiled the latest version, which has T5 support, and tried running a madlad400 model from https://huggingface.co/jbochi/madlad400-3b-mt/resolve/main/model-q4k.gguf
but it gave this error:
llama_model_load: error loading model: error loading model architecture: unknown model architecture: ''
Is there a change in the conversion process from .safetensors that is needed for T5 models?
And if there is, is there a gguf-my-repo recent enough to support T5 models?
My version:
./llama-cli --version
version: 1 (807b0c4)
built with cc (GCC) 11.2.0 for x86_64-slackware-linux
-
Well, I couldn't figure out a way to use jbochi's gguf directly either.
If you take a look at what gguf_dump.py says, there are some sections missing,
such as general.architecture that it's complaining about, as well as tokenizer.ggml.tokens and who knows what else.
I suppose that stuff is available, it's just that it's not in the .gguf itself but in separate files (which means that it's no use for llama.cpp?).
So I think it's necessary to use the conversion script convert_hf_to_gguf.py.
However my understanding is that it would be better NOT to use a quantized model as a source if you can avoid it.
Also, you may need to download one additional file from HF called spiece.model if you don't have it already.
(I would like to give a little bit more specific info, but haven't been able to try this myself, and won't be for some time...)
Btw super enthused about these recent additions, this project just keeps on getting better :D
I've previously used the smallest MADLAD400 model with candle and while it works pretty well, there are some strange glitches that I haven't been able to weed out.
So it'll be interesting to see how llama.cpp handles it.
EDIT: OK so this behaves somewhat similarly to candle, but the glitches are slightly different.
It's a helluva lot faster in llama.cpp for me, probably because candle (or rather, gemm) didn't have full support for my CPU.
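For anyone who wants to check which metadata keys a .gguf actually carries (such as the missing general.architecture and tokenizer.ggml.tokens mentioned above), here is a minimal sketch in Python. It assumes a little-endian GGUF version 2 or 3 file and only walks the key/value header. The names gguf_metadata_keys and missing_keys are made up for this example, and gguf_dump.py remains the proper tool for a full dump.

```python
import struct

# Byte sizes of the fixed-width GGUF metadata value types (type id -> size).
_FIXED = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_string(f):
    # A GGUF string is a uint64 length followed by that many UTF-8 bytes.
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8", errors="replace")


def _skip_value(f, vtype):
    # Advance past one metadata value without decoding it.
    if vtype in _FIXED:
        f.seek(_FIXED[vtype], 1)
    elif vtype == _STRING:
        (n,) = struct.unpack("<Q", f.read(8))
        f.seek(n, 1)
    elif vtype == _ARRAY:
        # An array is: element type (uint32), element count (uint64), elements.
        etype, count = struct.unpack("<IQ", f.read(12))
        if etype in _FIXED:
            f.seek(_FIXED[etype] * count, 1)
        else:
            for _ in range(count):
                _skip_value(f, etype)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def gguf_metadata_keys(f):
    """Return the metadata key names from an open binary GGUF stream."""
    magic, version = struct.unpack("<4sI", f.read(8))
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    if version < 2:
        raise ValueError("this sketch only handles GGUF v2 and later")
    _tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    keys = []
    for _ in range(kv_count):
        keys.append(_read_string(f))
        (vtype,) = struct.unpack("<I", f.read(4))
        _skip_value(f, vtype)
    return keys


# The two keys named in this thread; the real loader needs more than these.
REQUIRED = ("general.architecture", "tokenizer.ggml.tokens")


def missing_keys(f, required=REQUIRED):
    """List the required keys that are absent from the GGUF header."""
    present = set(gguf_metadata_keys(f))
    return [k for k in required if k not in present]
```

Usage would be something like `with open("model-q4k.gguf", "rb") as f: print(missing_keys(f))`.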
-
Okay I got it working now (I think), and man it feels FAST!!
The bad news is that my GGUF conversion procedure from jbochi => llama.cpp was quite a messy business indeed.
It involved conjuring up an empty GGUF, filling it with metadata and doing some frankensteining with KerfuffleV2's gguf-tools. 
I also wrote a custom script to rename the tensors, and llama.cpp itself needed a teeny weeny change too.
The upside of this method is that the quantized tensors remain untouched.
I can give more details if there's interest but somehow I feel there must be a better way :D
EDIT: I've now managed to polish the conversion process a little bit, so that no llama.cpp customization is necessary any longer.
Here's the patch if anyone wants to try this version. You'll need the original jbochi model and xdelta3.
MADLAD400_GGUF_patch.tar.gz 
-
Yeah, I guess there are no other q3k quants readily available at HF? I think I saw some q2k versions though.
I've now located the scripts & I'll upload them here once I have prepared some instructions for their usage
(it's a bit of a long and boring yarn, I'm afraid).
And I think it's better to do a little bit of testing too, as I'm not sure how much has been broken in a year...
Stay tuned & watch this space :D
-
Sure, thanks so much for what you are doing, @misutoneko. If it takes an unusually long time, let me know; I'll be ready to help with anything.
-
Okay, here you go:
candle2llamacpp_gguf_conversion.tar.gz 
The breakage wasn't too bad. There's a bit of an issue with the conversion script convert_hf_to_gguf.py in that a current version can't be used. There's a July 2024 version included in the package which seems to work, but it might be beneficial to get that fixed. There's been quite a few changes in that conversion script :D
As for license, probably this never becomes an issue but just in case... it's the same as llama.cpp.
(I'd be ok with public domain, but I guess it's easier not to change that.)
-
@misutoneko Thank you so much for sharing the scripts!
I successfully used your conversion process and it worked perfectly! The Q3K model is now fully compatible with llama.cpp and working great.
-
Great to hear it's working for you.
That fallocate trick employed in the scripts would be quite useful for GGUFs in general.
It enables editing the headers in-place, without any huge temporary files.
The downside is that it's limited to Linux use only, and it would add some (more!) complexity.
Also the current GGUF format isn't ideal for its use, as you can see in these scripts...
But I suppose the format could be changed to better support fallocing, if needed.
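I haven't reproduced the actual scripts here. As a rough illustration of the underlying Linux mechanism only, the sketch below assumes the trick relies on fallocate(2) with FALLOC_FL_INSERT_RANGE, which shifts the rest of a file upward in place. Both the offset and the length must be multiples of the filesystem block size, which is probably part of why the current header layout is awkward for this. The helper names round_up and insert_gap are hypothetical, and the ctypes call assumes 64-bit Linux with a filesystem that supports insert-range (ext4 or XFS, for example).

```python
import ctypes
import ctypes.util
import os

FALLOC_FL_INSERT_RANGE = 0x20  # value from <linux/falloc.h>


def round_up(n, block):
    """Round n up to the next multiple of block."""
    return -(-n // block) * block


def insert_gap(path, offset, length):
    """Insert `length` bytes of empty space at `offset`, in place.

    Everything from `offset` onward is shifted up by `length` bytes,
    without writing a temporary copy of the file.
    """
    block = os.statvfs(path).f_bsize
    if offset % block or length % block:
        raise ValueError(f"offset and length must be multiples of {block}")
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    # int fallocate(int fd, int mode, off_t offset, off_t len);
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    fd = os.open(path, os.O_RDWR)
    try:
        if libc.fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, length) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```

For example, to make room for 100 extra header bytes you would have to insert `round_up(100, 4096)` bytes and deal with the leftover slack somehow.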
-
Has anyone made a llama.cpp-compatible GGUF for another T5 model, aya-101 (https://huggingface.co/CohereForAI/aya-101)?
-
There are GGUFs for candle available here:
https://huggingface.co/kcoopermiller/aya-101-GGUF 
There was some comment about them producing gibberish though...
If they do work with candle, it may be possible to convert them the same way as the MADLAD400 ones.
EDIT: OK I've now tested the smallest aya-101 GGUF (Q2K) in both candle and llama.cpp.
Yeah it confirms positive for gibberish, unfortunately.
-
Oh wow, there they are, popping up at HF now:
https://huggingface.co/models?dataset=dataset:allenai%2FMADLAD-400&sort=created 
-
OK just got aya-101 working!
The catch is that you have to quantize it yourself.
I don't know what exactly is wrong with the kcoopermiller ones, but they just refused to do any useful work for me.
My newly quantized Q2K GGUF is also a lot larger for some reason.
(EDIT: It seems a lot of the tensors got quantized into Q3K, so that explains the size difference).
I wanted to test quantizing a large model with meager resources and this was as good a candidate as any.
So in the end I got the answer I was looking for: Yes, you can quantize 13B models even if all you have is a C2D and 4GB of memory :D
The process took a couple of hours (and it ate over 100GB of hd, nom nom nom).
(Of course, "large" is relative...in the era of 405B this is peanuts really :)
build/bin/llama-cli -m models/aya-101.Q2_K.gguf -p 'Translate to Finnish: I wanted to test quantizing a large model with meager resources and this was as good a candidate as any.'
Minä halusin testata suurta mallia vähäisillä resursseilla ja tämä oli erinomainen ehdokas. [end of text]
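In case it helps, the "quantize it yourself" route can be sketched roughly as below. This is only an outline under assumptions: an f16 intermediate file, the build/bin layout from the command above, and the helper names missing_tokenizer_files and build_commands, which are invented for this example. The check for spiece.model is there because the conversion script needs that file for T5-style models.

```python
from pathlib import Path


def missing_tokenizer_files(model_dir, required=("spiece.model",)):
    """Return the required tokenizer files that are absent from model_dir."""
    return [name for name in required if not (Path(model_dir) / name).is_file()]


def build_commands(model_dir, out_prefix, quant="Q2_K"):
    """Build the convert and quantize command lines (they are not run here)."""
    f16 = f"{out_prefix}.f16.gguf"          # assumed intermediate file name
    quantized = f"{out_prefix}.{quant}.gguf"
    convert = ["python", "convert_hf_to_gguf.py", str(model_dir),
               "--outfile", f16, "--outtype", "f16"]
    quantize = ["build/bin/llama-quantize", f16, quantized, quant]
    return convert, quantize
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once `missing_tokenizer_files(model_dir)` comes back empty.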
-
aya-101 is missing the spiece.model file, which is needed to convert it. I copied the one from mt5-xxl, which enabled the conversion to work, and created an IQ4_XS quant.
bash-5.1$ lm "translate to finnish: I wanted to test quantizing a large model with meager resources and this was as good a candidate as any."
Halusin testata suurten mallien kvantifiointia vähäisillä resursseilla ja tämä oli yhtä hyvä kandidaatti kuin mikä tahansa.
bash-5.1$ lm "translate to english: Halusin testata suurten mallien kvantifiointia vähäisillä resursseilla ja tämä oli yhtä hyvä kandidaatti kuin mikä tahansa."
I wanted to test the quantification of large models with limited resources and this was as good a candidate as any.
bash-5.1$ lm "translate to english: Minä halusin testata suurta mallia vähäisillä resursseilla ja tämä oli erinomainen ehdokas"
I wanted to test a large model with limited resources and this was a perfect candidate.
The model is pretty dumb, looks mainly useful for translations:
lm "Answer the following yes/no question by reasoning step-by-step. Could a dandelion suffer from hepatitis?"
Dandelion is a species of flowering plant. Hepatitis is a disease caused by hepatitis B virus. Therefore, the final answer is yes.
I translated the question to German with madlad400 and got the same answer:
bash-5.1$ lm "Beantworten Sie die folgende Ja/Nein-Frage schrittweise: Könnte ein Löwenzahn an Hepatitis leiden?"
Hepatitis ist eine Viruserkrankung, die durch eine Infektion mit Hepatitis A verursacht wird. Löwenzahn ist ein Gemüse. Die endgültige Antwort lautet also ja.
-
Gotta agree on dumb :D
Also prone to looping but maybe I'm just not holding it right.
IQ4_XS, you say? I wonder how that imatrix thing is handled in these multilingual models.
EDIT: Oh I see, so IQ4_XS doesn't mandate imatrix. And you can actually use imatrix with other Q variants too...
Btw in case anyone's wondering, yes you can run this on said C2D/4GB machine. Well, it's more of a crawl though.
EDIT: Thanks (see the reply below)! --repeat-penalty 2.0 and leveling up to IQ4_XS mitigated the looping problem, but not all the way.
-
> Gotta agree on dumb :D Also prone to looping but maybe I'm just not holding it right.
> IQ4_XS, you say? I wonder how that imatrix thing is handled in these multilingual models.
I don't use imatrix, and I haven't seen it loop. I use greedy sampling (temp=0, rep=1) and see no problems. I had to use madlad400 to translate the question, because aya-101 obstinately tries to answer the question in the source language instead of translating it, no matter how I prompt it.
-
I tried this: https://huggingface.co/Eddishockwave/madlad400-10b-mt-Q8_0-GGUF
It works and produces quite good results.
-
I got it working with llama-cli -p '<2pt> What are you doing?' but not with llama-cli -i and not with llama-server's /completions.
I've noticed that llama-server is setting a chat template and llama-cli isn't.
In interactive mode, it doesn't return anything and in API mode it returns "content": "?"
Does anybody know why? I'm using code from the latest git.
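For reference, the one-shot form that did work can be put together like this. It is a small sketch only: the target-language tag format is taken from the '<2pt> ...' prompt above, the greedy settings (temp 0, repeat penalty 1) are the ones reported earlier in the thread, and madlad_prompt and llama_cli_args are hypothetical helper names.

```python
def madlad_prompt(target_lang, text):
    """Prefix the text with a MADLAD-style target language tag, e.g. <2pt>."""
    return f"<2{target_lang}> {text}"


def llama_cli_args(model_path, prompt, temp=0.0, repeat_penalty=1.0):
    """Build a non-interactive llama-cli command line (it is not run here)."""
    return ["llama-cli", "-m", model_path, "-p", prompt,
            "--temp", str(temp), "--repeat-penalty", str(repeat_penalty)]
```

So `llama_cli_args("model.gguf", madlad_prompt("pt", "What are you doing?"))` gives an argument list suitable for `subprocess.run`.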
-
There is still an issue with it. I am investigating.
-
I made the following test:
import json
import requests

LLAMASERVER_URI = 'http://127.0.0.1:8080/'
LLAMASERVER_TEMPLATE_ENDPOINT = LLAMASERVER_URI + 'apply-template'
payload = {
    'messages': [
        {"role": "system", "content": "<2es>"},
        {"role": "user", "content": "String to translate"}
    ]
}
# Ask the server to render the chat template without running the model.
with requests.post(LLAMASERVER_TEMPLATE_ENDPOINT, json=payload, stream=False) as response:
    if response.status_code == 200:
        document = json.loads(response.text)['prompt']
        print(document)
...and obtained this:
srv log_server_r: request: POST /apply-template 127.0.0.1 200
<|im_start|>system
<2es><|im_end|>
<|im_start|>user
String to translate<|im_end|>
<|im_start|>assistant
So, there is a default template. If there is a way to turn it off, the model should work.
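To make the difference concrete, here is a small sketch that mimics the ChatML-style wrapping seen in the output above and compares it with the bare prompt the model expects. The function chatml_wrap is a hypothetical stand-in written for this example, not llama-server's own code, and the trailing newline after the assistant tag is an assumption.

```python
def chatml_wrap(messages):
    """Mimic the ChatML-style wrapping shown in the server output above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt for the reply
    return "".join(parts)


messages = [
    {"role": "system", "content": "<2es>"},
    {"role": "user", "content": "String to translate"},
]
templated = chatml_wrap(messages)

# What the translation model was prompted with in the working llama-cli case:
bare = "<2es> String to translate"
```

The templated string no longer starts with the language tag, and it carries several marker strings that a translation-only model was presumably never trained on.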
-
It also produces a CUDA error when the prompt (text) fills more than ~45% of the context length (Windows x64, CUDA 12.1). MADLAD support still seems rough for now.
-
@misutoneko Hi, sir. Could you complete the example in llama.swiftui using T5? I tried using the madlad400 model in Swift, but I got an error: llama.cpp:15664: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed. The code in llama.cpp is quite complex for me.
-
No, sorry. I've never used swift for anything so no clue on that. Does it work with llama-cli?
The error message suggests that the call to llama_encode() failed somehow.
You could try adding some logging and compare with what llama-cli does (examples/main/main.cpp).
-
> No, sorry. I've never used swift for anything so no clue on that. Does it work with llama-cli?
> The error message suggests that the call to llama_encode() failed somehow. You could try adding some logging and compare with what llama-cli does (examples/main/main.cpp).
Fortunately, I found similar code in batched.cpp and attempted to submit a PR. Thank you!