Roadmap May 2023 #1220
-
High-prio
- 
Refactoring pass
There is a lot of code duplication in ggml.c, which can probably be simplified with a good set of macros. The goal is to keep the code size manageable while avoiding "macro hell".
- 
Optimize the AVX / AVX2 implementations of the quantization methods and add WASM SIMD
Make sure we have optimal implementations for these instruction sets.
- 
Apply the new integer quantization methods to whisper.cpp
Will backport the latest ggml version to whisper.cpp and add support for quantized models. Will also update all WASM examples to be able to run with the quantized models.
Update: whisper.cpp v1.4.0 has been released. It includes integer quantization and GPU support via cuBLAS - all thanks to the great work done here.
- 
Add support for "batch inference"
Recently, the bert.cpp project (by @skeskinen) demonstrated BERT inference using ggml. This model gains a lot from batch inference, which is currently not supported by ggml. We will extend all operators to support it. The bert.cpp example will serve as a playground to achieve this.
Update: batched forward passes have been demonstrated in the baby-llama example (thanks to @xaedes, see "Implement backward passes for llama with small training llama from scratch example" #1360). It would be great to apply the demonstrated approach to bert.cpp and to whisper.cpp's beam-search decoding in order to gain extra speed-up.
- 
Implement inference of new models
There are already some very interesting models that should be supported by ggml:
- 💫 StarCoder
- Segment Anything Model (SAM)
- Bark (text-to-speech)
 There is huge interest in adding ggml support for this model (see "speeding up inference" suno-ai/bark#30 (comment)).
 The main blocker seems to be the dependency on Facebook's EnCodec codec. Still not sure how difficult it would be, but this codec is probably another model that we should try to support via ggml.
I'll use this section to add a note regarding new model implementations by contributors: I recommend always trying to add a very basic example implementation to the ggml repo. Having a basic example there would make long-term support much easier.
- 
Proof-of-concept for 100% inference on the GPU
The goal is to make a demonstration of the idea discussed in "Add GPU support to ggml" #914.
 Very preliminary work has been started in "ggml : cgraph export/import/eval example + GPU support" ggml#108.
 Will try to get a working example using the MNIST inference.
Update: The MNIST inference on Apple Silicon GPU using Metal is now fully demonstrated: "ggml : cgraph export/import/eval example + GPU support" ggml#108 -- this is the way.
Low-prio
- 
Project ggml : improve threading implementation
Better utilization of the available CPU resources via improved thread management.
 There were a few efforts during April, but they remained in the "background" - need to put more focus on this now.
- 
Having second thoughts about adding llama_state
Our experience is that we added whisper_state in whisper.cpp with this PR: "Added whisper state + default state on the whisper_context" whisper.cpp#523. However, I haven't seen a lot of use of it. At the same time, it doubled the C API.
 Let me know if you think this is worth implementing.
 ref: "IMPORTANT: Introduce C-style API - Major Refactoring" #370 (comment)
- 
Add 3-bit integer quantization
It has been shown that 2-bit integer quantization does not look really useful for anything: "Q2 and Q3 quantization" #1004. Given that, we can probably add 3-bit integer quantization instead. Probably not "officially supported", but rather in a state where we can run experiments and see if we can find some application.
- 
There is an interesting ongoing effort to add "training" support to ggml: "How to fine tune it?" ggml#8 (comment)
It would be really impressive if this actually works. There might be conflicts with the refactoring pass - need to coordinate with @xaedes.
Update: this has been successfully completed and there is now a simple example demonstrating baby-LLaMA training: https://github.com/ggerganov/llama.cpp/blob/master/examples/baby-llama/baby-llama.cpp#L759-L770
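As a rough illustration of the 3-bit integer quantization item above, here is a minimal block-wise sketch. It is not ggml's actual format: the block size `QK`, the struct `block_q3_demo` and both function names are hypothetical, the levels are kept one per byte instead of being packed into 3 bits, and only 7 of the 8 available levels are used (a symmetric range of -3..3) to keep the rounding simple.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  /* block size; an assumption made for this sketch */

/* Hypothetical block layout: one float scale plus QK signed levels in [-3, 3].
 * Levels are stored one per byte for clarity; a real format would pack them. */
struct block_q3_demo {
    float  d;      /* scale: x is approximated by d * q */
    int8_t q[QK];  /* quantized levels */
};

/* Quantize QK floats using the largest magnitude in the block as the range. */
void quantize_block_q3_demo(const float *x, struct block_q3_demo *out) {
    assert(x && out);
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    out->d = amax / 3.0f;
    const float id = out->d != 0.0f ? 1.0f / out->d : 0.0f;
    for (int i = 0; i < QK; i++) {
        const float t = x[i] * id;
        int v = (int)(t >= 0.0f ? t + 0.5f : t - 0.5f);  /* round to nearest */
        if (v >  3) v =  3;
        if (v < -3) v = -3;
        out->q[i] = (int8_t)v;
    }
}

/* Reconstruct the approximate floats from a quantized block. */
void dequantize_block_q3_demo(const struct block_q3_demo *in, float *y) {
    assert(in && y);
    for (int i = 0; i < QK; i++) {
        y[i] = in->d * (float)in->q[i];
    }
}
```

With this scheme the reconstruction error per value is at most about half the block scale `d`, which is the kind of number one would look at when running experiments.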
Replies: 3 comments 11 replies
-
What do you mean?
-
Likely meaning stablelm
-
No, stablelm is supported (by ggml, see repo), but its quality is underwhelming.
Still, latent diffusion models would be sick.
-
If I'm understanding the idea of llama_state correctly (that it will allow multiple "inference threads" from a single loaded model), then it definitely seems worth implementing, since it opens up a lot of possibilities.
Is the idea that we can get a lot of the same gains by just quickly swapping out stored contexts? A lot of LLM applications benefit from having multiple instances that can build on one another, or different instances that receive diverse queries.
I haven't started using it yet, because I've been waiting for someone to post an example. It's a bit hard for me to parse the API.
-
The right way to do it is like we do it in whisper.cpp:
https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h
If someone wants to try implementing it here, feel free.
-
Should every public API function have both an implicit version and an explicit _with_state / _from_state version, though? That seems pretty verbose.
Granted, the implicit version could probably just forward to the explicit version.
Still, does that mean that if one chooses the explicit state version, the context would still contain an unused state?
-
I agree that it is a bit over-verbose, but we didn't see a better way. Open to suggestions.
"Still, does that mean if one choose the explicit state version, the context would still contain an unused state?"
There are init calls that explicitly do not create an internal state:
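A minimal sketch of the pattern being discussed, assuming hypothetical `toy_*` names rather than the real whisper.cpp functions: a default init that creates an internal state, a "no state" init that does not, and an implicit call that simply forwards to the explicit `_with_state` version.

```c
#include <assert.h>
#include <stdlib.h>

/* All names here (toy_context, toy_state, toy_*) are hypothetical. */
struct toy_state {
    int n_past;                /* mutable per-inference data */
};

struct toy_context {
    int n_vocab;               /* stands in for the shared, read-only model data */
    struct toy_state *state;   /* default state; NULL for a "no state" context */
};

/* Allocate a fresh, zeroed state that the caller owns. */
struct toy_state *toy_init_state(void) {
    return calloc(1, sizeof(struct toy_state));
}

/* Init variant that does NOT create an internal state. */
struct toy_context *toy_init_no_state(int n_vocab) {
    struct toy_context *ctx = malloc(sizeof *ctx);
    if (!ctx) return NULL;
    ctx->n_vocab = n_vocab;
    ctx->state = NULL;
    return ctx;
}

/* Default init: same as above, plus an internal default state. */
struct toy_context *toy_init(int n_vocab) {
    struct toy_context *ctx = toy_init_no_state(n_vocab);
    if (!ctx) return NULL;
    ctx->state = toy_init_state();
    if (!ctx->state) { free(ctx); return NULL; }
    return ctx;
}

/* Explicit version: everything mutable lives in the caller-provided state. */
int toy_eval_with_state(const struct toy_context *ctx, struct toy_state *state, int n_tokens) {
    assert(ctx && state);
    state->n_past += n_tokens;
    return state->n_past;
}

/* Implicit version: forwards to the explicit one using the default state. */
int toy_eval(struct toy_context *ctx, int n_tokens) {
    assert(ctx && ctx->state);
    return toy_eval_with_state(ctx, ctx->state, n_tokens);
}

void toy_free_state(struct toy_state *state) { free(state); }

void toy_free(struct toy_context *ctx) {
    if (!ctx) return;
    free(ctx->state);          /* free(NULL) is a no-op for "no state" contexts */
    free(ctx);
}
```

In this sketch a caller who only ever uses the `_with_state` calls would create the context with `toy_init_no_state`, so no unused default state is allocated.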
-
I suggest using the llama_context as the llama_state. To me that seems semantically correct and the most logical approach. It is fully backwards compatible and requires no changes from existing public API users. It's also very simple: only a few internal changes, plus a few additions to the public API that matter only to those who want to use this feature.
I created a pull request here: #1797 (comment)
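To illustrate the idea (not the code in the pull request), here is a small sketch in which the model is loaded once and each context carries its own mutable data, including its own RNG. All `demo_*` names are hypothetical, and the toy LCG merely stands in for whatever generator the real sampling code uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* All names here (demo_model, demo_context, demo_*) are hypothetical. */
struct demo_model {
    int n_vocab;   /* stands in for the read-only weights */
    int n_refs;    /* number of contexts currently sharing this model */
};

struct demo_context {
    struct demo_model *model;  /* shared between contexts */
    int      n_past;           /* per-context evaluation position */
    uint32_t rng;              /* per-context RNG state used by sampling */
};

struct demo_model *demo_load_model(int n_vocab) {
    assert(n_vocab > 0);
    struct demo_model *m = malloc(sizeof *m);
    if (!m) return NULL;
    m->n_vocab = n_vocab;
    m->n_refs  = 0;
    return m;
}

/* Create an additional context on top of an already loaded model. */
struct demo_context *demo_new_context(struct demo_model *model, uint32_t seed) {
    assert(model);
    struct demo_context *ctx = malloc(sizeof *ctx);
    if (!ctx) return NULL;
    ctx->model  = model;
    ctx->n_past = 0;
    ctx->rng    = seed;
    model->n_refs++;
    return ctx;
}

/* The sampling signature only needs the context: it draws from ctx->rng. */
int demo_sample(struct demo_context *ctx) {
    assert(ctx);
    ctx->rng = (uint32_t)(ctx->rng * 1664525u + 1013904223u);  /* toy LCG step */
    ctx->n_past++;
    return (int)((ctx->rng >> 16) % (uint32_t)ctx->model->n_vocab);
}

void demo_free_context(struct demo_context *ctx) {
    if (!ctx) return;
    ctx->model->n_refs--;
    free(ctx);
}

/* The model may only be freed once no context refers to it. */
void demo_free_model(struct demo_model *model) {
    assert(!model || model->n_refs == 0);
    free(model);
}
```

Because the per-inference data lives in the context, existing functions that take a context keep their signatures, and only the calls for loading a model and creating extra contexts are new.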
-
I just tried to create a llama_state but the duplication in sampling is too much.
All sampling functions rely on ctx->rng.
It would also be a constant duplication factor for every new public API function going forward.
@didzis's change is great: just 2 new public functions.
-
Do we have weight/model support for EnCodec? If yes, could you please tell me how to build the ggml-model.bin?
Thank you, I appreciate your help and time.