Roadmap May 2023 #1220

Apr 28, 2023 · 3 comments · 11 replies

High-prio

  • Refactoring pass

    There is a lot of code duplication in ggml.c that can probably be simplified with a good set of macros. The goal is to keep the code size manageable while avoiding "macro hell" (see the macro sketch at the end of this list).

  • Optimize the AVX / AVX2 implementations of the quantization methods and add WASM SIMD

    Make sure we have optimal implementations for these instruction sets (a scalar sketch of what these routines compute is included at the end of this list).

  • Apply the new integer quantization methods to whisper.cpp

    Will backport the latest ggml version to whisper.cpp and add support for quantized models. Will also update all WASM examples to be able to run with the quantized models

    Update: whisper.cpp v1.4.0 has been released. It includes integer quantization and GPU support via cuBLAS - all thanks to the great work done here

  • Add support for "batch inference"

    Recently, the bert.cpp project (by @skeskinen) demonstrated BERT inference using ggml. This model gains a lot from batch inference, which is currently not supported by ggml. We will extend all operators to support it (see the batched evaluation sketch at the end of this list). The bert.cpp example will serve as a playground to achieve this

    Update: batched forward passes have been demonstrated in the baby-llama example (thanks to @xaedes Implement backward passes for llama with small training llama from scratch example #1360 ). It will be great to apply the demonstrated approach to bert.cpp and whisper.cpp's beam-search decoding in order to gain extra speed-up

  • Implement inference of new models

    There are already some very interesting models that should be supported by ggml:

    I'll use this section to add a note regarding new model implementations by contributors - I recommend always trying to add a very basic example implementation to the ggml repo. Having a basic example there would make long-term support much easier

  • Proof-of-concept for 100% inference on the GPU

    The goal is to make a demonstration of the idea discussed in Add GPU support to ggml #914.
    Very preliminary work has been started in ggml : cgraph export/import/eval example + GPU support ggml#108.
    Will try to get a working example using MNIST inference.

    Update: The MNIST inference on Apple Silicon GPU using Metal is now fully demonstrated: ggml : cgraph export/import/eval example + GPU support ggml#108 -- this is the way
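
For the refactoring item above, here is a rough, hypothetical illustration of the idea (these are not ggml's actual macros): many operators differ only in the element type and the scalar kernel, so a macro can stamp out the per-type variants while the kernel body lives in one place.

```c
#include <stddef.h>

// Hypothetical helper, not from ggml.c: generate one compute function per
// element type from a single kernel expression.
#define DEFINE_UNARY_OP(name, type, expr)                          \
    static void compute_##name##_##type(const type * src,          \
                                         type * dst, size_t n) {   \
        for (size_t i = 0; i < n; ++i) {                           \
            const type x = src[i];                                 \
            dst[i] = (expr);                                       \
        }                                                          \
    }

// three functions generated from one body
DEFINE_UNARY_OP(relu,   float,  x > 0 ? x : 0)
DEFINE_UNARY_OP(relu,   double, x > 0 ? x : 0)
DEFINE_UNARY_OP(scale2, float,  2*x)

#undef DEFINE_UNARY_OP
```

Whether this stays readable or slides into "macro hell" mostly depends on keeping the macro parameters few and the generated bodies trivial.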
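
For the quantization item, a scalar sketch of roughly what a Q4_0-style quantization routine computes, using approximately the block layout ggml had at the time (the exact struct, scale type, and nibble packing have changed since). The AVX / AVX2 and WASM SIMD work is about vectorizing loops like this one.

```c
#include <math.h>
#include <stdint.h>

#define QK 32  // weights per quantization block (ggml's value at the time)

// one block: a single scale plus 32 4-bit values packed two per byte
typedef struct {
    float   d;         // scale
    uint8_t qs[QK/2];  // 4-bit quants
} block_q4_0;

// scalar reference: quantize k floats (k must be a multiple of QK)
static void quantize_row_q4_0_ref(const float * x, block_q4_0 * y, int k) {
    const int nb = k / QK;
    for (int i = 0; i < nb; i++) {
        // find the largest magnitude in the block
        float amax = 0.0f;
        for (int l = 0; l < QK; l++) {
            const float v = fabsf(x[i*QK + l]);
            if (v > amax) amax = v;
        }

        const float d  = amax / 7.0f;              // map [-amax, amax] onto [-7, 7]
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        y[i].d = d;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t v0 = (uint8_t)((int8_t)roundf(x[i*QK + l + 0]*id) + 8);
            const uint8_t v1 = (uint8_t)((int8_t)roundf(x[i*QK + l + 1]*id) + 8);
            y[i].qs[l/2] = v0 | (v1 << 4);         // two 4-bit values per byte
        }
    }
}
```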
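
For the batch inference item, a minimal sketch of the idea using the ggml C API roughly as it looked at the time (ggml_build_forward / ggml_graph_compute have changed since, so treat the calls below as illustrative): instead of one matrix-vector product per token, the activations for a whole batch of tokens are stored in a single 2D tensor and pushed through one mul_mat. The sizes are placeholders.

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    const int n_embd   = 8;   // placeholder sizes, for illustration only
    const int n_out    = 4;
    const int n_tokens = 16;  // the "batch": all tokens evaluated in one call

    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // weight matrix and a 2D activation tensor holding the whole batch
    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_out);
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);

    ggml_set_f32(W, 0.1f);
    ggml_set_f32(x, 1.0f);

    // one matrix-matrix product instead of n_tokens matrix-vector products
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);  // shape: [n_out, n_tokens]

    struct ggml_cgraph gf = ggml_build_forward(y);
    gf.n_threads = 1;               // in the old API the thread count lived on the graph
    ggml_graph_compute(ctx, &gf);

    printf("y[0] = %f\n", ggml_get_f32_1d(y, 0));

    ggml_free(ctx);
    return 0;
}
```

Extending "all operators" to batches essentially means making each op accept (or broadcast over) that extra dimension.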

Low-prio

Replies: 3 comments · 11 replies

What do you mean?

Likely meaning stablelm

No, stablelm is supported (by ggml, see the repo), but its quality is underwhelming...
Still, latent diffusion models would be sick.

If I'm understanding the idea of llama_state correctly (that it would allow multiple "inference threads" from a single loaded model), then it definitely seems worth implementing, since it opens up a lot of possibilities.

Is the idea that we can get a lot of the same gains by just quickly swapping out stored contexts? A lot of LLM applications benefit from having multiple instances that can build on one another, or different instances that receive diverse queries.

I haven't started using it yet, because I've been waiting for someone to post an example. It's a bit hard for me to parse the API.

ggerganov (Maintainer, Author) · May 20, 2023

The right way to do it is like we do it in whisper.cpp:

https://github.com/ggerganov/whisper.cpp/blob/master/whisper.h

If someone wants to give it a try at implementing it here
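
To illustrate the whisper.cpp pattern referred to above, here is a rough usage sketch following the declarations in whisper.h (check the header for exact signatures; the model filename and the source of the audio samples are placeholders): the heavy, read-only model lives in the context, and each independent stream of inference gets its own state.

```c
#include <stdio.h>
#include "whisper.h"

// assumes `samples`/`n_samples` hold 16 kHz mono float PCM obtained elsewhere
int transcribe(const float * samples, int n_samples) {
    // load the (large, read-only) model once, without an internal state
    struct whisper_context * ctx = whisper_init_from_file_no_state("ggml-base.en.bin");
    if (!ctx) return 1;

    // per-"inference thread" state; several of these can share one ctx
    struct whisper_state * state = whisper_init_state(ctx);

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

    if (whisper_full_with_state(ctx, state, wparams, samples, n_samples) != 0) {
        return 1;
    }

    const int n_seg = whisper_full_n_segments_from_state(state);
    for (int i = 0; i < n_seg; ++i) {
        printf("%s\n", whisper_full_get_segment_text_from_state(state, i));
    }

    whisper_free_state(state);
    whisper_free(ctx);
    return 0;
}
```

Several states can then be driven independently (for example from different threads) against the same context, which is the property the llama_state proposal is after.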

Should every public API have an implicit and an explicit _with_state & _from_state version, though? That seems pretty verbose.
Granted, the implicit version could probably just forward to the explicit one.

Still, does that mean that if one chooses the explicit state version, the context would still contain an unused state?

ggerganov (Maintainer, Author) · Jun 10, 2023

I agree that it is a bit over-verbose, but we didn't see a better way. Open to suggestions.

> Still, does that mean that if one chooses the explicit state version, the context would still contain an unused state?

There are init calls that explicitly do not create an internal state:

https://github.com/ggerganov/whisper.cpp/blob/57543c169e27312e7546d07ed0d8c6eb806ebc36/whisper.h#L109

I suggest using the llama_context as the llama_state. To me that seems semantically correct and the most logical approach, and it is fully backwards compatible, requiring no changes for existing public API users. It's also very simple, with only a few changes internally and a few additions to the public API, just for those who would like to use this feature.

I created a pull request here: #1797 (comment)

I just tried to create a llama_state, but the duplication in sampling is too much:
all sampling functions rely on ctx->rng.

It's also a constant duplication factor for all new public APIs going forward.

@didzis's change is great: just 2 new public functions.

Do we have model weight support for Encodec? If yes, could you please tell me how to build the ggml-model.bin?
Thank you, I appreciate your help and time.
