Inference at the edge #205

Mar 16, 2023 · 24 comments · 35 replies
Discussion options

Inference at the edge

Based on the positive responses to whisper.cpp and, more recently, llama.cpp, it looks like there is a strong and growing interest in doing efficient transformer model inference on-device (i.e. at the edge).


Over the past few days, I have received a large number of requests and e-mails with various ideas for startups, projects, and collaborations. This makes me confident that there is something of value in these little projects. It would be foolish to let this existing momentum go to waste.

Recently, I've also been seeing some very good ideas and code contributions by many developers:

  • @ameenba suggested a way to improve Whisper Encoder evaluation at the cost of accuracy #137. Ultimately, this allowed demonstrating semi-efficient short voice-command recognition on a Raspberry Pi 4 (Twitter)
  • @wangchou et al. demonstrated how to evaluate the Whisper Encoder efficiently using the Apple Neural Engine #548
  • A person on Twitter showed me how to improve llama.cpp efficiency by 10% with a simple SIMD change (Twitter)
  • @Const-me kindly provided efficient AVX2 quantization routines #27 (a scalar sketch of this kind of routine follows this list)
  • and many other examples
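
For context, here is a minimal scalar sketch of the kind of 4-bit block quantization such routines deal with. This is an illustrative toy, assuming a block size of 32 and a simple absmax scale; the struct layout and function names are made up, and the actual #27 code does the equivalent work with AVX2 intrinsics.

// Toy 4-bit block quantization: 32 floats share one scale, each value is
// stored as a 4-bit code (16 levels). Illustrative only - not the ggml layout.
#include <math.h>
#include <stdint.h>

#define QBLOCK 32

typedef struct {
    float   scale;             // dequantization factor for this block
    uint8_t q[QBLOCK / 2];     // two 4-bit codes packed per byte
} q4_block;

void quantize_block(const float *x, q4_block *out) {
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        const float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 7.0f;
    const float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i += 2) {
        const int q0 = (int)roundf(x[i]     * inv) + 8;   // bias -7..7 into 1..15
        const int q1 = (int)roundf(x[i + 1] * inv) + 8;
        out->q[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
}

void dequantize_block(const q4_block *in, float *y) {
    for (int i = 0; i < QBLOCK; i += 2) {
        y[i]     = ((in->q[i / 2] & 0x0F) - 8) * in->scale;
        y[i + 1] = ((in->q[i / 2] >>   4) - 8) * in->scale;
    }
}

The SIMD versions vectorize exactly these kinds of loops in the hot matrix-multiplication paths.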

The AI field currently presents a wide range of cool things to do. Not all of them (probably most) are really useful, but still - fun and cool. And I think a lot of people like to work on fun and cool projects (for now, we can leave the "useful" projects to the big corps :)). From chat bots that can listen and talk in your browser, to editing code with your voice, or even running 7B models on a mobile device - the ideas are endless, and I personally have many of them. Bringing those ideas from the cloud to the device, into the hands of the users, is exciting!

Naturally, I am thinking about ways to build on top of all this. So here are a few thoughts that I have so far:

  • This project will remain open-source
  • I would like to explore the application of this approach to other models and build more examples to demonstrate it
  • I would be really happy to see developers join in and help advance further the idea of "inference at the edge"
  • The strongest points of the current codebase are its simplicity and efficiency. Performance is essential
  • It's too early to build a full-fledged edge inference framework. The code has to remain simple and compact in order to allow for quick and easy modifications. This helps to explore ideas at a much higher rate. Bloating the software with the ideas of today will make it useless tomorrow
  • The AI models are improving at a very high rate and it is important to stay on top of it. The transformer architecture at its core is very simple. There is no need to "slap" complex things on top of it
  • Hacking small tools and examples is a great way to drive innovation. We should not get lost in software engineering problems. Especially at the beginning, the goal is to prototype and not waste time polishing products
  • And most of all, it's important to have fun in the process!

I hope that you share the hacking spirit that I have and would love to hear your ideas and comments about how you see the future of "inference at the edge".

Edit: "on the edge" -> "at the edge"


Replies: 24 comments 35 replies

Comment options

Kudos on all the incredible work you and the community did!
It's incredibly important to have an open-source, self-hosted version of some of this progress.

1 reply
Comment options

you are a bamf

Comment options

Thanks for your amazing work with both whisper.cpp, and llama.cpp. I've been hugely inspired by your contributions (especially with Whisper while working on Buzz), and I share your excitement about all the amazing possibilities of on-device inference.

Personally, I've been thinking a lot about multi-modal on-device inference, like an assistive device that can both capture image data and respond to voice and text commands offline, and I plan to hack something together soon. In any case, thanks again for all your work. I really do hope to continue contributing to your projects (and keep improving my C++ :)) and learning more from you in the future.

— Chidi

2 replies
Comment options

I can only second @chidiwilliams here: I am incredibly grateful for the work that @ggerganov has done with both whisper.cpp and llama.cpp. I believe it has really put the power of these models into the hands of users, and that the future is in on-device inference at the edge. We have been very inspired by the work and wanted to make it even more user-friendly with WhisperScript, so that even non-programmers can benefit from the advancements in this tech. Thanks for the work and looking forward to learning even more from you!

Comment options

We now also have a Windows version available, if anyone is looking for a simple, clean UI to run Whisper on their Windows desktop: Windows WhisperScript UI

Comment options

"This project will remain open-source" ♥️
You hear that, OpenAI?

1 reply
Comment options

The CorpTM's firewall of cash blocked it, but if we work back from the edge we should be able to get through and get some of that sweet, sweet compute. If attention is all it needs, it's definitely got it now (but I jest ⚖️).

Comment options

As someone involved in a number of "inference at the edge" projects which are feature-focused, the feature all of them truly needed was performance! Now, with llama.cpp and its forks, these projects can reach audiences they never could before.

These are good thoughts.

Thank you Georgi, and thank you everyone else who has contributed and is contributing no matter how big or small.

0 replies
Comment options

This core/edge thing appears to have a lot of potential, and it should at least be made very transparent by all. I.e., what exactly runs at the edge in the current commercial products? That way we know how to plan our resources and don't get "frustrated" when, e.g., Microsoft Edge crashes on some of them, or when the claims don't hold because the edge is not at the expected level.

Before more countries embark on building AI capabilities (UK?), maybe one should get that clear. Will there be a national core or a national edge?

1 reply
Comment options

I guess it depends on the end goal but in my mind it would be some sort of DMZ at first and then everyone is invited in and accessible through P2P Mesh style. Very much like the internet runs but more open to get the purest prima materia. Money, copyright and individual rights/freedoms are the biggest obstacles I guess. Apparently the transformers are driven by attention which is probably good. I like to pretend they would ultimately use creativity to overcome formality but who knows.

Comment options

This is the true spirit of hacking technologies from the bottom up! 🥷💻

This project will remain open-source

The code has to remain simple and compact in order to allow for quick and easy modifications

Keep doing the good work! 🚀🤘

0 replies
Comment options

LLaMA.cpp works shockingly well. You've proved that we don't need 16 bits, only 16 levels! Thanks so much for what you're doing; it's because of people like you that open source / community / collaborative AI will catch up with the strongest commercial AI, or at least contribute substantially and remain valuable. I've noticed that open-source software is generally of much higher quality (for security, dev, and research purposes at least).

0 replies
Comment options

I really appreciate the incredible efforts you've put into this project. It's wonderful to know that you intend to keep it open-source and are considering the addition of more models. One particularly intriguing model is FLAN-UL2 20B. Though its MMLU performance is inferior to LLaMA 65B, it has already been instruction fine-tuned and comes with an Apache 2.0 license. This could potentially enable a lot of interesting real-world use cases, such as question answering across a collection of documents. It is, however, an encoder-decoder architecture and might be more work to get up and running.

2 replies
Comment options

+1 for Flan, but for the FLAN-T5 XXL version =)
One can compare them here.

Comment options

Flan-T5-XXL is also a great model, but I'm not convinced that it's better. Here is another HF space that offers direct side-by-side comparison.

Comment options

One important difference is that FLAN-UL2 20B has a 4x bigger context length, at 2048 tokens. That makes it more suitable for some use cases, such as in-context learning and retrieval-augmented generation.

7 replies
Comment options

So the int4 here requires only 6 GB of RAM? It seems to me that the tokenizer's vocabulary has mixed zh and en subwords.

Comment options

Yes, chatGLM was fine-tuned over a 1-trillion-token Chinese and English dialogue corpus, as well as with extensive RLHF in both languages. The base model, GLM-130B, was already impressive and largely overlooked prior to instruct fine-tuning.

Comment options

And if we look at the HF code here, it is using a SentencePiece tokenizer with

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
 "THUDM/chatglm-6b": 2048,
}

Hence, if I have understood it well, it should be easy to port it to GGML, shouldn't it?

Comment options

Looking forward to your results - a lot of people are waiting for the ggml version of chatGLM.

Comment options

It's an interesting model but licensed solely for non-commercial research purposes. Personally, I think it would be really nice to have a capable and truly open source model like FLAN-UL2 20B available in llama.cpp.

Comment options

CTranslate2 could be a good inspiration; it's a C++ inference engine for Transformer language models with a focus on machine translation. It performs very well on CPUs and uses hardware acceleration effectively.

Thanks for the great open-source work! I want to play around with llama.cpp this weekend.

1 reply
Comment options

ggerganov Mar 22, 2023
Maintainer Author

I agree - CTranslate2 seems to be very efficient and the authors have done a great job. Their faster-whisper implementation actually outperforms whisper.cpp significantly.

Comment options

@ggerganov About these AVX2 routines, I had an idea about possible micro-optimizations.
Take a look, the only changed functions are dotProductCompressed40 and compressRow40:
https://gist.github.com/Const-me/4d30e1fc767ab314596e16e90f53b6f4

1 reply
Comment options

ggerganov Mar 24, 2023
Maintainer Author

Thank you @Const-me - much appreciated as always!
I will soon be looking into finalizing the 4-bit ggml branch and will definitely look into integrating the proposed changes.

Comment options

I like your phrase "Bloating the software with the ideas of today will make it useless tomorrow."

0 replies
Comment options

11 replies
Comment options

The pinecone embeddings plugin was just a proof of concept. The author was trying to show that plugins can work with local models.

Pretty much nobody running llama has a need for Pinecone, yes. But the proof that we can do plugins just like ChatGPT can is cool, and we should make more which are actually useful.

Comment options

@MarkSchmidty Wait, do you mean that the plugins themselves could work completely outside the ChatGPT service, and have the cloud environment and API hosted locally? If that's true, then that is indeed awesome. I thought it was just an interface to the OpenAI API? Now I'm confused.

Comment options

I'm not against cloud services per se, and there are many which I like and use; what I am against is the tricks and attempts to lock users into their ecosystems. And as so many 'cloud service' providers aggressively move towards forced SaaS models, hearing the word 'cloud' already makes me slightly cringe. I do understand why it's happening - it makes very good business sense - but it doesn't make it any less scummy. If this really is something that works towards truly "Open" AI, and not an underhanded attempt to sell their services, then fantastic! You'd have to excuse my cynicism when it comes to the 'cloud' topic, even though not all of them are bad actors.

Comment options

The llama-retrieval-plugin does not contact or use OpenAI's servers in any way. It connects your locally running llama.cpp with one of a selection of vector database providers, including a local Redis vector database option. Vector databases don't really have many use cases for personal use. The reason this plugin was used for the proof of concept is that it's the only plugin OpenAI has open-sourced (so far).

Now that we know broadly how these plugins work and how to adapt them for local models, we should be able to reverse engineer more useful plugins like the web browser, code interpreter, Zapier, etc., to keep up with ChatGPT's latest improvements while not giving anything to OpenAI.
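
For anyone wondering what the vector-database half actually does, here is a toy, self-contained stand-in: store document embeddings, pick the one closest to the query embedding by cosine similarity, and hand that text to the model as context. The vectors and texts below are made up for illustration; a real setup would go through the plugin's API and a store such as Redis.

// Toy nearest-neighbour retrieval over made-up embeddings (illustration only).
#include <math.h>
#include <stdio.h>

#define DIM 4
#define NDOCS 3

static float cosine_sim(const float *a, const float *b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (int i = 0; i < DIM; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrtf(na) * sqrtf(nb) + 1e-9f);
}

int main(void) {
    const char *docs[NDOCS] = {
        "llama.cpp runs 4-bit LLaMA models on the CPU.",
        "whisper.cpp does on-device speech recognition.",
        "Vector databases store embeddings for retrieval.",
    };
    // Pretend embeddings; a real plugin would compute these with an embedding model.
    const float doc_emb[NDOCS][DIM] = {
        { 0.9f, 0.1f, 0.0f, 0.2f },
        { 0.1f, 0.8f, 0.3f, 0.0f },
        { 0.2f, 0.1f, 0.9f, 0.1f },
    };
    const float query_emb[DIM] = { 0.8f, 0.2f, 0.1f, 0.1f }; // "how do I run llama on a cpu?"

    int best = 0;
    for (int i = 1; i < NDOCS; i++) {
        if (cosine_sim(query_emb, doc_emb[i]) > cosine_sim(query_emb, doc_emb[best])) {
            best = i;
        }
    }
    printf("Retrieved context for the prompt: %s\n", docs[best]);
    return 0;
}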

Comment options

really?

Comment options

  • If this software is much faster at inference than other methods because of quantization, we can expect people who train models to train their models directly as quantized or easily quantizable ones, in order to be able to move them to the edge (or to cheaper data-centre inference hardware).

  • Therefore putting too many special case optimisations in the main body of code here is a bad idea, because we want model designers to have a clear common target for inference.

Btw, thank you very much for this code, which runs beautifully on my 10-year-old MacBook Pro. Edmund.

3 replies
Comment options

To be clear, most of the speed is coming from moving fewer bits between CPU and RAM due to smaller models. RAM bandwidth is the bottleneck for efficient CPU inference. Quantization, reducing bits per parameter, is just one way to reduce the memory footprint of a model. Other methods, such as pruning, remove entire parameters with little effect on quality. These can even be combined - and we've yet to explore that. It's quite likely we can fit nearly 50% more model into the same memory and get 50% higher speeds without any additional optimizations to the inference code.
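
To put rough numbers on this argument, here is a quick bandwidth-bound sketch; the bandwidth figure and the simplification that the weights are streamed from RAM roughly once per generated token are assumptions, not measurements.

// Upper bound on tokens/sec if generation is purely RAM-bandwidth bound.
#include <stdio.h>

int main(void) {
    const double params    = 7e9;     // 7B-parameter model
    const double bandwidth = 25.6e9;  // usable RAM bandwidth in bytes/sec (assumed)
    const double bits[]    = { 16.0, 8.0, 4.0, 3.0 };

    for (int i = 0; i < 4; i++) {
        const double bytes = params * bits[i] / 8.0;
        printf("%4.1f-bit: %5.2f GB of weights -> at most %5.1f tokens/sec\n",
               bits[i], bytes / 1e9, bandwidth / bytes);
    }
    return 0;
}

Halving the bits per parameter roughly doubles this ceiling, which is why quantization (and pruning) pay off so directly on CPUs.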

Comment options

@MarkSchmidty You are right at the theoretical level that the CPU<->RAM connection and the sequential nature of CPU computation are the ultimate bottlenecks. I'm not claiming to be a mathematician, nor to completely understand the intricacies of the calculations themselves, but I do understand optimization. And in that regard, while you are not wrong in saying that most of the speed comes from having to deal with less data, it's also a bit misleading, as the importance of fast and optimized code cannot be overstated - it is what lets you get even near the absolute performance limit of the hardware. To anyone claiming otherwise, I dare them to make an equally performant CPU implementation in something like .NET or Python. Exactly because you have to move so much data and perform tons of calculations, all the inefficiencies in code add up, and add up fast. And short of hand-optimizing in pure assembly, C is pretty much the best you can do to be as close to the bare metal as possible, utilising the cycles and bandwidth efficiently and not wasting them. @ggerganov has simply done an amazing job in making a lean and efficient codebase.

It's true that there might not be much more performance to gain by optimizing the inference code, since it is already well optimized, but then again I could be wrong. Modern processors rely heavily on branch prediction and caching, and there can still be performance to gain by reducing cache misses and wrong branch predictions. Unfortunately, that whole area is pretty much black magic / art, and without an expert background in processor design it is very hard to optimize at that very lowest silicon level.

Comment options

the importance of fast and optimized code cannot be overstated.

I didn't mean to imply otherwise. Georgi deserves all the credit. It's definitely possible there are memory-handling optimizations which could improve performance on the code side. But it would likely take a processor design expert to identify them, and there are very few of those in the world. Currently there's some obvious low-hanging fruit on the model-size end of things (sparsity, pruning, 3-bit, etc.).

Comment options

There are libraries for hardware-accelerated matrix operations on the Raspberry Pi: https://github.com/Idein/qmkl6

1 reply
Comment options

Did anyone test this yet? Seems like an easy thing to try.

Comment options

Very slow on my iPhone - hoping it improves.

1 reply
Comment options

I tried codellama 7b quantized and llama 3.2 3.7b (w/o quantization) models on an iPhone 16 Pro and got tremendous results. I wonder what the problem is in your case.

Comment options

This is a great effort and I hope it goes somewhere. Right now, the cost of running a model for inference on a GPU is prohibitive for most ideas, projects, and bootstrapping startups compared to just using the ChatGPT API. But once you are locked into that ecosystem, the per-token cost, which seems low at first, can increase exponentially. Plus, the LLaMA licensing is also ambiguous.

0 replies
Comment options

Thanks for the incredible work!

0 replies
Comment options


Hi,

I made a Flutter mobile app with your project - I am really interested in this idea!

The repo : Sherpa Github
Playstore link : Sherpa

I will soon update to the latest version of llama, but if there is an optimisation for low-end devices, I am extremely interested.

0 replies
Comment options

we will look back on this in 20 years, when telling the tales of cloud vs edge...

2 replies
Comment options

I believe more in giving power to everyone, rather than to a dictatorial cloud.

Comment options

Although I say this like a rant, the facts prove that it is easier to get up to the clouds than to come back down from them. With the improvement of computing power, what we lack is not large servers, but private and controllable edge software.

Comment options

I really appreciate your support for memory-mapped files. Having the ability to "cheat" and run models I shouldn't be able to due to your efficient programming is simply a miracle.
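
For anyone curious, a minimal POSIX sketch of what that memory mapping amounts to is below; the file name is a placeholder, and this is not the actual llama.cpp loader, just the basic mechanism. The OS pages the weights in lazily and can drop or share them, which is why models that look "too big" can still run.

// Map a model file read-only instead of copying it into allocated RAM.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "model.bin";   // placeholder file name
    const int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    void *data = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    printf("mapped %lld bytes at %p\n", (long long) st.st_size, data);

    // The pages behind 'data' are faulted in on demand as tensors are read.
    munmap(data, (size_t) st.st_size);
    close(fd);
    return 0;
}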

0 replies
Comment options

What do you think it is that happened, that people might consider that they "shouldn’t be able" to run large models on small hardware?
I'm old; in the 90s we did everything in under 64k - huge worlds, raytracers. There are still yearly competitions today around who can optimize code the best to run on small and slow systems.

1 reply
Comment options

Having the ability to "cheat" and run models I shouldn't be able to due to your efficient programming is simply a miracle.

what do you think it is that happened that people might consider that they "shouldn’t be able" to run large models on small hardware?

i’m old, in the 90s we did everything in under 64k ...

It seems the problem is two-pronged:

  • From the individual end, the "Conspicuous Consumption" syndrome causes people to consider their upper-middle-end builds as a baseline, rather than considering the baseline as a baseline. When these people go on to be software developers, we get today's loss of accessibility.
  • From the corporate end, it's ironically a lot more comfortable if software requires a lot of resources. If you "have" to drop 10,000ドル on hardware to run a piece of software, then suddenly paying 700ドル/mo for a corporate support plan becomes more reasonable by comparison. Or you could just pay Amazon 700ドル/mo for their specially licensed GPU cluster, which then looks like an amazing deal! Creating the synthetic need for excessive hardware raises opportunity costs, which is profitable.
Comment options

Thank you for your amazing project!

0 replies
Comment options

You guys need a donation page. This project keeps me from having to depend on OpenAI and Gemini, and allows a different system to be developed. I would be more than happy to throw a few dollars in the tip jar!

0 replies