GGML philosophy question... #16707

Unanswered
cptspacemanspiff asked this question in Q&A

Hey, so this is a bit weird, but I am starting a project for edge inference.

I had looked at ggml/llama-cpp in the past, but it always seemed focused on the 'happy path': decoder-only models running on common devices, doing nothing exceptionally weird, whereas I want to do things like KV cache manipulation and cross-attention cache handling with encoder/decoder models. Maybe I am missing something, but what bothered me is that you seem to have to hand-write a new runtime inference implementation for each new model you support.

Combined with multiple backends for CPU, MLX, NVIDIA, AMD, and sometimes custom NPUs, you end up with an exponential explosion of edge-case combinations to support.

I guess I am comparing this to ExecuTorch, which has an annoyingly complex export process but does export a graph at the end of the day, one that can then be run on a given backend. Maybe I am wrong, but that seems like a cleaner way to go from an HF server model to an edge-deployed model (easier to test, with fewer areas of divergence in behavior).

I am not 100% sure what I am asking, but how maintainable/extensible is the software architecture as new model types come out? Is this something you have thought about in designing the current architecture? (I think I saw some of this referenced with regard to the GGUF model format.) Or am I missing the plot entirely?


Replies: 1 comment


> I had looked at ggml/llama-cpp in the past, but it always seemed focused on the 'happy path': decoder-only models running on common devices, doing nothing exceptionally weird, whereas I want to do things like KV cache manipulation and cross-attention cache handling with encoder/decoder models.

This is mostly true today, but I think with time we can improve. Technically there are almost no limits on what you can implement with ggml, but certain experiments, such as cache manipulation, can be significantly more difficult than in higher-level frameworks. Making this easier requires abstractions, and abstractions always sacrifice some capability.
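
For concreteness, here is roughly what explicit cache manipulation looks like through llama.cpp's low-level API. This is a hedged sketch: the `llama_kv_cache_seq_*` names below come from older revisions of llama.h and have been renamed in newer releases, so treat the exact identifiers as assumptions and check the header of your checkout.

```c
// Sketch of explicit KV cache manipulation ("context shifting"),
// assuming an initialized llama_context * ctx and the older
// llama_kv_cache_seq_* names from llama.h (renamed in newer releases).
#include "llama.h"

static void drop_range(struct llama_context * ctx, llama_pos p0, llama_pos p1) {
    // remove cached positions [p0, p1) of sequence 0 ...
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, p0, p1);

    // ... then shift the remaining positions down so the sequence
    // stays contiguous (p1 == -1 means "to the end of the sequence")
    llama_kv_cache_seq_add(ctx, /*seq_id=*/0, p1, -1, -(p1 - p0));
}
```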

> Combined with multiple backends for CPU, MLX, NVIDIA, AMD, and sometimes custom NPUs, you end up with an exponential explosion of edge-case combinations to support.

It's not exponential - it is linear in the number of operators. Each backend implements the fixed ggml operator set once, and every model graph is composed from those operators, so adding a model creates no new per-backend work as long as its operators already exist.
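
To make the arithmetic concrete, here is a deliberately simplified sketch (hypothetical types and names, not the actual ggml backend interface) of why the cost scales as backends x operators rather than per model:

```c
// Conceptual sketch with hypothetical types (NOT the real ggml backend
// API): each backend fills one kernel table indexed by operator, so
// B backends over K operators cost B*K kernel implementations. A new
// model built from existing operators adds zero per-backend work.
enum op { OP_ADD, OP_MUL_MAT, OP_SOFT_MAX, OP_COUNT };

typedef void (*kernel_fn)(void * dst, const void * a, const void * b);

// one dispatch table per backend, filled at registration time
static kernel_fn cpu_kernels [OP_COUNT];
static kernel_fn cuda_kernels[OP_COUNT];

// graph execution dispatches per node; the model never sees the backend
static void run_node(const kernel_fn table[OP_COUNT], enum op o,
                     void * dst, const void * a, const void * b) {
    table[o](dst, a, b);
}
```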

> I guess I am comparing this to ExecuTorch, which has an annoyingly complex export process but does export a graph at the end of the day, one that can then be run on a given backend. Maybe I am wrong, but that seems like a cleaner way to go from an HF server model to an edge-deployed model (easier to test, with fewer areas of divergence in behavior).

I'm not familiar with ExecuTorch, but I have thought about a "graph export" feature and I don't really see a compelling argument for it. Writing a graph in code has no real disadvantage, and arguably several advantages over exporting it to a serialized format.
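
As a minimal sketch of what "writing a graph in code" means here, assuming a reasonably recent ggml checkout (exact headers and signatures vary between versions, e.g. `ggml_graph_compute_with_ctx` may live in `ggml-cpu.h` rather than `ggml.h`):

```c
// y = relu(W @ x), built and evaluated entirely in C
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // "weights" and input are tensors owned by the context
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);

    // the graph is just C expressions over tensors
    struct ggml_tensor * y = ggml_relu(ctx, ggml_mul_mat(ctx, w, x));

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);

    // ... fill w and x with data, then evaluate on the CPU:
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    ggml_free(ctx);
    return 0;
}
```

Because the graph is ordinary code, you can branch on hyperparameters, reuse tensors, or splice in cache tensors at build time, which is exactly the kind of thing a serialized export format makes awkward.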

> I am not 100% sure what I am asking, but how maintainable/extensible is the software architecture as new model types come out? Is this something you have thought about in designing the current architecture?

I think we have a very good software architecture that can be extended to any hardware and any model. I don't know the internals of other frameworks well enough to comment, but I would not be surprised if our approach has the best capabilities-to-complexity ratio.

We do need to keep paying attention to the architecture and the engineering process, and there are certainly many things we can improve. Hopefully, as the project continues to grow and gain adoption, we will attract good engineers to help us in this regard.
