GGML philosophy question... #16707

Unanswered
cptspacemanspiff asked this question in Q&A

Hey, so this is a bit weird, but I am starting a project for edge inference.

I had looked at ggml/llama-cpp in the past, but it always seemed focused on the 'happy path': decoder-only models running on common devices, doing nothing exceptionally weird, whereas I want to do things like KV cache manipulation and cross-attention cache handling with encoder/decoder models. Maybe I am missing something, but what bothered me is that you seem to have to hand-write a new runtime inference implementation for each new model you support.

Combined with multiple backends for CPU, MLX, NVIDIA, AMD, and sometimes custom NPUs, you end up with an exponential explosion of edge-case combinations to support.

I guess I am comparing this to ExecuTorch, which has an annoyingly complex export process but does export a graph at the end of the day, one that can then be run on a given backend. Maybe I am wrong, but that seems like a cleaner way to go from an HF server model to an edge-deployed model (easier to test, with fewer areas of divergence in behavior).

I am not 100% sure what I am asking, but how maintainable/extensible is the software architecture as new model types come out? Is this something you have thought about in designing the current architecture? (I think I saw some of this referenced with regard to the GGUF model format.) Or am I missing the plot entirely?


Replies: 1 comment


> I had looked at ggml/llama-cpp in the past, but it always seemed focused on the 'happy path': decoder-only models running on common devices, doing nothing exceptionally weird, whereas I want to do things like KV cache manipulation and cross-attention cache handling with encoder/decoder models.

This is mostly true today, but I think with time we can improve. Technically there are almost no limits on what you can implement with ggml, but certain experiments, such as cache manipulation, can be significantly more difficult than in higher-level frameworks. Making this easier requires abstractions, and abstractions always sacrifice some capability.
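
For concreteness, here is roughly what explicit cache manipulation looks like through llama.cpp's low-level API. This is a hedged sketch: the `llama_kv_cache_seq_*` names below come from older revisions of llama.h and have been renamed in newer releases, so treat the exact identifiers as assumptions and check the header of your checkout.

```c
// Sketch of explicit KV cache manipulation ("context shifting"),
// assuming an initialized llama_context * ctx and the older
// llama_kv_cache_seq_* names from llama.h (renamed in newer releases).
#include "llama.h"

static void drop_range(struct llama_context * ctx, llama_pos p0, llama_pos p1) {
    // remove cached positions [p0, p1) of sequence 0 ...
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, p0, p1);

    // ... then shift the remaining positions down so the sequence
    // stays contiguous (p1 == -1 means "to the end of the sequence")
    llama_kv_cache_seq_add(ctx, /*seq_id=*/0, p1, -1, -(p1 - p0));
}
```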

> Combined with multiple backends for CPU, MLX, NVIDIA, AMD, and sometimes custom NPUs, you end up with an exponential explosion of edge-case combinations to support.

It's not exponential - it is linear in the number of operators. Each backend implements the fixed ggml operator set once, and every model graph is composed from those operators, so adding a model creates no new per-backend work as long as its operators already exist.
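
To make the arithmetic concrete, here is a deliberately simplified sketch (hypothetical types and names, not the actual ggml backend interface) of why the cost scales as backends x operators rather than per model:

```c
// Conceptual sketch with hypothetical types (NOT the real ggml backend
// API): each backend fills one kernel table indexed by operator, so
// B backends over K operators cost B*K kernel implementations. A new
// model built from existing operators adds zero per-backend work.
enum op { OP_ADD, OP_MUL_MAT, OP_SOFT_MAX, OP_COUNT };

typedef void (*kernel_fn)(void * dst, const void * a, const void * b);

// one dispatch table per backend, filled at registration time
static kernel_fn cpu_kernels [OP_COUNT];
static kernel_fn cuda_kernels[OP_COUNT];

// graph execution dispatches per node; the model never sees the backend
static void run_node(const kernel_fn table[OP_COUNT], enum op o,
                     void * dst, const void * a, const void * b) {
    table[o](dst, a, b);
}
```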

> I guess I am comparing this to ExecuTorch, which has an annoyingly complex export process but does export a graph at the end of the day, one that can then be run on a given backend. Maybe I am wrong, but that seems like a cleaner way to go from an HF server model to an edge-deployed model (easier to test, with fewer areas of divergence in behavior).

I'm not familiar with ExecuTorch, but I have thought about a "graph export" feature and I don't really see a compelling argument for it. Writing a graph in code has no real disadvantage, and arguably several advantages over exporting it to a serialized format.
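
As a minimal sketch of what "writing a graph in code" means here, assuming a reasonably recent ggml checkout (exact headers and signatures vary between versions, e.g. `ggml_graph_compute_with_ctx` may live in `ggml-cpu.h` rather than `ggml.h`):

```c
// y = relu(W @ x), built and evaluated entirely in C
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // "weights" and input are tensors owned by the context
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);

    // the graph is just C expressions over tensors
    struct ggml_tensor * y = ggml_relu(ctx, ggml_mul_mat(ctx, w, x));

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);

    // ... fill w and x with data, then evaluate on the CPU:
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

    ggml_free(ctx);
    return 0;
}
```

Because the graph is ordinary code, you can branch on hyperparameters, reuse tensors, or splice in cache tensors at build time, which is exactly the kind of thing a serialized export format makes awkward.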

> I am not 100% sure what I am asking, but how maintainable/extensible is the software architecture as new model types come out? Is this something you have thought about in designing the current architecture?

I think we have a very good software architecture that can be extended to any hardware and any model. I don't know the internals of other frameworks well enough to comment, but I would not be surprised if our approach has the best capabilities-to-complexity ratio.

We do need to keep paying attention to the architecture and the engineering process, and there are certainly many things we can improve. Hopefully, as the project continues to grow and gain adoption, we will attract good engineers to help us in this regard.
