[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗 #15313
-
llama.cpp as a project has made LLMs accessible to countless developers and consumers, including me. The project has also consistently become faster over time, and its coverage has grown beyond LLMs to VLMs, AudioLMs and more.
One piece of feedback we keep getting from the community is how difficult it is to use llama.cpp directly. Users often end up using Ollama or GUIs like LM Studio or Jan (and many more that I'm missing). However, it'd be great to offer end consumers a friendlier, easier path to using llama.cpp too.
Currently, if someone wanted to use llama.cpp directly:
- For Mac - `brew install llama.cpp` works
- For Linux (CUDA) - they need to clone and build directly from GitHub (sketched below)
- For Windows - winget (?)
This adds a barrier for people who are not technically inclined, especially since with all of the above methods users have to reinstall llama.cpp to get upgrades (and llama.cpp makes releases per commit - not a bad thing, but it becomes an issue since you need to upgrade more frequently).
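For reference, the Linux (CUDA) from-source path today looks roughly like this (a minimal sketch, assuming a CUDA toolkit and the usual build tools are already installed; note the repo now lives under the ggml-org organization):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j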
Opening this issue to discuss what could be done to package llama.cpp better and allow users to maybe download an executable and be on their way.
More so, are there people in the community interested in taking this up?
-
From this, IMO it only misses a Linux+CUDA bundle to be usable as a pure download-and-run option.
[Screenshot]
If we want better packaging on Linux, we can also work on a snap/bash installer for using the pre-built packages.
-
On the first point, this was a pain point I faced as well (#15249).
-
"snap" the thing from ubuntu?
-
It's high time Hugging Face copied Ollama's packaging and GTM strategy, but this time gave credit to llama.cpp. Ideally, we should retain llama.cpp as the core component.
-
For HF - they should follow a similar path to what Jan.ai does, for example, i.e. raw llama.cpp, not some weird fork like Ollama.
-
On technology - yes; on marketing strategy, GTM, and partnership lock-in - there is nothing like Ollama's way of doing it. I can't open any app without finding Ollama as its "local AI" integration.
-
Is the barrier the installation process, or the need to use a complex command line to launch llama.cpp?
-
A bit of both.
I think packaging has improved in recent times - Windows, for example. I kind of like the containers approach for Linux; it makes sense, though I'm not sure how many Linux distros are going to tolerate separate llama-cpp-cuda, llama-cpp-rocm, and llama-cpp-vulkan packages under their packaging guidelines.
I do think that, at a minimum, llama.cpp should document upstream which flags should be used with each backend - CUDA, Vulkan, Metal (--threads, --cache-reuse, --flash-attention, etc.), as discussed here:
People often use sub-optimal flags with llama.cpp and assume the performance isn't up to scratch.
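For illustration, a typical GPU-offloaded invocation might look like the following (just a sketch on my part - exact flags and sensible values vary between releases and hardware, and the model path is a placeholder):
./llama-server -m ./models/some-model.gguf -ngl 99 -c 8192 -t 8 --port 8080
Documenting a handful of such per-backend baselines upstream would already go a long way.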
-
Are you using the GGML_CPU_ALL_VARIANTS option?
-
Nope. Separate builds per CPU instruction set (for example, using -DGGML_AVX512=ON for an AVX512 build). We even found people using processors that don't even have AVX2.
-
You should look into GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS; they are designed to solve this problem. Most of our releases use them.
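Roughly along these lines (a sketch - please verify the exact option names against the current CMake files and release workflow):
cmake -B build -DGGML_NATIVE=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON
cmake --build build --config Release
That produces a single package that loads the best-matching CPU variant at runtime instead of needing one build per instruction set.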
-
Thank you!
-
For me the biggest thing is llama-server. That tool is fantastic... but it has very low discoverability. Until a few months ago I still thought it was just a demo because it lived in the examples/ folder (I just noticed it moved from there to the tools/ folder in May).
I'd love to see more emphasis placed on llama-server. It's really good! I think it's probably the most end-user appropriate way to interact with this project.
My ideal would be for the llama.cpp project to ship official installers for llama-server on Mac, Windows and Linux, which on Mac and Windows work like a desktop application: you get an icon you can use to launch the tool, which starts the server running and opens a window that shows the web UI.
Maybe include systray integration and a simple UI for selecting and downloading models too.
At that point llama-server would feel like an alternative to Ollama and LM Studio and Jan. I think it deserves that - it already has most of what those tools offer implemented; what's missing is an installer and a thin desktop shell.
-
My ideal would be for the llama.cpp project to ship official installers for llama-server on Mac and Windows and Linux,
An easy way to do this is to make it a PWA. It will work in most major browsers, including Safari, Chrome and its derivatives, and on Android and iOS, but not Firefox. It only takes a few lines of code to do this.
-
Hey @simonw, as a matter of fact we are working on a new WebUI for llama.cpp, and we are planning to put it into a native app as well!
The first step is going to be the release of a new version of the WebUI, rewritten in Svelte and with much better UI/UX.
You can track the progress here #14839 (comment)
I am planning to have the Pull Request ready for a review at the beginning of the next week, so stay tuned!
-
I hope they won't deprecate the llama-server CLI in favor of a llama-server GUI - it would be too annoying to use on Termux or over SSH.
-
Hey @simonw, as a matter of fact we are working on new WebUI for llama.cpp and we are planning to put it into a native app as well!
The first step is going to be a release of a new version of WebUI rewritten in Svelte and with a much better UI/UX.
You can track the progress here #14839 (comment)
I am planning to have the Pull Request ready for a review at the beginning of the next week, so stay tuned!
I don't think I'm that new to this - I can compile llama.cpp for a specific GPU - but I don't know how to use the llama.cpp server for persistent chat-session data caching. Does it simply not have this feature? Or am I missing a parameter?
-
It's just moving too fast to be ready for a simple installer - I don't think a maintainer wants to keep up ;) But they have certainly made it easy for anyone with basic Linux skills. They have fixed so many compatibility issues over the last year. It's pretty simple now: git pull and cmake. The hardest part is the Python side of it. And yes, it takes some reading, but it's about learning. You can even ask ChatGPT how to install it, what the best parameters are, and which LLMs it would recommend for your hardware. Most people want to learn - otherwise they would just get a subscription to some SaaS out there. So, rather than complaining and putting the onus on the great people behind llama.cpp, roll up your sleeves. However, in the end, a lot of people don't realize the heavy lifting comes in getting agentic tools working with your LLM. llama.cpp just made access to private LLMs possible for millions who want to 'play for free'. But for many: ask yourselves why you want your own LLM first.
-
It would be cool if llama-server had an auto-configuration option for the machine/model like Ollama does.
-
For Windows, maybe Chocolatey (choco) and the Microsoft Store would be a good idea? 🤔
-
llama.cpp as a project has made LLMs accessible to countless developers and consumers including me. The project has also consistently become faster over time as has the coverage beyond LLMs to VLMs, AudioLMs and more.
One feedback from community we keep getting is how difficult it is to directly use llama.cpp. Often times users end up using Ollama or GUIs like LMStudio or Jan (there's many more that I'm missing). However, it'd be great to offer a path to use llama.cpp in a more friendly and easy way to end consumers too.
Currently if someone was to use llama.cpp directly:
1. For Mac - `brew install llama.cpp` works
2. For Linux (CUDA) - they need to clone and install directly from github
I created an RPM spec to manage installation, though I think Flatpaks might be more user-friendly and distribution-agnostic.
-
For Windows - winget (?)
The released Windows builds are available via Scoop.
Updates happen automatically. Old installed versions are kept, and the current one is symlinked into a "current" folder, which provides the executables on the PATH.
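Assuming the package is simply named llama.cpp in the Scoop bucket (worth double-checking), installing and upgrading reduces to:
scoop install llama.cpp
scoop update llama.cpp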
-
Is it feasible to have a single release per OS that includes all the backends?
-
It's technically possible, but the size of the package would be quite big.
-
The easiest way is with an αcτμαlly pδrταblε εxεcμταblε.
-
For Linux I just install the Vulkan binaries and run the server from there. Maybe we could have an install script like Ollama's that detects the system and launches the server, which can be controlled from an app as well as the CLI? The user would then get basic command-line utilities like run, start, stop, load, list, etc.
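Something along these lines is what I have in mind - purely a hypothetical sketch, nothing that exists today:
#!/bin/sh
# Hypothetical installer sketch: pick a prebuilt llama.cpp bundle based on the detected GPU stack.
if command -v nvidia-smi >/dev/null 2>&1; then
    BACKEND=cuda
elif command -v vulkaninfo >/dev/null 2>&1; then
    BACKEND=vulkan
else
    BACKEND=cpu
fi
echo "Would download the prebuilt $BACKEND bundle and install it under ~/.local/llama.cpp"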
-
On Mac, the easiest way (also arguably the safest way) from a user's perspective is to find it in the App Store and install it from there. Because apps from the App Store are sandboxed, installing or uninstalling is simple and clean from a user's point of view. Creating a build and passing the App Store review might take some effort (due to the sandbox constraints), but it should be a one-time thing.
-
It's my understanding that none of the automated installs support GPU acceleration. I might be wrong, but it's definitely the case for Windows, which makes it useless to install via winget.
-
The winget package comes with the Vulkan backend.
-
Does that work reliably with CUDA cards, and does it support the same kinds of models (Qwen2.5VL, for instance)?
-
My experience with the Vulkan backend on an RTX 3060 has been excellent.
-
To me, the biggest advantage Ollama currently has is that the optimal settings for a model are bundled. The GGUF spec would allow for this too, since it's versatile enough to make this a metadata field inside the model. It would allow people to load the settings from a GGUF, and frontends could extract them and adapt them as they see fit. I think that part is going to be more valuable than obtaining the binary, since downloading the binary from GitHub is not that hard.
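As a rough illustration of the idea (gguf-dump ships with the gguf Python package; the tuning.* keys below are hypothetical - nothing like them is standardized today):
# inspect the metadata already stored in a model file
gguf-dump some-model.gguf
# a frontend could then look for hypothetical keys such as
# tuning.temperature or tuning.top_p and pass them on to llama-server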
-
This is the killer feature right here. I'd use llama.cpp directly over any wrapper if this was added.
-
My personal wishlist
- Llama-swap integration
- Catch-all compiled binaries for Linux, Windows, macOS
- GUI improvements
-
I feel like point number 1 is widely undermentioned in this thread. The ease that Llama-Swap adds to being able to bounce around models for various tasks is invaluable. If it weren't for that, I would probably still be using Ollama. It still needs a bit of upfront setup, but once you get a model in, it basically becomes "set it and forget it."
Edit: I just realized who I was replying to and felt the need to come back and say I love the YT channel, man. Keep up the good work!
-
Docker Model Runner provides 1 and 2, FWIW... They are trying to grow the community too... It uses llama.cpp upstream and contributes any required changes back... For number 3, I recommend AnythingLLM, but there are many options.
-
I just went through this, so it's fresh in my mind. I had not used Linux since about 2012, when I changed our in-house servers to Windows Server. For running local LLMs I have used Text-Generation-WebUI, Ollama, and LM Studio. With the release of GPT-OSS-120B, most of the wrappers did not keep up with the new changes as quickly as I wanted. I run a 5090, and I also wanted to move to the newer torch release. I was doing some small training work that needed Triton anyway, so: Linux.
The OP lists using winget on Windows, but that precompiled llama.cpp is Vulkan. The binary distributions stop at CUDA 12.4, so if you need the newer torch builds (>12.4), it's source code. Having been away from Linux a long time, setting up WSL and creating everything for a build was a bit of a struggle. I actually had GPT-OSS write me a step-by-step walkthrough with copy-and-paste commands so I wouldn't screw it up.
Once everything was running, it was great, and a significant speedup. I've moved my dev work to Linux now. Then I wrote a script (really, me and GPT-OSS wrote a script) to fire up llama.cpp, so I can just pick models from an old-fashioned script menu. It's crude, but it's never out of date. I posted the script-creation script that GPT wrote below because I thought the idea of a self-creating script was nice.
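For the curious, the kind of wrapper I mean is roughly this - a simplified, hypothetical sketch with placeholder paths, not the actual script GPT wrote:
#!/bin/sh
# Hypothetical model picker: list GGUF files and start llama-server with the chosen one.
MODEL_DIR="$HOME/models"
i=1
for f in "$MODEL_DIR"/*.gguf; do
    echo "$i) $(basename "$f")"
    i=$((i+1))
done
printf "Pick a model number: "
read -r n
MODEL=$(ls "$MODEL_DIR"/*.gguf | sed -n "${n}p")
exec ./llama-server -m "$MODEL" -ngl 99 --port 8080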
-
IMHO the UI and the packaging could be a dedicated separate repo under the ggml umbrella. llama.cpp should remain true to its manifesto of "inference at the edge" and continue to be a fast and feature-full backend for edge devices. The impedance mismatch of being both a developer product and a consumer product is probably not easy to manage.
-
For Linux (CUDA) - they need to clone and install directly from github
llama.cpp is packaged well in Gentoo, no need to do anything fancy.
-
When will llama.cpp have the ability to search the internet? Without the ability to search for information online, relying solely on existing training data, its answers are always outdated or biased, and it can only be considered a toy with no practical value.
-
I recommend setting up Open WebUI with llama.cpp as the back-end for that. You can plug in llama-server via the OpenAI-compatible API.
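A minimal sketch of that wiring, assuming Docker plus the image name and OPENAI_API_BASE_URL variable from the Open WebUI docs (double-check both; the model path is a placeholder):
./llama-server -m some-model.gguf --port 8080
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 ghcr.io/open-webui/open-webui:main
Open WebUI then talks to llama-server through the OpenAI-compatible endpoint.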
-
It has that capability now: you can add a 'tool'. The .jinja methods are a bit of a pain, but you can declare those quite easily. A small connector to Brave's MCP is a start.
'''
#!/usr/bin/env python3
import sys
import json
import requests

# ----------------------
# Tool metadata
# ----------------------
TOOL_METADATA = {
    "name": "brave_mcp",
    "description": "Perform a Brave MCP search and return summarized results.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query to send to Brave MCP"},
            "num_results": {"type": "integer", "description": "Number of results to return"}
        },
        "required": ["query"]
    },
    "usage": "Use this tool whenever the user asks to search the web, or if you need current information only available by searching the web. Respond with elements of the search result in your answer."
}

# ----------------------
# Tool logic
# ----------------------
def run_tool(params):
    """
    Perform a Brave MCP search query.
    """
    query = params.get("query", "")
    num_results = int(params.get("num_results", 3))
    MCP_ENDPOINT = "https://api.brave.com/mcp/search"
    payload = {"query": query, "num_results": num_results}
    try:
        resp = requests.post(MCP_ENDPOINT, json=payload, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        results = []
        for item in data.get("results", []):
            results.append({
                "title": item.get("title", ""),
                "snippet": item.get("snippet", ""),
                "url": item.get("url", "")
            })
        return {"query": query, "results": results or []}
    except requests.exceptions.RequestException as e:
        return {"error": f"Brave MCP request failed: {str(e)}"}
    except Exception as e:
        return {"error": f"Unexpected error: {str(e)}"}

# ----------------------
# Main execution
# ----------------------
if __name__ == "__main__":
    # Return metadata if requested
    if len(sys.argv) > 1 and sys.argv[1] == "--metadata":
        print(json.dumps(TOOL_METADATA, indent=2))
        sys.exit(0)
    # Read JSON input from stdin
    try:
        params = json.load(sys.stdin)
    except json.JSONDecodeError:
        print(json.dumps({"error": "Invalid JSON input"}))
        sys.exit(1)
    # Run tool and return results
    try:
        result = run_tool(params)
        print(json.dumps(result))
    except Exception as e:
        print(json.dumps({"error": f"Tool execution failed: {str(e)}"}))
        sys.exit(1)
'''
-
I keep my llama.cpp in a Podman container, and I've set up scripts to build it in a container as well. I'm using CUDA.
Before I picked up llama.cpp I used Ollama, but found it a bit too excessive and not suited to me. I also saw that most libraries just package up llama.cpp and run my workload there. I ended up just using llama.cpp directly through the C API with Python bindings.
I figure it's nice to build llama.cpp, but it's a bit difficult to pick up because there are so many options and things to look at. Non-technical people might have difficulty picking it up just for that. But I personally am here for it.
I find llama.cpp could use improvement in a variety of places. One major one is the chat-format interfaces. I think the current way chat templates work is too generic: the way it actually works is that you have an append-only log, and you pay a high cost if you splice it, while chat templates pretend you can add stuff in the middle and are essentially for-loops over a messages list. I'd suggest completely changing this design:
- have a toplevel prompt.
- Then have message templates for system, user, assistant prompts.
- Likewise, have message templates for fill-in-the-middle completion.
- Finally, have an assistant-starts-talking prelude...
- ...and a way to parse the assistant output back into the corresponding messages,
- and an assistant-stops-talking token sequence.
- Just have a simple templating format that includes or leaves out parts of a message depending on whether they are present.
In practice that covers all message formats, because they're forced to be constrained in the same way as long as they generate tokens in sequence. Even if that constraint were lifted some day, I think this would make a good format.
-
I think we should focus on what it is now; non-tech users will use Ollama or LM Studio anyway.
-
Hey everyone!
I’ve been working on some scripts and tools to make it super easy to download and run llama-server on any Linux machine (not just Ubuntu) and for both aarch64 and x86_64.
Feature detection is done before downloading to reduce download time, and CUDA is supported as well, though not fully optimized yet. This is also the only way to provide optimized builds for aarch64.
Right now, models are downloaded with hf download, to keep things as easy as possible, like Ollama.
That means the hf CLI is required for now.
On Linux, you can try it with:
curl -s https://angt.github.io/installama.sh | sh -s -- qwen3-4b:Q4_0
This should start a server running the requested model.
I’d love to hear your thoughts on this kind of packaging approach for Linux.
I’m currently exploring macOS with Metal support as the next step.
-
The scripts you've written look super clean and easy to understand. The script probes with target-features, which is super simple. It's got a similar program called cuda-probe that does the same for CUDA, and it's necessary that they come first. You've created a lot of binaries for different platforms and likely have the scripts in place to update them automatically.
It looks like the bare minimum to get it up and running. I personally like bare minimums. I would even throw away the hf dependency and let people decide how they download their Hugging Face models. You've already done the bulk of the difficult job by compiling it all for them and feature-detecting which version needs to be downloaded.
Then if the community creates frontends - say, for Linux with the Python rich and argparse modules so that it's nice and colorful and runs in a terminal - that might make it a full turn-key solution. Or, if you really insist on writing more complex bash scripts, that works too.
TL;DR I like it!
-
Please keep in mind that piping scripts directly from a web page into your shell is inherently unsafe, and you (the user) are responsible for checking the contents of the script before running it. It's very easy for something unwanted to happen this way.
-
For Linux (CUDA) - they need to clone and install directly from github
It would be nice to have binaries for Ubuntu with CUDA, even though the Vulkan ones work pretty well.
-
This is exactly what installama provides: https://huggingface.co/datasets/angt/installamacpp-cuda/tree/main.
They should work on all recent Linux distributions.
-
It would be great if GGML_CPU_ALL_VARIANTS=ON binaries with CUDA support were included in this. I have an old Ivy Bridge machine for llama.cpp inference, and it would make my life so much easier if I could just download a ready-to-go binary that will work with Nvidia cards and not require AVX2 or FMA, but take advantage of AVX and F16C.
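Until then, a source build restricted to the older instruction sets should be doable with something like this (a sketch on my part - verify these GGML_* options against the current CMakeLists):
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF -DGGML_AVX=ON -DGGML_F16C=ON -DGGML_AVX2=OFF -DGGML_FMA=OFF
cmake --build build --config Release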
-
Yes, for CUDA I left the llama.cpp defaults, to wait for comments and check whether the current values are the right ones :)
I believe I can reduce it to AVX without causing significant performance degradation (I need to check).
-
@pandruszkow I've updated the Linux CUDA binaries compiled with -march=x86-64-v2. To test them, please delete your ~/.installama directory and rerun the command. It would be awesome if you could check that they work on your setup :)
-
Hey!!
Apple M1 through M4 are now supported with Metal. It turned out to be much easier than expected.
As before, just run:
curl -s https://angt.github.io/installama.sh | sh -s -- qwen3-4b:Q4_0
and you’re good to go 🚀
Again, hf is still required to download the models; I'm going to work on that.
-
Do you mean using libhttp for the file downloads?
Generally, dropping the libcurl dependency would be great. But last time I looked into this, we still needed a libssl dependency, which ironically seems even more difficult to provide than libcurl. But if you have ideas for how to improve this, feel free to give it a try. Fewer dependencies are better.
-
Yes, this is exactly what I had in mind. I've opened a PR to explore this: #16185
-
All builds have been updated with the changes from PR #16185.
Now you can run:
curl -s https://angt.github.io/installama.sh | sh -s -- -hf unsloth/Qwen3-4B-Instruct-2507-GGUF:Q4_0
-
Interesting approach! I'll take a look at the PR soon.
Btw, having an installer that does not require a model to be specified would also be useful. I suppose it is currently like this for ease of use, but I think eventually the installer should run without setting a model.
-
True! I've updated the script, so now you can run it without any arguments:
$ curl -s https://angt.github.io/installama.sh | sh
Run ~/.installama/llama-server to launch the llama.cpp server
It's still an early PoC, so everything can be changed or improved :)
-
In my Node.js project, where I have llama.cpp in the text-generation module, I use it like this:
bash install.sh
ai | chat | generate | webgpt
It works on Ubuntu and Alpine; maybe adding support for all distros could be a good option - direct and minimalistic.
-
I just set up this build process for precompiled binaries of llama.cpp for Ubuntu 24.04 with NVIDIA 12.9 drivers.
Just in case it helps anyone else 😸
-
Maybe interesting for someone who wants to build from source under Windows: I made a PowerShell script which installs all the necessary prerequisites for a CUDA build, pulls the repo, and builds it.
https://github.com/Danmoreng/llama.cpp-installer
-
First, thanks for this initiative - I think making it easier for more folks to use llama.cpp directly, instead of going through commercial middleware like LM Studio or Ollama, is a win-win for everyone.
I think the docs could use a lot of improvement to make them more accessible, better organized, and more complete, with more examples, etc. (this will in turn also help AI search). I started on this a bit here: #15709
Happy to contribute more if it would be useful. Even asking AI to clean up and reorganize the current docs would be a net positive.
-
As a downstream consumer, I would love to have a binary release of hip-radeon (ROCm) for Ubuntu, as you already do for Windows. At the same time, there is still no Docker image for this backend (#11913), so we are still forced to build from source manually at each release.
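In the meantime, a local image can be built from the in-tree Dockerfile - a sketch on my part, since the Dockerfile path and stage names may differ between releases (check the .devops directory first):
docker build -t local/llama.cpp:server-rocm --target server -f .devops/rocm.Dockerfile .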