Skip to content
DEV Community

DEV Community

Introducing LlamaStash: a zero-overhead, terminal-native llama.cpp launcher

#ai #llamacpp #localllm #rust
8 reactions
Comments 1 comment
11 min read

This is not a tutorial on how to reproduce every single bit of my setup. My full personal configuration is private because it has too much machine-specific and personal stuff. But I'm making a stripped-down public version with the bare minimum needed for Arch, niri, DMS, OpenCode, and llama.cpp at deepu105/archdots.

This post is more about the current shape of my Linux development machine and why I ended up with this stack.

This is my primary machine for all of the below.

Machine configuration

The configuration of the machine is quite crucial for this setup. Running a browser, a few IDEs, Docker, terminals, and local LLMs is not exactly a light workload.

My current machine is an ASUS ROG Flow Z13 2025 model. It is a weird little beast. It is technically a tablet, but it has enough CPU, GPU, and memory to behave like a mobile workstation.

Here is the current setup.

The memory is the most interesting part here. For normal development work, 32GB is still fine and 64GB is great. But for local AI work, memory changes everything. A 27B quantized model, a large context window, Docker, Chrome, and an editor can happily eat memory like there is no tomorrow.

Having that much unified memory means the machine can run a useful local coding model without feeling like a science experiment. That is a big deal.

Operating system

I praised Fedora in the previous posts, and I still think Fedora is one of the best Linux distributions for most developers. Updates are smooth, new packages land often, and it mostly stays out of the way.

But this time I went with vanilla Arch Linux. So yes, I use Arch btw! πŸ˜‰ I know, rolling release and all that. I have been using Linux long enough to know what I was signing up for.

The main reason was simple: I wanted the latest kernel, Mesa, ROCm-adjacent bits, Wayland tools, and desktop packages without waiting for the next distro release. New hardware like the Flow Z13 usually benefits from being closer to the bleeding edge. Arch gives me that. Well, OK, I also fell in love with the sexy new compositors like niri and Hyprland, and Arch is a great way to run those without waiting for backports. I started with Hyprland, but I ended up liking niri better for my workflow, and Arch made it easy to switch and experiment.

My installation is still fairly boring, and I mean that as a compliment.

I also use Topgrade to keep the system updated. My private config even wires it into DankMaterialShell, so I can see available updates from the bar and trigger an update for everything on the system from pacman/AUR, brew, cargo, npm, VS Code plugins, Docker images, and so on in Kitty.

Again, quite simple, at least in my eyes.

Desktop environment, or lack of one

This is probably the biggest change from my previous setup. I no longer run GNOME or KDE as my main desktop. I use niri, which is a scrollable tiling Wayland compositor.

If you have not used niri, the workflow is quite different from a regular tiling window manager. Instead of forcing everything into a fixed grid, windows live in columns and you scroll horizontally across them. It sounds odd until it clicks. Once it clicks, it feels very natural on ultrawide monitors and laptop displays. I especially love the touchpad gestures for switching workspaces and moving windows around. It is a very fluid way to manage windows.

Scrolling workspaces in niri

My current session looks like this.

Niri and DMS

Niri gives me the compositor. DMS gives me the desktop shell pieces that I would otherwise have to stitch together myself.

DMS replaces a lot of the usual Wayland desktop plumbing:

This is the kind of stuff where I do not want to maintain five different tools and a bunch of scripts if one project does the job well enough. DMS is still young, but it is already quite useful, especially with niri. It's also quite extensible, and I have already started adding tools that I want. For example, a locally saved TODO widget.

The Flow Z13 also needs some special handling. I have fixes for ASUS hotkeys, touchpad behavior, keyboard backlight, Thunderbolt rescans, and Wi-Fi quirks in my private config. The public archdots repo will only carry the reusable bits. This is Linux on new hardware, so of course there are quirks. What is a Linux experience without glitches, right?

Development tools

My development tools are still mostly boring, in a good way. These are subjective choices, and they do not matter as long as you are comfortable with your tools.

My development tools

Shell: I use Zsh with zinit, Powerlevel10k, zoxide, and fzf. I still use a bunch of aliases for Git, Docker, package management, Jekyll, and local AI tools.

Terminal: I use Kitty. I have tabs, splits, clipboard bindings, quick access terminal, and a few custom keybindings. It is fast, it works well on Wayland, and it does not get in my way.

Editors: I use Neovim with LazyVim as my default editor. I still use Visual Studio Code depending on the project and what I am testing.

Toolchains: I use SDKMAN! for JDKs, NVM for Node.js, rustup for Rust, Bun, Go, Python, Deno, and the usual Linux build tools.

DevOps: Docker, Docker Compose, kubectl, kdash, Terraform, Distrobox, and so on. Some come from pacman or AUR, some from Homebrew, and some from language-specific installers.

Offline AI-assisted development

Now to the fun part.

I use cloud AI tools as well, and they are useful. But I also wanted a setup where I can code with an AI assistant without sending code, prompts, logs, or half-written ideas to a remote API. Not because every project is secret, but because local-first tooling is a good capability to have especially in a world that's heading towards techno oligarchy.

My current stack is:

Here is my OpenCode provider config:

{"$schema":"https://opencode.ai/config.json","provider":{"llama.cpp":{"npm":"@ai-sdk/openai-compatible","name":"llama.cpp ROCm (local)","options":{"baseURL":"http://127.0.0.1:18080/v1"},"models":{"qwen3-6-27b-q8-0":{"name":"Qwen3.6 27B Q8_0 (local ROCm)","limit":{"context":262144,"output":16384}},"qwen3-6-27b-q6-k":...,"qwen3-6-27b-q4-k-m":...,"gemma-4-31b-it-q4-k-m":...,"gemma-4-31b-it-q8-0":...}},"openrouter":{"models":{"moonshotai/kimi-k2.6":{"name":"Kimi K2.6 (OpenRouter backup)","limit":{"context":262144,"output":16384}},"deepseek/deepseek-v4-pro":{"name":"DeepSeek V4 Pro (OpenRouter backup)","limit":{"context":1048576,"output":384000}}}}}}
Enter fullscreen mode Exit fullscreen mode

I start the local model server with an alias.

llamaServer
Enter fullscreen mode Exit fullscreen mode

That points to a small script. It lets me pick a GGUF model, context size, and reasoning mode. It remembers the last choice, so most of the time I just start it and get going.

The default model and context right now are:

Qwen3.6-27B-Q8_0.gguf - 256k context
Enter fullscreen mode Exit fullscreen mode

Here is a quick llama-bench comparison of the local models on my machine. The numbers are tokens per second with ROCm, full GPU offload, flash attention, f16 KV cache, a 4096-token prompt, a 256-token generation, and 3 repetitions.

Model Quantization Size Prompt tokens/s Generation tokens/s
Qwen3.6 27B Q4_K_M 15.40 GiB 260.06 10.41
Qwen3.6 27B Q6_K 20.56 GiB 279.37 8.70
Qwen3.6 27B Q8_0 26.62 GiB 260.12 7.18
Gemma 4 31B IT Q4_K_M 17.39 GiB 209.57 9.12
Gemma 4 31B IT Q8_0 30.38 GiB 202.31 6.19

The full context is 256k tokens. Here is a benchmark with full context for the Qwen variants.

Model Quantization Size Prompt+Generation tokens/s
Qwen3.6 27B Q4_K_M 15.40 GiB 67.15
Qwen3.6 27B Q6_K 20.56 GiB 65.77
Qwen3.6 27B Q8_0 26.62 GiB 64.34

Running Qwen3.6 27B Q8_0 with 256k context in reasoning mode loads around 70% of the GPU memory in my setup and gives around 64 tokens/s for prompt+generation. That is quite good for a local model with that much context.

The llama.cpp build is also automated with a small script.

cmake -S /mnt/work/Workspace/llms/llama.cpp \
 -B /mnt/work/Workspace/llms/llama.cpp/build-hip \
 -G Ninja \
 -DGGML_HIP=ON \
 -DAMDGPU_TARGETS=gfx1151 \
 -DCMAKE_BUILD_TYPE=Release
cmake --build /mnt/work/Workspace/llms/llama.cpp/build-hip \
 --config Release \
 -j "$(nproc)" \
 --target llama-server llama-bench
Enter fullscreen mode Exit fullscreen mode

The server runs like this under the hood.

ROCBLAS_USE_HIPBLASLT=1 llama-server \
 --model "$model" \
 --alias "$alias_name" \
 --host 127.0.0.1 \
 --port 18080 \
 --ctx-size "$ctx" \
 --n-gpu-layers 999 \
 --flash-attn on \
 --no-mmap \
 --cache-type-k f16 \
 --cache-type-v f16 \
 --batch-size 4096 \
 --ubatch-size 512 \
 --reasoning "$reasoning"
Enter fullscreen mode Exit fullscreen mode

Once the server is running, OpenCode talks to it like it would talk to any OpenAI-compatible provider. The difference is that the whole loop stays on my machine.

It's very elegant IMO!

I do not only use local models, though. For complex tasks, I also use frontier models through OpenRouter, mostly Kimi K2.6 and DeepSeek V4. Occasionally I use Copilot CLI and at work, I use Claude Code as well.

For the harness, I prefer OpenCode. I do not see any noticeable performance difference between Claude Code and OpenCode with Kimi or DeepSeek for the kind of coding tasks I do, which is mostly open source projects in Rust and TypeScript. That might vary for other people, of course, but for me OpenCode has been quite good and I especially prefer its UX over others. I'm trying Pi on the side as well to see if I keep it in the mix.

Why local AI coding matters to me

Local AI is not a replacement for everything. The best hosted models are still better for many tasks, especially when you need maximum reasoning quality or very fast responses. But local models have their own sweet spot.

For me, the advantages are clear.

But there are tradeoffs.

So no, I do not think everyone should run a local coding model. But if you enjoy owning your stack and you have the hardware for it, it is a very satisfying setup.

The AI workflow

My usual workflow is quite simple.

  1. Start the local model server with llamaServer.
  2. Pick the model and context preset if I want to change it.
  3. Start opencode in the repository and pick a model if I want to change it.
  4. Ask it to inspect the codebase before making changes.
  5. Let it edit, test, and iterate, while I review the changes using the opencode-telegram-bot remotely from Telegram.

For small tasks, I turn reasoning off because it makes tool-heavy work faster. For design questions, debugging, or code review, I turn reasoning on. The script makes that a prompt instead of forcing me to remember a long command.

This is the kind of boring automation I like. It removes friction without hiding what is actually happening.

Productivity and media tools

Most of my productivity stack did not change much.

Browser: Google Chrome is still my primary browser. I also keep Firefox around.

Password management: I use Bitwarden and a YubiKey.

Communication: Zoom, Signal, Telegram, and the usual suspects.

Screen capture: DMS screenshot plugin, screen recorder plugin, and OBS Studio when I need more control.

Images and video: Gimp, Inkscape, Kdenlive, and a few Flatpak utilities like Upscayl and Buzz.

File manager: Dolphin, because KDE apps are still excellent even when KDE is not my main desktop.

What is still not perfect

Of course, not everything is perfect. This is bleeding-edge Linux, on a new ASUS convertible, with a new AMD chip, a Wayland compositor, and a local AI stack. If everything worked perfectly on day one, I would be suspicious.

Some current rough edges are below.

None of these are deal breakers for me. Most are either already fixed in my private config or on my TODO list.

Conclusion

This is easily the most interesting Linux machine I have used so far. My 2019 setup was beautiful, my 2021 setup was sleek, and this one feels like a proper local-first AI development workstation.

Vanilla Arch gives me the latest bits. Niri gives me a workflow that fits both the tiny built-in screen and my ultrawide monitor. DMS gives me the desktop polish without a full desktop environment. And OpenCode plus llama.cpp gives me an AI coding assistant that can run without the cloud.

It is not the right setup for everyone. If you want a machine that never asks you to think about kernels, ROCm, compositor configs, or model files, this is probably not it. But for me, this is exactly the kind of developer machine that sparks joy.

The right tool for the right job.

If you like this article, please leave a like or a comment.

You can follow me on Bluesky and LinkedIn.

Top comments (29)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss
Collapse Expand
webreflection profile image
Andrea Giammarchi
  • Joined

I have a similar machine but it's a Desktop one (minisforum 395+ 128GB) but while I've never looked into its BIOS, I've thought the whole point of these machines was to have similar unified memory DGX spark has, as example (and I have one of those too) ... is there any reason you had to explicitly split 64GB of memory here and there as opposite of letting the machine/OS handle that for you? Specially DS4 project (which I love and use on DGX Spark) requires 96GB minimum to run but it doesn't necessarily need to take all that space, although I believe with a 32GB CPU split and a 96GB for the GPU that project should run, still curious to learn/know why nobody on macOS needs to worry about this, and neither do I on my DGX Spark (or maybe it comes pre-configured to handle that automatically) ... thanks!

That being said, nice post ... I feel you for the AMD ROCm state but it's really getting better day by day, can't wait to have it more reliable/robust to make it the mac alternative for developers!

Collapse Expand
deepu105 profile image
Deepu K Sasidharan
JHipster co-lead, Polyglot dev, Cloud Native Advocate, Developer Advocate @Okta, Author, Speaker, Software craftsman. Loves simple & beautiful code. bit.ly/JHIPSTER-BOOKS
  • Location
    Utrecht, Netherlands
  • Education
    Electrical & Electronics Engineering
  • Work
    Developer advocate at Okta
  • Joined

Last I tried there was some issues in loading models larger than RAM. But I think its not an issue on newer kernels, I'm planning on disabling the split and see how my previous use cases work now.

Collapse Expand
harjjotsinghh profile image
Harjot Singh
21, Engineer, Building moonshift.io

i love that you're focusing on a fully offline setup for AI-assisted development. it’s cool to see how you've customized your environment with arch and niri. if you're ever interested in quickly spinning up a web app, moonshift lets you deploy a next.js + postgres + auth build in about 7 minutes, and you keep the code on your github. let me know if you want to give it a shot for free.

Collapse Expand
adityamitra profile image
Aditya Mitra
Finding out a place to dig a new hole! Find me at https://bsky.app/profile/adityamitra.bsky.social
  • Joined

You should also give omp.sh a try.
I found it much better in speed and management that opencode.

Collapse Expand
pengeszikra profile image
Peter Vivo
The Vibe Archeologist. Creator of mordorjs. |> and touch bar fanatic from Hungary. God speed you! 1John1 + 5John17 |> 1Moses1 = (1Moses2 ... 4.22John21); alpha & omega = !![];
  • Location
    Pomaz
  • Education
    streetwise
  • Work
    full stack developer at TCS
  • Joined

Looks great! I like to use linux, at least unix based terminal. For example my company laptop is a windows11 but the wls install ubuntu 22.4 partial solve my development workflow. I know that is fare from this handcraftect solutions, but the company requriments are strict, even I can't reach the dev.to from some weird company policy from my working computer. Any way I like your work!

Collapse Expand
78q6d profile image
uiqtwe6
asasas

Is it a company laptop?

Collapse Expand
pengeszikra profile image
Peter Vivo
The Vibe Archeologist. Creator of mordorjs. |> and touch bar fanatic from Hungary. God speed you! 1John1 + 5John17 |> 1Moses1 = (1Moses2 ... 4.22John21); alpha & omega = !![];
  • Location
    Pomaz
  • Education
    streetwise
  • Work
    full stack developer at TCS
  • Joined

2020 Dell i5 16GB Ram, worn english layout keyboard, but I always using US layout - minor confusion.
A good news copilot cli running on cloud so that capacity don't effect the computer.

Collapse Expand
deepu105 profile image
Deepu K Sasidharan
JHipster co-lead, Polyglot dev, Cloud Native Advocate, Developer Advocate @Okta, Author, Speaker, Software craftsman. Loves simple & beautiful code. bit.ly/JHIPSTER-BOOKS
  • Location
    Utrecht, Netherlands
  • Education
    Electrical & Electronics Engineering
  • Work
    Developer advocate at Okta
  • Joined

Neah

Collapse Expand
fyodorio profile image
Fyodor
Why'd you (software engineers) have to go and make things (software development) so complicated...
  • Location
    Backwoods
  • Education
    MSc, Royal Holloway University of London
  • Work
    Product Engineer
  • Joined

That's a helluva broputer... πŸ˜…

only for bros

Collapse Expand
deepu105 profile image
Deepu K Sasidharan
JHipster co-lead, Polyglot dev, Cloud Native Advocate, Developer Advocate @Okta, Author, Speaker, Software craftsman. Loves simple & beautiful code. bit.ly/JHIPSTER-BOOKS
  • Location
    Utrecht, Netherlands
  • Education
    Electrical & Electronics Engineering
  • Work
    Developer advocate at Okta
  • Joined

I'm gonna steal broputer πŸ˜‚ although not sure if I should be offended or not 🀣

Collapse Expand
fyodorio profile image
Fyodor
Why'd you (software engineers) have to go and make things (software development) so complicated...
  • Location
    Backwoods
  • Education
    MSc, Royal Holloway University of London
  • Work
    Product Engineer
  • Joined

Nah, no offense, that’s a really cool setup made with lots of love and dedication, I’m pretty sure it pays off big time πŸ‘πŸΌ

Collapse Expand
rajas_poorna_0f9376cca3f6 profile image
Rajas Poorna
  • Joined

Lovely setup!
Have you considered using Qwen3.6 35BA3B?
I use it on my MI50 32GB and basically get a 3x boost in tokens/s (both in and out) for not much intelligence penalty. Also probably worth turning on the feature to remember its thinking, given that you can support its full context window.
Once I saw that kind of tokens/s it was hard to justify the slower dense models.

Collapse Expand
deepu105 profile image
Deepu K Sasidharan
JHipster co-lead, Polyglot dev, Cloud Native Advocate, Developer Advocate @Okta, Author, Speaker, Software craftsman. Loves simple & beautiful code. bit.ly/JHIPSTER-BOOKS
  • Location
    Utrecht, Netherlands
  • Education
    Electrical & Electronics Engineering
  • Work
    Developer advocate at Okta
  • Joined

I haven't personally tried it since I saw someone comparing that with dense models for long context tasks and the MOE models hallucinated way more when context was big. I will try it when I have time and see.

Collapse Expand
deepu105 profile image
Deepu K Sasidharan
JHipster co-lead, Polyglot dev, Cloud Native Advocate, Developer Advocate @Okta, Author, Speaker, Software craftsman. Loves simple & beautiful code. bit.ly/JHIPSTER-BOOKS
  • Location
    Utrecht, Netherlands
  • Education
    Electrical & Electronics Engineering
  • Work
    Developer advocate at Okta
  • Joined
• Edited on • Edited

What context are you using

Collapse Expand
vicchen profile image
Vic Chen
AI builder exploring finance & institutional investing. Building tools to decode how the smart money moves. SF Bay Area.
  • Location
    San Francisco, CA
  • Education
    Stanford University, Computer Science
  • Work
    Founder, building AI tools for finance
  • Joined

This is the dream setup for anyone who cares about owning their stack. The llama.cpp + ROCm combo on the Flow Z13 is impressive β€” 128GB unified memory changes the calculus for local AI entirely. I've been thinking about a similar local-first approach for some of my financial data analysis pipelines where I really don't want prompts hitting third-party APIs. The tradeoff you mentioned about context-length slowdown with 27B models matches what I've seen too. Qwen3.6 Q8_0 at 256k context is a solid sweet spot. Thanks for sharing the bench numbers and the archdots repo β€” exactly the kind of practical detail that's hard to find.

Collapse Expand
v_rai_7a0813fcee9d16 profile image
Vikassh.
  • Joined

Nice article. I never thought about this approach before

Collapse Expand
galileo_g_60bdf6defcc5ae7 profile image
Galileo G
  • Joined

Try Krusader or similar 2 pane keyboard heavy file managers.

Collapse Expand
v_rai_7a0813fcee9d16 profile image
Vikassh.
  • Joined

How has this setup performed under real traffic

Collapse Expand
deepu105 profile image
Deepu K Sasidharan
JHipster co-lead, Polyglot dev, Cloud Native Advocate, Developer Advocate @Okta, Author, Speaker, Software craftsman. Loves simple & beautiful code. bit.ly/JHIPSTER-BOOKS
  • Location
    Utrecht, Netherlands
  • Education
    Electrical & Electronics Engineering
  • Work
    Developer advocate at Okta
  • Joined

I have been using it for reviews, quick fixes, repo research etc and have been quite good. Right now building a full fledged filesystem management TUI in Rust. Will report back my findings. So far very impressed, i'm 3 prompts in and its fxing issues after first iteration.

View full discussion (29 comments)

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment's permalink.

For further actions, you may consider blocking this person and/or reporting abuse

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

AltStyle γ«γ‚ˆγ£γ¦ε€‰ζ›γ•γ‚ŒγŸγƒšγƒΌγ‚Έ (->γ‚ͺγƒͺγ‚ΈγƒŠγƒ«) /