Your GPU Probably Isn't Helping Your Retrieval System

DEV Community

Asen Mitrev

• Jun 2

Invaluable input, thank you. I have two 3090s, but one of them has to host an LLM, probably Qwen 3.6 27B.

Abdullah Shahin

• Jun 3

Dispatch-bound at batch=1 is the right read.

The thing that flips it: put the embedder behind an async queue and have the server coalesce 32-128 docs with a ~5-10ms micro-batching window. bge-small on a modest CUDA card pulls ahead of CPU once you amortize dispatch. Single-query benchmarks won't show this.
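A minimal sketch of that micro-batching idea, using only asyncio. The class name `MicroBatcher`, the `embed_batch` callable, and the defaults (64 docs, 8 ms window) are assumptions for illustration; in a real server `embed_batch` would wrap whatever model call you already have.

```python
import asyncio
from typing import Callable, List, Sequence


class MicroBatcher:
    """Hypothetical helper: coalesce single-text requests into batches."""

    def __init__(
        self,
        embed_batch: Callable[[Sequence[str]], List[List[float]]],
        max_batch: int = 64,      # assumed cap, inside the 32-128 range
        window_ms: float = 8.0,   # assumed window, inside the ~5-10 ms range
    ) -> None:
        self._embed_batch = embed_batch
        self._max_batch = max_batch
        self._window = window_ms / 1000.0
        self._queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # kept only so batching is observable

    async def embed(self, text: str) -> List[float]:
        """Called once per request; resolves when the batch has been embedded."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def run(self) -> None:
        """Worker loop: wait for one item, then collect more until the window closes."""
        loop = asyncio.get_running_loop()
        while True:
            items = [await self._queue.get()]
            deadline = loop.time() + self._window
            while len(items) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(items))
            texts = [text for text, _ in items]
            try:
                # Run the model call in a thread so the event loop keeps accepting requests.
                vectors = await loop.run_in_executor(None, self._embed_batch, texts)
            except Exception as exc:
                for _, fut in items:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), vec in zip(items, vectors):
                if not fut.done():
                    fut.set_result(vec)
```

Request handlers call `await batcher.embed(text)` while a single `asyncio.create_task(batcher.run())` worker drains the queue, so one model dispatch is shared across every request that arrived inside the window. Create the batcher from inside the running event loop.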

On the CUDA 13 / driver 535 mismatch: python -c 'import torch; print(torch.version.cuda)' against nvcc --version is a faster smoke test than waiting for is_available() to lie quietly.
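A variation on that smoke test as a small function, sketched under one assumption that differs from the one-liner above: the second version string is the "CUDA Version" shown in the `nvidia-smi` header (what the driver supports) rather than `nvcc --version` (which reports the installed toolkit). The function name and return labels are hypothetical, and the same-major case is only flagged because CUDA's minor-version compatibility may or may not cover it.

```python
from typing import Optional, Tuple


def _parse(version: str) -> Tuple[int, int]:
    """Turn '12.2' into (12, 2); extra components are ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_cuda_build(torch_cuda: Optional[str], driver_cuda: str) -> str:
    """Compare the CUDA version PyTorch was built with against the driver's.

    torch_cuda  -- value of torch.version.cuda (None on CPU-only builds)
    driver_cuda -- assumed to be the 'CUDA Version' from the nvidia-smi header
    """
    if torch_cuda is None:
        return "cpu-only build"
    build, driver = _parse(torch_cuda), _parse(driver_cuda)
    if build[0] > driver[0]:
        # e.g. a CUDA 13 build on a driver that tops out at 12.x
        return "mismatch"
    if build > driver:
        # Same major, newer minor: relies on minor-version compatibility.
        return "minor-version compatibility"
    return "ok"
```

With a 535-series driver reporting CUDA 12.2, `check_cuda_build("13.0", "12.2")` returns `"mismatch"`, which is the case where `is_available()` quietly comes back `False`.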

MPS caveat: unified memory removes the H2D copy but the MPS graph compiler re-traces on shape changes. Variable-length batching can erase the gain unless you bucket by token count.
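Bucketing by token count can be sketched like this. The function name and the boundary values are assumptions, not anything from the article; the point is only that padding every batch up to one of a handful of fixed lengths keeps the number of distinct input shapes small.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Sequence


def bucket_by_token_count(
    token_counts: Sequence[int],
    boundaries: Sequence[int] = (32, 64, 128, 256, 512),  # assumed padded lengths
) -> Dict[int, List[int]]:
    """Group document indices by the smallest boundary that fits each one.

    Every document in a bucket is later padded to that bucket's boundary,
    so the model only ever sees len(boundaries) distinct sequence lengths.
    """
    buckets: Dict[int, List[int]] = defaultdict(list)
    for idx, n_tokens in enumerate(token_counts):
        pos = bisect_left(boundaries, n_tokens)
        if pos == len(boundaries):
            raise ValueError(f"document {idx} has {n_tokens} tokens; truncate it first")
        buckets[boundaries[pos]].append(idx)
    return dict(buckets)
```

For example, `bucket_by_token_count([5, 32, 33, 200, 64])` groups the documents as `{32: [0, 1], 64: [2, 4], 256: [3]}`. The trade-off is extra padding inside each bucket in exchange for fewer shape changes; fixing the batch dimension as well would take further padding or dropping partial batches.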

Marko Frei

• Jun 6

Great writeup, and the dispatch-bound vs compute-bound framing is the bit that generalizes.

One thing I'd be curious about on the CPU column: did you keep it to stock PyTorch for an apples-to-apples comparison, or try exporting bge-small to ONNX and running it through ONNX Runtime / OpenVINO with int8 quantization? In my experience that drops CPU latency on small embedders a fair bit further, which would only widen the point you're making. At batch=1 an optimized CPU path can end up being the pragmatic default, and you skip the entire driver-ABI saga.