DEV Community

Abdullah Shahin
Building http://hivein.ai. Most agents look great in a demo; we instrument why they break in prod. The landing page is itself an agent built on us.

Dispatch-bound at batch=1 is the right read.

The thing that flips it: put the embedder behind an async queue and have the server coalesce 32-128 docs with a ~5-10ms micro-batching window. bge-small on a modest CUDA card pulls ahead of CPU once you amortize dispatch. Single-query benchmarks won't show this.
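A minimal sketch of that coalescing loop, assuming an asyncio server. `MicroBatcher`, `fake_embed_batch`, and the default window and batch size are hypothetical names and values chosen for illustration; the fake embedder stands in for a real bge-small call so the sketch runs without a GPU.

```python
import asyncio

calls = []  # records the batch size of every embedder call, for inspection


def fake_embed_batch(texts):
    # Hypothetical stand-in for a real batched embedder call.
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


class MicroBatcher:
    def __init__(self, embed_batch, max_batch=64, window_s=0.008):
        self.embed_batch = embed_batch
        self.max_batch = max_batch
        self.window_s = window_s
        self.queue = asyncio.Queue()

    async def embed(self, text):
        # Each caller enqueues its text and waits on its own future.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request, then open the batching window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.window_s
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
            try:
                # Assumption: a real server would run this in an executor
                # so the event loop is not blocked during inference.
                vectors = self.embed_batch([text for text, _ in batch])
            except Exception as exc:
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), vec in zip(batch, vectors):
                if not fut.done():
                    fut.set_result(vec)


async def demo(n=10):
    batcher = MicroBatcher(fake_embed_batch, max_batch=64, window_s=0.01)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.embed("doc %d" % i) for i in range(n)))
    worker.cancel()
    return results


if __name__ == "__main__":
    print(len(asyncio.run(demo())), "vectors from", len(calls), "embedder call(s)")
```

The window trades a few milliseconds of added latency per request for fewer, larger dispatches, which is why a single-query benchmark never exercises it.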

On the CUDA 13 / driver 535 mismatch: python -c 'import torch; print(torch.version.cuda)' against nvcc --version is a faster smoke test than waiting for is_available() to lie quietly.
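The same smoke test as a small script, as a rough sketch. The helper names are hypothetical, and it only compares major versions. Note that `nvcc --version` reports the installed toolkit; the driver version itself is what `nvidia-smi` reports, so treat this as a first check rather than a full diagnosis.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    # nvcc prints a line like "Cuda compilation tools, release 12.2, V12.2.140".
    match = re.search(r"release (\d+)\.(\d+)", output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def nvcc_cuda_version():
    if shutil.which("nvcc") is None:
        return None  # toolkit not on PATH
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    try:
        import torch
    except ImportError:
        return None
    if torch.version.cuda is None:
        return None  # CPU-only build
    major, minor = torch.version.cuda.split(".")[:2]
    return (int(major), int(minor))


def report():
    built, toolkit = torch_cuda_version(), nvcc_cuda_version()
    if built is None or toolkit is None:
        return "unknown: torch=%s nvcc=%s" % (built, toolkit)
    status = "ok" if built[0] == toolkit[0] else "MAJOR VERSION MISMATCH"
    return "%s: torch built for CUDA %d.%d, nvcc is %d.%d" % ((status,) + built + toolkit)


if __name__ == "__main__":
    print(report())
```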

MPS caveat: unified memory removes the H2D copy but the MPS graph compiler re-traces on shape changes. Variable-length batching can erase the gain unless you bucket by token count.
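Bucketing by token count could look roughly like this. The boundary values and the `bucket_by_length` helper are assumptions for illustration: each sequence is assigned to the smallest boundary that fits it, so after padding to the boundary the model only ever sees a handful of distinct sequence lengths.

```python
from bisect import bisect_left
from collections import defaultdict


def bucket_by_length(token_lengths, boundaries=(32, 64, 128, 256, 512)):
    """Group sample indices by the smallest boundary >= their token count."""
    buckets = defaultdict(list)
    for idx, n_tokens in enumerate(token_lengths):
        pos = bisect_left(boundaries, n_tokens)
        if pos == len(boundaries):
            raise ValueError("sequence %d has %d tokens, above the largest bucket"
                             % (idx, n_tokens))
        buckets[boundaries[pos]].append(idx)
    return dict(buckets)


if __name__ == "__main__":
    lengths = [5, 40, 32, 130, 64]
    for pad_to, indices in sorted(bucket_by_length(lengths).items()):
        print("pad to", pad_to, "-> samples", indices)
```

Batch size is also part of the tensor shape, so a real pipeline would probably want to fix or bucket that dimension as well.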

Marko Frei
Developer focused on building reliable software, learning modern technologies, solving real-world problems, and collaborating with the global dev community.

Great writeup, and the dispatch-bound vs compute-bound framing is the bit that generalizes.

One thing I'd be curious about on the CPU column: did you keep it to stock PyTorch for an apples-to-apples comparison, or try exporting bge-small to ONNX and running it through ONNX Runtime / OpenVINO with int8 quantization? In my experience that drops CPU latency on small embedders a fair bit further, which would only widen the point you're making. At batch=1 an optimized CPU path can end up being the pragmatic default, and you skip the entire driver-ABI saga.
