Great writeup, and the dispatch-bound vs compute-bound framing is the bit that generalizes.
One thing I'd be curious about on the CPU column: did you keep it to stock PyTorch for an apples-to-apples comparison, or try exporting bge-small to ONNX and running it through ONNX Runtime / OpenVINO with int8 quantization? In my experience that drops CPU latency on small embedders a fair bit further, which would only widen the point you're making. At batch=1 an optimized CPU path can end up being the pragmatic default, and you skip the entire driver-ABI saga.
Dispatch-bound at batch=1 is the right read.
The thing that flips it: put the embedder behind an async queue and have the server coalesce 32-128 docs with a ~5-10ms micro-batching window. bge-small on a modest CUDA card pulls ahead of CPU once you amortize dispatch. Single-query benchmarks won't show this.
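A minimal sketch of that coalescing loop, assuming an asyncio server on Python 3.9+. `MicroBatcher` and `embed_batch` are hypothetical names, with `embed_batch` standing in for whatever batched call the model exposes; the 64-doc cap and 8 ms window are just values picked from the ranges above.

```python
import asyncio
from typing import Callable, List, Sequence

Vector = List[float]


class MicroBatcher:
    """Coalesce single-document requests into one batched embedder call."""

    def __init__(
        self,
        embed_batch: Callable[[Sequence[str]], List[Vector]],  # hypothetical model call
        max_batch: int = 64,
        window_s: float = 0.008,
    ) -> None:
        self._embed_batch = embed_batch
        self._max_batch = max_batch
        self._window_s = window_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def embed(self, doc: str) -> Vector:
        """Called per request; resolves once the doc's batch has been embedded."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((doc, fut))
        return await fut

    async def run(self) -> None:
        """Background task: drain the queue into batches until cancelled."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then open the window.
            items = [await self._queue.get()]
            deadline = loop.time() + self._window_s
            while len(items) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(
                        await asyncio.wait_for(self._queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            docs = [doc for doc, _ in items]
            try:
                # Run the blocking model call off the event loop.
                vectors = await asyncio.to_thread(self._embed_batch, docs)
            except Exception as exc:
                for _, fut in items:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), vec in zip(items, vectors):
                if not fut.done():
                    fut.set_result(vec)
```

Start `run()` once as a background task with `asyncio.create_task(batcher.run())` and have each request handler `await batcher.embed(doc)`. Create the batcher inside the running event loop so the queue binds to the right loop on older Python versions.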
On the CUDA 13 / driver 535 mismatch:
`python -c 'import torch; print(torch.version.cuda)'` against `nvcc --version` is a faster smoke test than waiting for `is_available()` to lie quietly.
MPS caveat: unified memory removes the H2D copy but the MPS graph compiler re-traces on shape changes. Variable-length batching can erase the gain unless you bucket by token count.
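For the bucketing in the MPS caveat, something along these lines might be enough. It is a rough sketch: `bucket_by_length` and `count_tokens` are hypothetical names, and in practice `count_tokens` would wrap the model's real tokenizer. The idea is to round each text's token count up to a bucket boundary so the padded sequence length only ever takes a handful of values.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def bucket_by_length(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],  # hypothetical: wrap your tokenizer here
    bucket_width: int = 32,
    max_batch: int = 64,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (padded_length, batch) pairs sharing one padded sequence length."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for text in texts:
        n = max(1, count_tokens(text))
        # Round up to the next multiple of bucket_width.
        padded = -(-n // bucket_width) * bucket_width
        buckets[padded].append(text)
    for padded in sorted(buckets):
        group = buckets[padded]
        for i in range(0, len(group), max_batch):
            yield padded, group[i:i + max_batch]
```

Pad every batch to its `padded_length` before the forward pass. Two things this sketch leaves out: input order is not preserved, so carry indices alongside the texts if you need to reassemble results, and the last chunk of each bucket can be smaller than `max_batch`, which still changes the batch dimension.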