Best value: RTX 3090 (used)
A used RTX 3090 at $900 gives you 24GB of VRAM and 936 GB/s of bandwidth, nearly matching the 4090 for RAG capacity at roughly half the price. The trade-off is higher power draw (350W) and an older architecture, but for a dedicated RAG server, it is hard to beat.
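To put the bandwidth figure in context, here is a minimal sketch of a common rule of thumb: when single-stream decoding is memory-bound, every generated token requires reading all of the model weights once, so bandwidth divided by the weight footprint gives a theoretical ceiling on tokens per second. The function name and the 7.5 GB weight footprint (a rough guess for a 13B model at about 4-bit quantization) are assumptions for illustration, not measurements.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    Each generated token reads every weight from VRAM once, so throughput
    cannot exceed bandwidth divided by the weight footprint. Real-world
    numbers land below this ceiling.
    """
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


# 936 GB/s is the RTX 3090 figure quoted above.
RTX_3090_BANDWIDTH_GB_S = 936.0
# ASSUMPTION: approximate footprint of a 13B model at ~4-bit quantization.
ASSUMED_WEIGHTS_GB = 7.5

ceiling = decode_ceiling_tokens_per_s(RTX_3090_BANDWIDTH_GB_S, ASSUMED_WEIGHTS_GB)
print(f"Theoretical decode ceiling: {ceiling:.0f} tokens/s")
```

Treat the result as an upper bound only; prompt processing, KV cache reads, and runtime overhead all pull real throughput below it.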
RAG optimization tips
Run embedding on CPU if VRAM is tight. Modern embedding models like BGE-small or e5-base run fast enough on CPU for most RAG setups. Reserve all your VRAM for the LLM.
Use a lower-bit quantization for the LLM rather than a shorter context. In RAG, the amount of relevant context the model can see matters more than weight precision. A 13B Q4 model with 16K context will usually produce better answers than a 13B Q6 model with 4K context.
Consider splitting stages. Embed documents in batch (overnight if needed), then run inference on a smaller card. The embedding stage is a one-time cost per document.
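One way to keep embedding a one-time cost per document is to cache vectors on disk, keyed by a hash of the document's content, so a nightly batch job only processes new or changed files. The sketch below is a minimal illustration under stated assumptions: `embed_corpus`, `toy_embed`, and the JSON cache file are hypothetical names, and `toy_embed` is a stand-in for whatever embedding model you actually run.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Vector = List[float]


def embed_corpus(
    docs: Dict[str, str],
    embed_fn: Callable[[List[str]], List[Vector]],
    cache_path: Path,
    batch_size: int = 32,
) -> Dict[str, Vector]:
    """Embed only documents whose content is not already cached on disk."""
    cache: Dict[str, Vector] = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    # Key by content hash: an edited document is re-embedded, an unchanged one is not.
    keys = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in docs.items()
    }
    pending: Dict[str, str] = {}
    for doc_id, text in docs.items():
        if keys[doc_id] not in cache:
            pending[keys[doc_id]] = text

    # Send uncached texts to the embedding function in fixed-size batches.
    items = list(pending.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        vectors = embed_fn([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector

    cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return {doc_id: cache[key] for doc_id, key in keys.items()}


# HYPOTHETICAL stand-in for a real embedding model: records each batch size
# and returns a one-dimensional "vector" per text.
calls: List[int] = []


def toy_embed(texts: List[str]) -> List[Vector]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "embeddings.json"
    corpus = {"a.txt": "first document", "b.txt": "second document"}
    first = embed_corpus(corpus, toy_embed, cache_file)
    corpus["c.txt"] = "a newly added document"
    second = embed_corpus(corpus, toy_embed, cache_file)

print("Batch sizes sent to the embedder:", calls)
```

On the second run only the newly added document reaches the embedder, which is the property that lets the batch stage run separately from inference. A real pipeline would more likely store vectors in a vector database than in JSON; the file is used here only to keep the sketch self-contained.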
For more on VRAM planning, see our VRAM requirements guide. If you are on a tighter budget, check our best budget GPU for LLM recommendations. Building a pipeline specifically for document summarization rather than Q&A? Our LLM summarization GPU guide covers the context-length requirements that matter most for that task. If your RAG runs on sensitive corporate or medical data, our best GPU for private AI guide covers the air-gapped deployment angle.
For RAG, buy for VRAM first and bandwidth second. The model plus the context window must fit entirely in GPU memory, or performance falls off a cliff.
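A back-of-the-envelope check for whether a model and its context fit can be sketched as weights plus KV cache plus a fixed allowance for runtime overhead. Everything numeric below is an assumption for illustration: the 13B shape (40 layers, 40 KV heads, head dimension 128, no grouped-query attention), the approximate bits per weight for Q4 and Q6, the 16-bit KV cache, the 1.5 GiB overhead, and the 16 GiB example card. Substitute the figures for your own model and runtime.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    params_billion: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weights_gib(shape: ModelShape, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return shape.params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_bytes(shape: ModelShape, context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_value
    return per_token * context_tokens


def total_gib(
    shape: ModelShape,
    bits_per_weight: float,
    context_tokens: int,
    overhead_gib: float = 1.5,  # ASSUMPTION: activations, buffers, runtime
) -> float:
    """Estimated VRAM needed for weights, a full KV cache, and overhead."""
    kv_gib = kv_cache_bytes(shape, context_tokens) / GIB
    return weights_gib(shape, bits_per_weight) + kv_gib + overhead_gib


# ASSUMPTION: a 13B model shape without grouped-query attention.
ASSUMED_13B = ModelShape(params_billion=13.0, n_layers=40, n_kv_heads=40, head_dim=128)
# ASSUMPTION: rough effective bits per weight for each quantization level.
ASSUMED_BITS = {"Q4": 4.5, "Q6": 6.5}
VRAM_GIB = 16.0  # example card size, chosen for illustration

for quant, bits in ASSUMED_BITS.items():
    for context in (4096, 8192, 16384):
        need = total_gib(ASSUMED_13B, bits, context)
        verdict = "fits" if need <= VRAM_GIB else "does not fit"
        print(f"13B {quant} @ {context} tokens: {need:.1f} GiB, {verdict} in {VRAM_GIB:.0f} GiB")
```

Under these assumed figures, the Q4 build leaves room for an 8K context on a 16 GiB card while the Q6 build does not, and the KV cache at 16K context is larger than the Q4 weights themselves. Models that use grouped-query attention or a quantized KV cache need far less memory per token, so the estimate is only as good as the shape you feed it.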