Here is the full consumer NVLink support table:
| GPU | Architecture | NVLink support | Bandwidth |
| --- | --- | --- | --- |
| RTX 2080 Ti | Turing | Yes (NVLink 2.0) | 100 GB/s |
| RTX 3090 | Ampere | Yes (NVLink 3.0) | 112.5 GB/s |
| RTX 3090 Ti | Ampere | No | n/a |
| RTX 4070 Ti / 4080 / 4090 | Ada Lovelace | No | n/a |
| RTX 5060 Ti / 5070 / 5080 / 5090 | Blackwell | No | n/a |
| RTX PRO 6000 Blackwell | Blackwell (workstation) | Yes (NVLink 5.0) | 1,800 GB/s |
The RTX 3090 Ti, announced the same generation, did not include the NVLink connector — making the base RTX 3090 the last consumer card with it. The RTX 4090 dropped NVLink entirely; NVIDIA stated it used the freed space for additional AI processing circuitry. The RTX 5090 and the rest of the 50-series continue that pattern.
What this means practically: if you want NVLink in a home lab, your only realistic option is a pair of used RTX 3090s with an NVLink bridge. Everything else is PCIe.
The bandwidth reality
To understand what this costs in performance, compare the numbers:
| Interconnect | Bandwidth (bidirectional) | Typical home-lab hardware |
| --- | --- | --- |
| PCIe 4.0 x16 | 64 GB/s | Most AMD and Intel desktop platforms |
| PCIe 5.0 x16 | 128 GB/s | Z790, X670E, AM5 with Gen 5 slot |
| NVLink 3.0 (RTX 3090 pair) | 112.5 GB/s | RTX 3090 + NVLink bridge |
| NVLink 3.0 (A100 pair) | 600 GB/s | Data center, out of home-lab budget |
| NVLink 4.0 (H100 pair) | 900 GB/s | Data center |
One important detail for dual-GPU desktop builds: when you install two cards in a typical consumer motherboard, each card gets x8 PCIe lanes rather than x16, because the CPU's PCIe lanes are split between slots. On PCIe 4.0, x8 = 32 GB/s bidirectional. On PCIe 5.0, x8 = 64 GB/s bidirectional.
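The lane arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that uses the same rounded per-lane figures the table implies (about 2 GB/s per lane per direction for PCIe 4.0, doubling for PCIe 5.0). The function name and lookup table are illustrative, and real links deliver somewhat less after encoding and protocol overhead.

```python
# Nominal per-lane, per-direction throughput in GB/s (rounded figures;
# real-world numbers are slightly lower after protocol overhead).
PER_LANE_GBPS = {4: 2.0, 5: 4.0}


def pcie_bandwidth_gbps(generation: int, lanes: int, bidirectional: bool = True) -> float:
    """Return the nominal bandwidth of a PCIe link in GB/s."""
    if generation not in PER_LANE_GBPS:
        raise ValueError(f"unsupported PCIe generation: {generation}")
    one_way = PER_LANE_GBPS[generation] * lanes
    return one_way * 2 if bidirectional else one_way


if __name__ == "__main__":
    for gen, lanes in [(4, 16), (4, 8), (5, 16), (5, 8)]:
        bw = pcie_bandwidth_gbps(gen, lanes)
        print(f"PCIe {gen}.0 x{lanes}: {bw:.0f} GB/s bidirectional")
```

Passing `bidirectional=False` gives the one-way figure, which is often the more relevant number when a transfer flows mostly in one direction.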
GPU-to-GPU communication over PCIe also routes through the CPU memory controller — data moves from GPU 0 → CPU → GPU 1 — which adds latency that direct NVLink connections avoid entirely. The RTX 3090's NVLink bridge is a direct GPU-to-GPU connection at 112.5 GB/s with no CPU hop.
For tensor-parallel inference, where each token processed requires all-reduce operations between GPUs, that bandwidth gap translates directly into throughput. Benchmarks from a 4x RTX 3090 cluster found NVLink improves inference throughput by approximately 50% for 2-GPU tensor-parallel pairs, and around 10% for 4-GPU setups where only half of GPU pairs are bridged and the rest communicate over PCIe.
When a second GPU actually helps — and when it makes things worse
Adding a second GPU is not always an upgrade. The outcome depends entirely on the relationship between model size and your GPU's VRAM.
Scenario 1: Model doesn't fit on one card. If you are trying to run Llama 3.3 70B Q4 (requires ~42 GB) on a single RTX 4090 (24 GB), the model simply cannot load. A second 4090 brings you to 48 GB total and the model runs. In this case, the second card is not optional — it is a requirement.
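Whether a model needs a second card comes down to simple arithmetic. The sketch below uses a hypothetical helper, `gpus_needed`, and an assumed 10% per-card reserve for KV cache and runtime overhead (that reserve is an illustrative assumption, not a figure from any tool's documentation) to estimate how many identical cards a given model footprint requires.

```python
import math


def gpus_needed(model_gb: float, vram_per_gpu_gb: float, reserve_fraction: float = 0.10) -> int:
    """Smallest number of identical GPUs whose usable VRAM holds the model.

    reserve_fraction is an assumed share of each card's VRAM kept free for
    KV cache, activations, and runtime overhead; tune it for your workload.
    """
    usable = vram_per_gpu_gb * (1.0 - reserve_fraction)
    if usable <= 0:
        raise ValueError("reserve_fraction leaves no usable VRAM")
    return math.ceil(model_gb / usable)


# The 70B Q4 example from the text: ~42 GB of weights on 24 GB cards.
cards_for_70b = gpus_needed(42, 24)
```

With the numbers from this scenario, a 42 GB model on 24 GB cards comes out at two GPUs, which matches the dual RTX 4090 example.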
Scenario 2: Model fits on one card, you add a second anyway. This is where people get surprised. If you are running Ollama with a 14B model that fits comfortably in 24 GB of VRAM, Ollama will automatically detect your second GPU and split layers across both cards. The result, counterintuitively, is slower inference, because every token now requires PCIe data transfers between cards that were not necessary when the model lived on one GPU. Ollama's official documentation confirms this behavior: a second GPU accelerates large models that require VRAM pooling, but it hurts small models that would otherwise run fully on one card.
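One way to avoid the split, sketched below, is to hide the second card from Ollama so that a model that fits on one GPU stays there. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting visible devices, and Ollama's GPU documentation describes using it to select GPUs. The helper names and the choice of GPU index 0 are illustrative, and the launch function assumes the `ollama` binary is on your PATH.

```python
import os
import subprocess


def single_gpu_env(gpu_index: int = 0) -> dict:
    """Copy the current environment, exposing only one NVIDIA GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_ollama_on_one_gpu(gpu_index: int = 0) -> None:
    """Start `ollama serve` so that it can only see the chosen GPU.

    Assumes no other Ollama instance is already running on this machine.
    """
    subprocess.run(["ollama", "serve"], env=single_gpu_env(gpu_index), check=True)
```

Calling `launch_ollama_on_one_gpu(0)` runs the server in the foreground. If Ollama runs as a system service instead, the same variable would need to be set in the service's environment.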
Scenario 3: High-concurrency serving. If you are running vLLM and serving 10+ simultaneous users, tensor parallelism across two GPUs can roughly double throughput compared to a single-GPU setup, because both GPUs work on each request in parallel. The PCIe overhead is amortized across many concurrent requests. This is the use case where PCIe multi-GPU genuinely earns its keep even without NVLink.
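As a rough sketch of how this kind of setup is typically launched, the snippet below assembles a `vllm serve` command with `--tensor-parallel-size`, the vLLM option that shards a model across GPUs. The helper names are hypothetical, and the model identifier is whatever quantized checkpoint you actually plan to serve.

```python
import subprocess


def vllm_serve_command(model: str, tensor_parallel_size: int = 2) -> list:
    """Build the argv for a vLLM server that uses tensor parallelism."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def launch_vllm(model: str, tensor_parallel_size: int = 2) -> None:
    """Run the server in the foreground; requires vLLM and enough GPUs."""
    subprocess.run(vllm_serve_command(model, tensor_parallel_size), check=True)
```

Keeping the command builder separate from the launcher makes it easy to print or log the exact command before committing both GPUs to it.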
The decision matrix:
| Situation | Add second GPU? | Reasoning |
| --- | --- | --- |
| 70B+ model, single GPU too small | Yes, required | VRAM pooling is the only path |
| Personal use, <14B models | No, makes it slower | PCIe overhead > compute gain |
| vLLM serving, 10+ concurrent users | Yes | Throughput scales well |
| Fine-tuning / QLoRA | Cloud instead | See cloud GPU math |
| Ollama, model fits on one card | No | Ollama adds overhead, not speed |
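The decision matrix can also be expressed as a small function. This is only a sketch of the rules as stated above: the function name, parameter names, and return strings are illustrative, and the 10-user threshold is taken directly from the matrix rather than from any benchmark.

```python
def second_gpu_advice(fits_on_one_gpu: bool,
                      concurrent_users: int = 1,
                      fine_tuning: bool = False) -> str:
    """Return a recommendation that mirrors the decision matrix."""
    if fine_tuning:
        # Fine-tuning / QLoRA: the matrix points to cloud GPUs instead.
        return "cloud instead"
    if not fits_on_one_gpu:
        # VRAM pooling is the only way to load the model at all.
        return "yes, required"
    if concurrent_users >= 10:
        # High-concurrency serving amortizes the PCIe overhead.
        return "yes"
    # Model fits on one card and load is light: a second GPU adds overhead.
    return "no"
```

For example, a personal setup where the model fits on one card returns "no", while the same model served to 10 or more concurrent users returns "yes".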
The RTX 3090 NVLink setup: what it actually buys you
For home-lab users who specifically want NVLink, this is the only practical path. Two used RTX 3090s connected with an NVLink bridge give you:
- 48 GB combined VRAM: enough for Llama 3.3 70B at Q4_K_M with context headroom
- 112.5 GB/s GPU-to-GPU bandwidth: roughly 3.5× the bandwidth of PCIe 4.0 x8 (32 GB/s)
- Roughly 50% higher throughput than the same two 3090s without NVLink in tensor-parallel configurations
Hardware required:
- Two RTX 3090 cards (NOT 3090 Ti — that card has no NVLink connector)
- One NVIDIA NVLink Bridge 4-slot (ASIN B08S1RYPP6 on Amazon, also available from Newegg). Originally $79 MSRP; as of May 2026, available on Amazon and eBay in the $50–$80 range
- A motherboard with two PCIe x16/x8 slots with sufficient slot spacing for the 4-slot bridge
The thermal reality: Two RTX 3090s at full inference load draw approximately 350W each, putting the combined GPU power draw at ~700W. The NVLink bridge sits between the cards, blocking airflow between them. A dual-3090 NVLink rig almost always requires aftermarket solutions: open-air cases, additional case fans directly above the GPU stack, or liquid cooling. The dual RTX 3090 cooling problem is well documented, and addressing it is not optional. Plan the power supply accordingly: a 1200W+ PSU is prudent.
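The PSU guidance can be sanity-checked with a quick calculation. The sketch below is a rule-of-thumb estimate, not a vendor sizing method: the 250W allowance for the CPU and the rest of the system, the 25% headroom factor, and the function name are all assumptions chosen for illustration.

```python
import math


def recommended_psu_watts(gpu_watts: float,
                          gpu_count: int,
                          rest_of_system_watts: float = 250.0,
                          headroom: float = 1.25,
                          step: int = 50) -> int:
    """Estimate a PSU rating, rounded up to the next `step` watts.

    rest_of_system_watts and headroom are assumed values; adjust them
    for your CPU, drives, fans, and tolerance for transient spikes.
    """
    sustained_load = gpu_watts * gpu_count + rest_of_system_watts
    return int(math.ceil(sustained_load * headroom / step) * step)


# Two RTX 3090s at roughly 350W each, as described in the text.
dual_3090_psu = recommended_psu_watts(350, 2)
```

Under these assumptions the dual-3090 build lands at 1200W, in line with the recommendation above. A hotter CPU or more headroom would push the estimate higher.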
For more context on the RTX 3090's value proposition individually, see Used RTX 3090 in 2026: Still the AI Value King?
Multi-GPU over PCIe: dual RTX 4090 and beyond
For the majority of multi-GPU home-lab builds in 2026 — dual RTX 4090, dual RTX 5090, any combination without NVLink — PCIe is the interconnect. Here is what to expect.
Dual RTX 4090 running Llama 3.3 70B Q4: approximately 25–30 tokens/sec generation speed with vLLM tensor parallelism. A single RTX 4090 cannot run this model at all (insufficient VRAM), so the comparison i