[3/N] add prefetch support for CUDA backend : running ds4 for any GPU with cache (2.75 x faster!) #402

Open

yiakwy-xpu-ml-framework-team wants to merge 1 commit into antirez:main from yiakwy-xpu-ml-framework-team:add_prefetch_cache_support_for_cuda


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Jun 12, 2026 •

Introduction

We have verified that our SFT/RL model (a much stronger dsv4), quantized to 2 bits, runs at 15 tokens/sec.

Then it occurred to me: what if I run it on other GPUs using UVM (mapping CPU memory into GPU memory) with a prefetch cache?

For example, on an 80 GB GPU we preload 64 GB of tensors from the 154 GB model.

We have now done it.
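
To make the preload numbers concrete, here is a toy model of a fixed-budget weight cache. It is only a sketch under stated assumptions: `ToyWeightCache` and its methods are hypothetical names, the model is pretended to consist of 154 tensors of 1 GB each, and nothing here reflects the actual CUDA code in this PR.

```python
class ToyWeightCache:
    """Toy model: tensors stay 'GPU resident' up to a fixed budget in GB."""

    def __init__(self, limit_gb):
        self.limit_gb = limit_gb
        self.used_gb = 0.0
        self.resident = {}  # tensor name -> size in GB
        self.hits = 0
        self.misses = 0

    def prefetch(self, name, size_gb):
        """Preload a tensor if it still fits in the budget; never evicts."""
        if name in self.resident or self.used_gb + size_gb > self.limit_gb:
            return False
        self.resident[name] = size_gb
        self.used_gb += size_gb
        return True

    def access(self, name):
        """Report where a tensor would be read from and count hits/misses."""
        if name in self.resident:
            self.hits += 1
            return "gpu"
        self.misses += 1
        return "host"


# Assumption: a 154 GB model split into 154 equal 1 GB tensors.
tensors = {f"tensor_{i}": 1.0 for i in range(154)}
cache = ToyWeightCache(limit_gb=64)

for name, size_gb in tensors.items():
    cache.prefetch(name, size_gb)

# Touch every tensor once, as a uniform access pattern would.
for name in tensors:
    cache.access(name)

hit_rate = cache.hits / (cache.hits + cache.misses)
print(cache.used_gb, cache.hits, cache.misses, round(hit_rate, 3))
# prints: 64.0 64 90 0.416
```

Under this uniform-access assumption about 42% of the weights are served from GPU memory. Real access patterns (for example, hot experts) could make the effective hit rate quite different.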

Server side snapshot
[Screenshot 2026-06-13 02 58 25]
[Screenshot 2026-06-13 02 58 15]

**Client side snapshot**
[Image 087e126d257ee6806d4d640230a6e6c6]

Acceleration

| Config | Speed (tokens/sec) | Model size (GB) |
| --- | --- | --- |
| dsv4 iq2_xxs | 15 | 81 |
| q4 | 2 | 154 |
| q4 + 64GB cache | 5.5 | 154 |
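
The "2.75 x faster" figure in the title follows directly from the q4 rows of the table above. A short check, using only the numbers reported there (the dictionary and function names are just for illustration):

```python
# Throughput figures copied from the table above (tokens/sec).
speeds = {"dsv4 iq2_xxs": 15.0, "q4": 2.0, "q4 + 64GB cache": 5.5}


def speedup(baseline, candidate):
    """Ratio of candidate throughput to baseline throughput."""
    return speeds[candidate] / speeds[baseline]


cache_gain = speedup("q4", "q4 + 64GB cache")
print(f"{cache_gain:.2f}x")  # prints: 2.75x

# The cached q4 run still reaches only part of the 2-bit model's throughput.
vs_2bit = speedup("dsv4 iq2_xxs", "q4 + 64GB cache")
print(f"{vs_2bit:.2f}")  # prints: 0.37
```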

Discussion

This is a follow-up to #368 and #377, but it can be merged independently since it changes the runtime engine, not the quantization toolkit.

Commit cc18da2: add prefetch for CUDA backend, running ds4 for any GPU with cache

acceleration

yiakwy-xpu-ml-framework-team commented Jun 12, 2026

@antirez Sorry to disturb you again! But this is a really important feature: prefetch cache support in the CUDA backend (any CUDA GPU)!

yiakwy-xpu-ml-framework-team commented Jun 12, 2026

How to use it:

DS4_SFT_E2_FP4_MODEL=./gguf/DeepSeek-V4-Flash_e2_v1_Q4KExperts-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
# ds4: --ssd-streaming is currently supported only with --metal
# --ssd-streaming \
# export DS4_CUDA_MODEL_PRELOAD_SIZE_GB=64
export DS4_CUDA_WEIGHT_CACHE_LIMIT_GB=64
# export DS4_CUDA_WEIGHT_PRELOAD=1
export DS4_CUDA_WEIGHT_CACHE=1
# for debugging
export DS4_CUDA_MODEL_COPY_VERBOSE=1
export DS4_CUDA_WEIGHT_CACHE_VERBOSE=1
# important !
export DS4_CUDA_COPY_MODEL_CHUNKED=1
CUDA_VISIBLE_DEVICES=2 \
DS4_MODEL_NAME="deepseek-v4-flash-rl-e5" \
DS4_LOCK_FILE=/tmp/ds4-server-2.lock ./ds4-server \
 --cuda \
 -m "$DS4_SFT_E2_FP4_MODEL" \
 --ssd-streaming-cache-experts 64GB \
 --ctx 256000 \
 --kv-disk-dir /raid/yiakwy/tmp/ds4-kv-gpu4 \
 --kv-disk-space-mb 102400 \
 --host 127.0.0.1 \
 --port 8001
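
For anyone who prefers to launch the server from a script, the same settings can be assembled programmatically. This is a minimal sketch: `build_launch` is a hypothetical helper, the model path in the example is a placeholder, and the environment variable names and flags are simply copied from the command above (a subset of them), not checked against the ds4 source.

```python
import os


def build_launch(model_path, cache_gb=64, port=8001):
    """Return (env, argv) mirroring a subset of the shell command above."""
    env = dict(os.environ)
    env.update({
        "DS4_CUDA_WEIGHT_CACHE": "1",
        "DS4_CUDA_WEIGHT_CACHE_LIMIT_GB": str(cache_gb),
        # Marked "important" in the usage notes above.
        "DS4_CUDA_COPY_MODEL_CHUNKED": "1",
    })
    argv = [
        "./ds4-server",
        "--cuda",
        "-m", model_path,
        "--ssd-streaming-cache-experts", f"{cache_gb}GB",
        "--host", "127.0.0.1",
        "--port", str(port),
    ]
    return env, argv


# Placeholder path, for illustration only.
env, argv = build_launch("./gguf/model.gguf")
print(" ".join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env)`. Keeping the cache size in one parameter avoids the environment variable and the `--ssd-streaming-cache-experts` value drifting apart.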

yiakwy-xpu-ml-framework-team changed the title from "[3/N] add prefetch support for CUDA backend : running ds4 for any GPU with cache (3x faster!)" to "[3/N] add prefetch support for CUDA backend : running ds4 for any GPU with cache (2.75 x faster!)"

Jun 12, 2026

This was referenced Jun 12, 2026

Add prefetch cache support for cuda yiakwy-xpu-ml-framework-team/thirdparty-ds4-ultra-low-bit-fork#1

Open

Add me to maintainer list, then I can add copilot to review codes automatically #403

Open

CUDA backend (DGX-Spark) — refactored into modular .cuh files mirroring ROCm structure #398

Open

ROCm runtime: configurable weight cache limit and arena chunk size via environment variables #397

Open
