Blackwell-optimized llama.cpp Docker image – works on all NVIDIA GPUs, but tuned for RTX 50 series. Built from scratch with CUDA 12.8, sm_120, NVFP4-ready. 250+ tok/s on 4B F16. Includes llama-chat script.
Updated Mar 28, 2026
Windows prebuilt of llama.cpp combining Multi-Token Prediction (MTP) + TurboQuant KV cache compression + native sm_120 (Blackwell consumer GPU, FP4 tensor cores). For RTX 5060 Ti / 5070 / 5080 / 5090.
Optimized CSM-1B TTS pipeline for RTX 5090 (Blackwell sm_120). CUDA graph replay via patched HF Transformers. ~0.46x RTF. Topics (tags): csm, text-to-speech, rtx-5090, blackwell, cuda-graphs, torch-compile, sesame, streaming, pytorch
GEN3C: Generative Novel 3D Captions - Adapted for NVIDIA Blackwell GPU architecture (sm_120). Includes automatic GPU detection, CPU-based T5 text encoding for Blackwell compatibility, and full backward compatibility with older GPUs.
OpenAlchemy fork of whisper.cpp that adds NULL-slot guards for RTX 50-series Blackwell (sm_120) GPUs on top of upstream. Powers the /v1/audio/transcriptions endpoint.
Run llama.cpp with Multi-Token Prediction and TurboQuant on Windows using native sm_120 Blackwell support for RTX 50-series GPUs.
Run Blackwell-optimized llama.cpp inference in a ready-to-use NVIDIA Docker image with CUDA 12.8, sm_120, and NVFP4 support.