PacifAIst/Quansloth


🦥 Quansloth: TurboQuant Local AI Server

 ____ _ _ _ 
 / __ \ | | | | | | 
 | | | |_ _ __ _ _ __ ___| | ___ | |_| |__ 
 | | | | | | |/ _` | '_ \ / __| |/ _ \| __| '_ \ 
 | |__| | |_| | (_| | | | |\__ \ | (_) | |_| | | |
 \___\_\\__,_|\__,_|_| |_||___/_|\___/ \__|_| |_|
 [ POWERED BY TURBOQUANT+ | NVIDIA CUDA ]

License: Apache 2.0 Platform: Linux | WSL2 Backend: CUDA

🥉 Achievement Unlocked: 128+ Stars!

Bronze Starstruck Medal

We just bagged our second Starstruck medal!

A massive thank you to this amazing community. Hitting 128 stars—a beautiful power of two—is a huge milestone. Your support, feedback, and contributions are what fuel the code and keep this project leveling up. Let's aim for the next high score (256)! 🚀👾


Breaking the VRAM Wall: Built on an implementation of Google's TurboQuant (ICLR 2026), Quansloth brings KV cache compression to local LLM inference.

Quansloth is a fully private, air-gapped AI server that runs long-context models natively on consumer hardware (like an RTX 3060). By bridging a custom Gradio Python frontend with an optimized llama.cpp CUDA backend, Quansloth compresses the KV cache by up to 75%, freeing VRAM for longer contexts.

🛑 Why Quansloth? (No More GPU Crashes)

Standard LLM inference often hits a "Memory Wall" when processing long documents; as the context grows, the GPU runs out of memory (OOM) and the system crashes.

Quansloth prevents these crashes by:

  • 75% Cache Shrink: Compressing the KV cache (the model's working "memory") from 16-bit to 4-bit values with TurboQuant.
  • Massive Context on Budget GPUs: Run 32k+ token contexts on a 6GB RTX 3060 that would normally require a 24GB RTX 4090.
  • Hardware-Level Stability: Our interface monitors the CUDA backend to ensure the model stays within your GPU's physical limits, allowing for stable, long-form document analysis without the fear of a system hang.
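
To make the 75% figure concrete, here is a rough back-of-the-envelope sketch of KV cache size at 16-bit versus 4-bit precision. The model shape is an assumption (Llama 3 8B: 32 layers, 8 KV heads, head dimension 128), and the formula ignores the small per-block metadata that real quantized formats store, so actual savings will be somewhat lower.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return values * bits_per_value / 8


# Assumed model shape (Llama 3 8B); adjust for your own model.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32_768)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q4 = kv_cache_bytes(**shape, bits_per_value=4)

GIB = 1024 ** 3
print(f"16-bit cache: {fp16 / GIB:.2f} GiB")  # 4.00 GiB
print(f" 4-bit cache: {q4 / GIB:.2f} GiB")    # 1.00 GiB
print(f"saved: {1 - q4 / fp16:.0%}")          # 75%
```

Under these assumptions the cache alone drops from 4 GiB to 1 GiB at a 32k context. The model weights still need their own VRAM on top of this.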


📸 Interface Preview

[Screenshot: Quansloth interface]


🖥️ OS Compatibility

  • Windows 10/11: Fully Supported (via WSL2 Ubuntu). Features a 1-click .bat launcher.
  • Linux: Fully Supported (Native).
  • macOS: Not officially supported out-of-the-box (backend optimized for NVIDIA CUDA GPUs).

✨ Features

  • TurboQuant Cache Compression: Run 8,192+ token contexts natively on 6GB GPUs without Out-Of-Memory (OOM) crashes.
  • Live Hardware Analytics: The UI reads the C++ engine logs to report your VRAM allocation and savings in real time.
  • Context Injector: Upload long documents (PDF, TXT, CSV, MD) directly into the chat stream to test the AI's memory limits.
  • Dual-Routing: Auto-scan your local models/ folder, or input custom absolute paths to load any .gguf file.
  • Cyberpunk UI: A sleek, fully responsive dark-mode dashboard built for power users.

🛠️ Prerequisites

  • Windows with WSL2 (Ubuntu) OR native Linux
  • NVIDIA GPU with updated drivers
  • Miniconda or Anaconda installed
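
Before installing, it can help to confirm that the NVIDIA driver is visible from the shell you will run Quansloth in (inside WSL2 on Windows). The following is a minimal sketch, not part of Quansloth itself; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical. It relies only on the standard `nvidia-smi` query flags.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory_mib' lines into (name, memory_mib) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        name, _, mem = line.rpartition(",")
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            continue  # skip lines without a numeric memory field
    return gpus


def query_gpus():
    """Return detected NVIDIA GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


gpus = query_gpus()
if not gpus:
    print("No NVIDIA GPU detected; check your driver installation.")
for name, mem_mib in gpus:
    print(f"{name}: {mem_mib} MiB VRAM")
```

An empty result usually means the driver is missing or, under WSL2, that the Windows-side NVIDIA driver does not expose the GPU to the Linux guest.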

🚀 Installation

1. Prepare Python Environment

conda create -n quansloth python=3.10 -y
conda activate quansloth

2. Clone Repository

git clone https://github.com/PacifAIst/Quansloth.git
cd Quansloth

3. Run Installer

chmod +x install.sh
./install.sh

🎮 Usage

Adding Models

Download .gguf models (e.g., Llama 3 8B) and place them in:

models/
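
To check which models will be visible after copying files in, a short script can list the .gguf files in that folder. This is an illustrative sketch only: the function name `find_gguf_models` is hypothetical, and Quansloth's own auto-scan may behave differently (for example, it may or may not recurse into subfolders).

```python
from pathlib import Path


def find_gguf_models(models_dir="models"):
    """Return sorted .gguf files directly inside models_dir (empty list if missing)."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


# Run from the repository root so that "models" resolves correctly.
for model in find_gguf_models():
    size_gib = model.stat().st_size / 1024 ** 3
    print(f"{model.name}  ({size_gib:.2f} GiB)")
```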

Start Server (Windows - 1 Click)

  • Use Launch_Quansloth.bat
  • Double-click → auto-launches WSL, Conda, and server

Start Server (Linux / WSL)

conda activate quansloth
python quansloth_gui.py

Connect

http://127.0.0.1:7860
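
If the page does not load, a quick way to tell whether the server is listening at all is to probe the port. A minimal sketch, assuming the default address shown above; `is_port_open` is a hypothetical helper, not part of Quansloth.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if is_port_open():
    print("Server is listening on http://127.0.0.1:7860")
else:
    print("Nothing is listening on port 7860; is quansloth_gui.py running?")
```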

🎛️ Pro Tips

  • Symmetric (Turbo3) → Best overall compression
  • Asymmetric (Q8/Turbo4) → Better for Q4_K_M models (e.g., Qwen)
  • Monitor Hardware Stats for real-time VRAM savings

📜 License & Credits

  • License: This project is licensed under the Apache 2.0 License.
  • Core Technology: Built upon the TurboQuant+ implementation developed by TheTom (@TheTom).
  • Research & Algorithms: The underlying algorithm is based on research from Google Research (arXiv:2504.19874).
  • CUDA Kernels: Special thanks to Gabe Ortiz (signalnine) for porting the CUDA kernels.

👤 Author: Dr. Manuel Herrador
📧 mherrador@ujaen.es
University of Jaén (UJA) - Spain


Made with ❤️ for the Local AI Community by PacifAIst
