This library is designed for efficient inference using BitMamba2 models with 255M and 1B parameters. It implements support for quantization and BitNet-optimized architectures.
License: MIT DOI Hugging Face 1B Hugging Face 255M
Supported: Intel Haswell (4th Gen) or newer, AMD Ryzen.
Not currently supported: ARM devices (Raspberry Pi, Apple Silicon, Android) require a port to NEON intrinsics.
Note: The architectural efficiency (250MB RAM usage) makes it theoretically ideal for Edge devices, but this specific demo code is optimized for x86 desktops.
Use the scripts/export_bin.py script to convert your PyTorch/JAX checkpoints to the optimized C++ binary format.
Arguments:
--version: Model version to export (1bor250m).--ckpt_path: Path to the checkpoint file (.msgpack).--output_name: Output binary filename.
python3 scripts/export_bin.py --version 1b --ckpt_path ./bitmamba_1b.msgpack --output_name bitmamba_1b.bin
python3 scripts/export_bin.py --version 250m --ckpt_path ./bitmamba_250m.msgpack --output_name bitmamba_250m.bin
Ensure you have CMake installed (sudo apt install cmake or equivalent).
cmake -B build cmake --build build
The executable will be located at build/bitmamba.
The matmul kernel compiles a single-instruction int8 dot-product (VPDPBUSD)
when the target supports AVX-VNNI, falling back to an AVX2 maddubs path
otherwise. Select the feature level with -DBITMAMBA_ISA:
| Value | Flags | Use |
|---|---|---|
native (default) |
-march=native |
Build == run machine. Picks up AVX-VNNI automatically on Alder Lake+ / Zen4+. Best local performance. |
avx2 |
-march=x86-64-v3 |
Portable AVX2+FMA+BMI2 binary (no VNNI). Use when cross-building or shipping one binary to mixed CPUs. |
avxvnni |
+ -mavxvnni |
Force the VNNI kernel on a portable v3 baseline. |
avx512 |
+ AVX-512 VNNI |
AVX-512 capable servers. |
# portable binary that still runs the VNNI kernel where the CPU has it:
cmake -B build -DBITMAMBA_ISA=avxvnni
cmake --build buildTip: this is single-thread compute that scales with cores. Set
OMP_NUM_THREADSto the number of physical cores on the deployment VM (e.g.OMP_NUM_THREADS=2on a 2-vCPU instance) to avoid hyperthread oversubscription.
If you prefer g++:
g++ -O3 -march=native -fopenmp -Iinclude -Isrc -o bitmamba examples/main.cpp src/*.cppBitMamba-2 1B
wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin
BitMamba-2 0.25B
wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/bitmamba_cpp/bitmamba_255m.bin
Once you have the binary model (.bin) and the compiled executable, use the exported binary to run inference.
Example command:
./build/bitmamba <model.bin> "<prompt_tokens>" <mode> <temp> <repeat_penalty> <top_p> <top_k> <max_tokens>
Tokenizer mode:
./build/bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200Raw mode:
./build/bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200Tokenizer mode:
./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200Raw mode:
./bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200This command runs the bitmamba_1b.bin model with a tokenized prompt, temperature 0.7, repetition penalty 1.1, generating 200 tokens.
If you use raw mode, you can use the scripts/decoder.py script to convert token IDs back into text.
Usage:
python scripts/decoder.py "tokens"Example:
python scripts/decoder.py "15496 11 314 716"- Future Work: Add ARM/NEON support for Raspberry Pi deployment.
For OpenAI-compatible API access, a Python FastAPI server is available:
# Install Python dependencies cd python pip install -r requirements.txt # Start the server python server.py --model ../bitmamba_1b.bin --host 127.0.0.1 --port 8000
The server provides OpenAI-compatible endpoints:
/v1/chat/completions- Chat completions/v1/completions- Text completions/v1/models- List models
See python/README.md for full documentation.
This implementation is based on the approach described by David Noel in his blog post on RYS.
The C++ binary supports virtual layer repetition at zero extra weight-memory cost via the --repeat-start, --repeat-end, and --repeat-count flags. The same physical layer is executed multiple times with independent recurrent state, which can improve reasoning on certain prompts depending on the chosen slice.
Use scripts/brain_scanner.py to grid-search the best slice for your model on BoolQ + ARC-Easy:
python3 scripts/brain_scanner.py \ --binary ./build/bitmamba \ --model bitmamba_1b.bin \ --range-start 0 --range-end 31 --min-span 2 \ --log brain_scan_1b.csv
Then run inference with the chosen slice:
./build/bitmamba --repeat-start 17 --repeat-end 21 \
bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200brain_scanner.py requires the optional dependencies tiktoken and datasets (already listed in requirements.txt).
Use the scripts/fast_inference.py script to evaluate the models:
Weights for 250M version:
wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/jax_weights/bitmamba_255m.msgpack
Weights for 1B version:
wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/jax_weights/bit_mamba_1b.msgpack
python scripts/fast_inference.py --ckpt bitmamba_255m.msgpack --version 250m --eval
python scripts/fast_inference.py --ckpt bit_mamba_1b.msgpack --version 1b --eval