Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Zhayr1/bitmamba.cpp

Repository files navigation

BitMamba.cpp Library

This library is designed for efficient inference using BitMamba2 models with 255M and 1B parameters. It implements support for quantization and BitNet-optimized architectures.

License: MIT DOI Hugging Face 1B Hugging Face 255M

Requirements and compatibility

⚠️ Hardware Requirements: This C++ implementation utilizes AVX2 SIMD instructions for high-performance inference on x86 CPUs (Intel/AMD).

Supported: Intel Haswell (4th Gen) or newer, AMD Ryzen.

Not currently supported: ARM devices (Raspberry Pi, Apple Silicon, Android) require a port to NEON intrinsics.

Note: The architectural efficiency (250MB RAM usage) makes it theoretically ideal for Edge devices, but this specific demo code is optimized for x86 desktops.

Usage Instructions

1. Exporting the Model

Use the scripts/export_bin.py script to convert your PyTorch/JAX checkpoints to the optimized C++ binary format.

Arguments:

  • --version: Model version to export (1b or 250m).
  • --ckpt_path: Path to the checkpoint file (.msgpack).
  • --output_name: Output binary filename.

Example for 1B version:

python3 scripts/export_bin.py --version 1b --ckpt_path ./bitmamba_1b.msgpack --output_name bitmamba_1b.bin

Example for 250M version:

python3 scripts/export_bin.py --version 250m --ckpt_path ./bitmamba_250m.msgpack --output_name bitmamba_250m.bin

2. Compile the C++ Inference Engine

Option 1: Using CMake (Recommended)

Ensure you have CMake installed (sudo apt install cmake or equivalent).

cmake -B build
cmake --build build

The executable will be located at build/bitmamba.

Target ISA (BITMAMBA_ISA)

The matmul kernel compiles a single-instruction int8 dot-product (VPDPBUSD) when the target supports AVX-VNNI, falling back to an AVX2 maddubs path otherwise. Select the feature level with -DBITMAMBA_ISA:

Value Flags Use
native (default) -march=native Build == run machine. Picks up AVX-VNNI automatically on Alder Lake+ / Zen4+. Best local performance.
avx2 -march=x86-64-v3 Portable AVX2+FMA+BMI2 binary (no VNNI). Use when cross-building or shipping one binary to mixed CPUs.
avxvnni + -mavxvnni Force the VNNI kernel on a portable v3 baseline.
avx512 + AVX-512 VNNI AVX-512 capable servers.
# portable binary that still runs the VNNI kernel where the CPU has it:
cmake -B build -DBITMAMBA_ISA=avxvnni
cmake --build build

Tip: this is single-thread compute that scales with cores. Set OMP_NUM_THREADS to the number of physical cores on the deployment VM (e.g. OMP_NUM_THREADS=2 on a 2-vCPU instance) to avoid hyperthread oversubscription.

Option 2: Quick Build (Manual)

If you prefer g++:

g++ -O3 -march=native -fopenmp -Iinclude -Isrc -o bitmamba examples/main.cpp src/*.cpp

3. Running Inference

3.1 Download Weights (from Hugging Face)

BitMamba-2 1B

wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin

BitMamba-2 0.25B

wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/bitmamba_cpp/bitmamba_255m.bin

Once you have the binary model (.bin) and the compiled executable, use the exported binary to run inference.

Example command:

./build/bitmamba <model.bin> "<prompt_tokens>" <mode> <temp> <repeat_penalty> <top_p> <top_k> <max_tokens>

Practical Example:

CMake Build

Tokenizer mode:

./build/bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200

Raw mode:

./build/bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200
Manual Build

Tokenizer mode:

./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200

Raw mode:

./bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200

⚠️ IMPORTANT: the tokenizer.bin file must be in the same directory as the bitmamba compiled executable.

This command runs the bitmamba_1b.bin model with a tokenized prompt, temperature 0.7, repetition penalty 1.1, generating 200 tokens.

4. Decoding Tokens

If you use raw mode, you can use the scripts/decoder.py script to convert token IDs back into text.

Usage:

python scripts/decoder.py "tokens"

Example:

python scripts/decoder.py "15496 11 314 716"

TODO

  • Future Work: Add ARM/NEON support for Raspberry Pi deployment.

API Server (Optional)

For OpenAI-compatible API access, a Python FastAPI server is available:

# Install Python dependencies
cd python
pip install -r requirements.txt
# Start the server
python server.py --model ../bitmamba_1b.bin --host 127.0.0.1 --port 8000

The server provides OpenAI-compatible endpoints:

  • /v1/chat/completions - Chat completions
  • /v1/completions - Text completions
  • /v1/models - List models

See python/README.md for full documentation.

Layer Repetition Scanner (RYS — LLM Neuroanatomy)

This implementation is based on the approach described by David Noel in his blog post on RYS.

The C++ binary supports virtual layer repetition at zero extra weight-memory cost via the --repeat-start, --repeat-end, and --repeat-count flags. The same physical layer is executed multiple times with independent recurrent state, which can improve reasoning on certain prompts depending on the chosen slice.

Use scripts/brain_scanner.py to grid-search the best slice for your model on BoolQ + ARC-Easy:

python3 scripts/brain_scanner.py \
 --binary ./build/bitmamba \
 --model bitmamba_1b.bin \
 --range-start 0 --range-end 31 --min-span 2 \
 --log brain_scan_1b.csv

Then run inference with the chosen slice:

./build/bitmamba --repeat-start 17 --repeat-end 21 \
 bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200

brain_scanner.py requires the optional dependencies tiktoken and datasets (already listed in requirements.txt).

Python Inference Evaluation test

Use the scripts/fast_inference.py script to evaluate the models:

Get the weights

Weights for 250M version:

wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/jax_weights/bitmamba_255m.msgpack

Weights for 1B version:

wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/jax_weights/bit_mamba_1b.msgpack

250M Version

python scripts/fast_inference.py --ckpt bitmamba_255m.msgpack --version 250m --eval

1B Version

python scripts/fast_inference.py --ckpt bit_mamba_1b.msgpack --version 1b --eval

About

Ultra-lightweight C++ inference engine for BitMamba-2 (1.58-bit SSM). Runs 1B models on consumer CPUs at 50+ tok/s using <700MB RAM. No heavy dependencies.

Topics

Resources

License

Stars

Watchers

Forks

Packages

Contributors

AltStyle によって変換されたページ (->オリジナル) /