ATOM (AiTer Optimized Model) is a lightweight vLLM-like inference engine focused on integration and optimization on top of AITER.
- ROCm Optimized: Built on AMD's ROCm platform with AITER kernels (ASM, CK, Triton)
- OpenAI-Compatible API: Drop-in server with `/v1/chat/completions` and `/v1/completions` endpoints
- Piecewise torch.compile: 4 compilation levels with CUDA graph capture for low-latency decode
- Multi-GPU Parallelism: Tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP) with MORI all-to-all
- Quantization: FP8, MXFP4, INT8, INT4 with auto-detection from HuggingFace configs
- Speculative Decoding: Multi-Token Prediction (MTP) with EAGLE proposer
- Prefix Caching: xxhash64-based KV cache block sharing across sequences
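
Prefix caching operates on fixed-size KV-cache blocks. Below is a minimal conceptual sketch of the idea only (illustrative names and block size, not ATOM's actual data structures): each full block of token ids is hashed with xxhash64, chaining in the parent block's hash, so sequences that share a prefix resolve to the same block hashes and can reuse the same KV-cache blocks.

```python
import xxhash  # pip install xxhash

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

def block_hashes(token_ids):
    """Hash each full block of token ids, chaining the parent hash so that
    two sequences share a block only if their entire prefix matches."""
    hashes, parent = [], 0
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        h = xxhash.xxh64(repr((parent, tuple(block))).encode()).intdigest()
        hashes.append(h)
        parent = h
    return hashes

# Two prompts with a common 32-token prefix map to the same leading block hashes,
# so the KV-cache blocks for that prefix can be shared between them.
a = block_hashes(list(range(40)))
b = block_hashes(list(range(32)) + [999] * 8)
assert a[:2] == b[:2]
```
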
| Model Family | HF Architecture | Dense/MoE | Notes |
|---|---|---|---|
| Llama | `LlamaForCausalLM` | Dense | Llama 2, Llama 3, Llama 3.1 |
| Qwen3 | `Qwen3ForCausalLM` | Dense | |
| Qwen3-MoE | `Qwen3MoeForCausalLM` | MoE | 128 experts, top-8 routing |
| DeepSeek V2/V3 | `DeepseekV3ForCausalLM` | MoE | MLA attention, MTP speculative decoding |
| Mixtral | `MixtralForCausalLM` | MoE | 8 experts, top-2 routing |
| GLM-4-MoE | `Glm4MoeForCausalLM` | MoE | |
| GPT-OSS | `GptOssForCausalLM` | MoE | Sliding window + attention sinks |
| Kimi-K2 | via `--trust-remote-code` | MoE | See recipe |
- AMD GPU with ROCm support
- Docker
```bash
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
```
```bash
pip install amd-aiter

git clone https://github.com/ROCm/ATOM.git
pip install ./ATOM
```

| Topic | Description | Guide |
|---|---|---|
| Architecture | System overview, request lifecycle, component design | Architecture Guide |
| Configuration | Config classes, CLI arguments, environment variables | Configuration Guide |
| Model Support | Supported models, weight loading, adding new architectures | Model Support Guide |
| Model Operations | AITER kernel integration, linear/attention/MoE/norm wrappers | Model Ops Guide |
| Scheduling & KV Cache | Batch scheduling, block allocation, prefix caching | Scheduling Guide |
| Compilation | torch.compile levels, CUDA graphs, piecewise compilation | Compilation Guide |
| Distributed | Tensor/data/expert parallelism, multi-GPU deployment | Distributed Guide |
| Serving & Benchmarks | OpenAI API server, benchmarking, profiling, speculative decoding | Serving Guide |
Deployment Recipes:
- Qwen3-235B-A22B -- TP8 + EP with FP8 KV cache
- Kimi-K2-Thinking -- MXFP4 MoE on 4 GPUs
The default optimization level is 3 (piecewise torch.compile with CUDA graphs).
```bash
python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```
Note: First-time execution may take approximately 10 minutes for model compilation.
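
That compilation happens because the default optimization level uses piecewise torch.compile. As a rough illustration of the "piecewise" idea, here is a generic PyTorch sketch (made-up module names, not ATOM's implementation): each layer is compiled as its own graph rather than compiling the model end to end.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A stand-in for one transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(x)

layers = [Block(256) for _ in range(4)]

# Piecewise idea: compile each layer as its own graph instead of the whole model,
# so dynamic-shape or cache-management code between layers stays outside the
# compiled pieces and does not force recompiling everything.
compiled_model = nn.Sequential(*(torch.compile(layer) for layer in layers))

x = torch.randn(2, 16, 256)
y = compiled_model(x)
print(y.shape)  # torch.Size([2, 16, 256])
```
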
Start an OpenAI-compatible server:
```bash
# Single GPU
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

# Multi-GPU with tensor parallelism
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8
```
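
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using Python's `requests` against the `/v1/chat/completions` endpoint (assuming the default port 8000 used elsewhere in this README and the Qwen/Qwen3-0.6B model from the single-GPU command above):

```python
import requests

# Query the OpenAI-compatible chat endpoint of a running ATOM server.
# Assumes the server from the single-GPU command above, listening on port 8000.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
        "temperature": 0.0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
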
Profile offline inference:
```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8
```
With custom input/output lengths:
```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8 \
  --random-input --input-length 1024 --output-length 32
```
Profile a running server:
```bash
curl -s -S -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
```
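
The same endpoints can be driven programmatically to bracket an arbitrary workload. A small illustrative sketch (only the `/start_profile` and `/stop_profile` paths come from above; the helper and the example workload are hypothetical):

```python
import requests

BASE_URL = "http://127.0.0.1:8000"  # address of the running ATOM server

def run_profiled(workload):
    """Start the server-side profiler, run a workload, then stop the profiler."""
    requests.post(f"{BASE_URL}/start_profile")
    try:
        workload()
    finally:
        requests.post(f"{BASE_URL}/stop_profile")

# Example: profile a single chat completion request (illustrative workload).
run_profiled(lambda: requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={"model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 16},
))
```
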
Run an online throughput benchmark against a running server:

```bash
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python -m atom.benchmarks.benchmark_serving \
  --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
  --dataset-name=random \
  --random-input-len=$ISL --random-output-len=$OSL \
  --random-range-ratio 0.8 \
  --num-prompts=$(( $CONC * 10 )) \
  --max-concurrency=$CONC \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
  --result-dir=./ --result-filename=$RESULT_FILENAME.json
```
For more information, visit InferenceMAX.
Install lm-eval to test model accuracy:
```bash
pip install lm-eval[api]
```
Start a server, then run the evaluation:
```bash
python -m atom.entrypoints.openai_server --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```

```bash
lm_eval --model local-completions \
  --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 5
```
This project was adapted from nano-vllm.
We welcome issues and contributions! Please use the GitHub Issues page to report bugs or request features: https://github.com/ROCm/ATOM/issues