ATOM (AiTer Optimized Model) is a lightweight vLLM-like inference engine focused on integration and optimization on top of AITER.
- ROCm Optimized: Built on AMD's ROCm platform with AITER kernels (ASM, CK, Triton)
- OpenAI-Compatible API: Drop-in server with `/v1/chat/completions` and `/v1/completions` endpoints
- Piecewise torch.compile: 4 compilation levels with CUDA graph capture for low-latency decode
- Multi-GPU Parallelism: Tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP) with MORI all-to-all
- Quantization: FP8, MXFP4, INT8, INT4 with auto-detection from HuggingFace configs
- Speculative Decoding: Multi-Token Prediction (MTP) with EAGLE proposer
- Prefix Caching: xxhash64-based KV cache block sharing across sequences
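
Prefix caching operates on fixed-size KV-cache blocks. Below is a minimal conceptual sketch of the idea only (illustrative names and block size, not ATOM's actual data structures): each full block of token ids is hashed with xxhash64, chaining in the parent block's hash, so sequences that share a prefix resolve to the same block hashes and can reuse the same KV-cache blocks.

```python
import xxhash  # pip install xxhash

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

def block_hashes(token_ids):
    """Hash each full block of token ids, chaining the parent hash so that
    two sequences share a block only if their entire prefix matches."""
    hashes, parent = [], 0
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        h = xxhash.xxh64(repr((parent, tuple(block))).encode()).intdigest()
        hashes.append(h)
        parent = h
    return hashes

# Two prompts with a common 32-token prefix map to the same leading block hashes,
# so the KV-cache blocks for that prefix can be shared between them.
a = block_hashes(list(range(40)))
b = block_hashes(list(range(32)) + [999] * 8)
assert a[:2] == b[:2]
```
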
| Model Family | HF Architecture | Dense/MoE | Notes |
|---|---|---|---|
| Llama | `LlamaForCausalLM` | Dense | Llama 2, Llama 3, Llama 3.1 |
| Qwen3 | `Qwen3ForCausalLM` | Dense | |
| Qwen3-MoE | `Qwen3MoeForCausalLM` | MoE | 128 experts, top-8 routing |
| DeepSeek V2/V3 | `DeepseekV3ForCausalLM` | MoE | MLA attention, MTP speculative decoding |
| Mixtral | `MixtralForCausalLM` | MoE | 8 experts, top-2 routing |
| GLM-4-MoE | `Glm4MoeForCausalLM` | MoE | |
| GPT-OSS | `GptOssForCausalLM` | MoE | Sliding window + attention sinks |
| Kimi-K2 | via `--trust-remote-code` | MoE | See recipe |
- AMD GPU with ROCm support
- Docker
```bash
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
```
```bash
pip install amd-aiter

git clone https://github.com/ROCm/ATOM.git
pip install ./ATOM
```

| Topic | Description | Guide |
|---|---|---|
| Architecture | System overview, request lifecycle, component design | Architecture Guide |
| Configuration | Config classes, CLI arguments, environment variables | Configuration Guide |
| Model Support | Supported models, weight loading, adding new architectures | Model Support Guide |
| Model Operations | AITER kernel integration, linear/attention/MoE/norm wrappers | Model Ops Guide |
| Scheduling & KV Cache | Batch scheduling, block allocation, prefix caching | Scheduling Guide |
| Compilation | torch.compile levels, CUDA graphs, piecewise compilation | Compilation Guide |
| Distributed | Tensor/data/expert parallelism, multi-GPU deployment | Distributed Guide |
| Serving & Benchmarks | OpenAI API server, benchmarking, profiling, speculative decoding | Serving Guide |
Deployment Recipes:
- Qwen3-235B-A22B -- TP8 + EP with FP8 KV cache
- Kimi-K2-Thinking -- MXFP4 MoE on 4 GPUs
The default optimization level is 3 (piecewise torch.compile with CUDA graphs).
```bash
python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```
Note: First-time execution may take approximately 10 minutes for model compilation.
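
That compilation happens because the default optimization level uses piecewise torch.compile. As a rough illustration of the "piecewise" idea, here is a generic PyTorch sketch (made-up module names, not ATOM's implementation): each layer is compiled as its own graph rather than compiling the model end to end.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A stand-in for one transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(x)

layers = [Block(256) for _ in range(4)]

# Piecewise idea: compile each layer as its own graph instead of the whole model,
# so dynamic-shape or cache-management code between layers stays outside the
# compiled pieces and does not force recompiling everything.
compiled_model = nn.Sequential(*(torch.compile(layer) for layer in layers))

x = torch.randn(2, 16, 256)
y = compiled_model(x)
print(y.shape)  # torch.Size([2, 16, 256])
```
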
Start an OpenAI-compatible server:
```bash
# Single GPU
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

# Multi-GPU with tensor parallelism
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8
```
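
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using Python's `requests` against the `/v1/chat/completions` endpoint (assuming the default port 8000 used elsewhere in this README and the Qwen/Qwen3-0.6B model from the single-GPU command above):

```python
import requests

# Query the OpenAI-compatible chat endpoint of a running ATOM server.
# Assumes the server from the single-GPU command above, listening on port 8000.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
        "temperature": 0.0,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
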
Profile offline inference:
```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8
```
With custom input/output lengths:
```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8 \
  --random-input --input-length 1024 --output-length 32
```
Profile a running server:
```bash
curl -s -S -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
```
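
The same endpoints can be driven programmatically to bracket an arbitrary workload. A small illustrative sketch (only the `/start_profile` and `/stop_profile` paths come from above; the helper and the example workload are hypothetical):

```python
import requests

BASE_URL = "http://127.0.0.1:8000"  # address of the running ATOM server

def run_profiled(workload):
    """Start the server-side profiler, run a workload, then stop the profiler."""
    requests.post(f"{BASE_URL}/start_profile")
    try:
        workload()
    finally:
        requests.post(f"{BASE_URL}/stop_profile")

# Example: profile a single chat completion request (illustrative workload).
run_profiled(lambda: requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={"model": "Qwen/Qwen3-0.6B",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 16},
))
```
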
Run an online throughput benchmark against a running server:

```bash
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python -m atom.benchmarks.benchmark_serving \
  --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
  --dataset-name=random \
  --random-input-len=$ISL --random-output-len=$OSL \
  --random-range-ratio 0.8 \
  --num-prompts=$(( $CONC * 10 )) \
  --max-concurrency=$CONC \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
  --result-dir=./ --result-filename=$RESULT_FILENAME.json
```
For more information, visit InferenceMAX.
Install lm-eval to test model accuracy:
```bash
pip install lm-eval[api]
```
Start a server, then run the evaluation:
```bash
python -m atom.entrypoints.openai_server --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```

```bash
lm_eval --model local-completions \
  --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 5
```
This project was adapted from nano-vllm.
We welcome issues and contributions! Please use the GitHub Issues page to report bugs or request features: https://github.com/ROCm/ATOM/issues