vLLM Multi-Model Server

A production-ready Docker-based deployment system for running multiple vLLM models with LiteLLM proxy integration. This setup provides OpenAI-compatible API endpoints for various large language models optimized for different use cases.

🚀 Quick Start

Launch a Model

The easiest way to launch a model is using the runMe.sh script:

# List available models
./runMe.sh
# Launch a specific model
./runMe.sh step-3.5-flash
# Launch with rebuild
./runMe.sh glm47-flash --build
# Launch in detached mode (background)
./runMe.sh qwen3-next-coder -d

Alternative: Direct Docker Compose

If you prefer using docker compose directly:

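# Set the model and launch (the variable is passed as an argument to sudo)
sudo MODEL=step-3.5-flash docker compose up
# With rebuild
sudo MODEL=step-3.5-flash docker compose up --build
# In detached mode
sudo MODEL=glm47-flash docker compose up -d

⚠️ Important: With sudo, pass the variable as an argument to sudo (sudo MODEL=name docker compose up). On most systems sudo resets the environment by default, so a variable set in front of it (MODEL=name sudo docker compose up) is dropped and the default model launches instead. If Docker runs without sudo, MODEL=name docker compose up works as usual.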

📋 Available Models

This deployment supports multiple pre-configured models, each optimized for specific use cases:

| Model | Use Case | Context | Concurrency |
|---|---|---|---|
| `glm47-flash` | General-purpose reasoning & tool calling | 128K | Low (16) |
| `step-3.5-flash` | Long-context reasoning & problem-solving | Auto | Medium (24) |
| `step-3.5-flash-hcsw` | High-throughput inference | 8K | High (64) |
| `qwen3-next-coder` | Code generation and analysis | Variable | Medium |

For detailed model specifications and configuration, see models/README.md.

🏗️ Architecture

The system consists of three main services:

vllm-node: Runs the vLLM inference engine with the selected model
litellm: Provides a unified OpenAI-compatible API proxy with rate limiting and monitoring
db: PostgreSQL database for LiteLLM's internal state and usage tracking

┌─────────────────┐
│   Your Client   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────┐
│     LiteLLM     │────▶│  PostgreSQL  │
│   Port: 4000    │     │   Database   │
└────────┬────────┘     └──────────────┘
         │
         ▼
┌─────────────────┐
│   vLLM Engine   │
│   Port: 8000    │
└─────────────────┘

🔧 Configuration

Environment Variables

Global configuration is stored in vars.env. Common variables:

# NCCL Configuration
NCCL_P2P_DISABLE=1
# LiteLLM Defaults (optional, can override per-model)
#LITELLM_TEMPERATURE=0.7
#LITELLM_TOP_P=0.8
#LITELLM_MAX_TOKENS=65536

Model-Specific Configuration

Each model has its own configuration file in the models/ directory:

models/glm47-flash.yml
models/step-3.5-flash.yml
models/step-3.5-flash-hcsw.yml
models/qwen3-next-coder.yml

These files define:

The vLLM command with model-specific flags
Environment variables for optimization
LiteLLM API parameters
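
The one-file-per-model layout suggests that a model name is simply the file name without its extension. As a minimal sketch under that assumption, the helper below lists model names by scanning a directory for .yml files; `available_models` is a hypothetical name, and the real listing logic in runMe.sh may differ.

```python
import tempfile
from pathlib import Path


def available_models(models_dir):
    """Return model names, assuming each name is a .yml file's stem."""
    return sorted(path.stem for path in Path(models_dir).glob("*.yml"))


# Demo against a throwaway directory that mimics part of models/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("glm47-flash.yml", "step-3.5-flash.yml", "README.md"):
        Path(tmp, name).touch()
    demo_models = available_models(tmp)

print(demo_models)
```

Only the final extension is removed, so a name containing dots such as step-3.5-flash survives intact, and README.md is skipped because it does not match the *.yml pattern.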

Secrets Management

Sensitive credentials (API keys, tokens) should be stored in the secrets/ directory:

# Create secrets directory
mkdir -p secrets
# Add secrets as individual files (filename = env var name)
echo "your-hf-token" > secrets/HF_TOKEN
echo "your-api-key" > secrets/ANTHROPIC_API_KEY
# These will be automatically loaded as environment variables

See secrets/README.md for more details.
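
To make the "filename = env var name" convention concrete, here is a minimal Python sketch of a loader. It is an illustration only: the repository's actual loader is scripts/load_secrets_env.sh, whose behavior is not shown here, and `load_secrets` plus the rules for skipping hidden files and .md files are assumptions.

```python
import tempfile
from pathlib import Path


def load_secrets(secrets_dir):
    """Map each secret file name to its contents (assumed convention)."""
    secrets = {}
    for path in sorted(Path(secrets_dir).iterdir()):
        # Assumption: documentation and hidden files are not secrets.
        if not path.is_file() or path.name.startswith(".") or path.suffix == ".md":
            continue
        # echo appends a newline, so strip surrounding whitespace.
        secrets[path.name] = path.read_text(encoding="utf-8").strip()
    return secrets


# Demo with a throwaway directory instead of the real secrets/ folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "HF_TOKEN").write_text("your-hf-token\n", encoding="utf-8")
    Path(tmp, "README.md").write_text("docs", encoding="utf-8")
    demo_secrets = load_secrets(tmp)

print(sorted(demo_secrets))
```

The returned dictionary could be merged into `os.environ` before launching a process. Stripping whitespace matters because a trailing newline inside a token is a common cause of authentication failures.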

📡 API Usage

Accessing the API

Once launched, the services are available at:

LiteLLM Proxy: http://localhost:4000 (Recommended - with rate limiting & monitoring)
Direct vLLM: http://localhost:8000 (Direct access, no proxy features)

Example Usage

Python (OpenAI SDK)

from openai import OpenAI

# Use LiteLLM proxy (recommended)
client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="sk-FAKE",  # Default API key
)
response = client.chat.completions.create(
    model="vllm_agent",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
)
print(response.choices[0].message.content)

cURL

curl http://localhost:4000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer sk-FAKE" \
 -d '{
 "model": "vllm_agent",
 "messages": [{"role": "user", "content": "Hello!"}]
 }'
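
If the OpenAI SDK is not installed, the same request as the cURL call above can be built with the Python standard library alone. This is a sketch, not part of the project: `build_chat_request` and `chat` are hypothetical helper names, and the defaults reuse the URL, model name, and API key from the examples above.

```python
import json
import urllib.request


def build_chat_request(prompt, model="vllm_agent",
                       base_url="http://localhost:4000/v1",
                       api_key="sk-FAKE"):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt, timeout=60.0, **kwargs):
    """Send the request and return the first choice's text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the services running, `print(chat("Hello!"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.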

Using Claude Code with Local Model

Use the launcher scripts in launchers/ to route Claude Code through your local model:

# Launch Claude Code using your local vLLM instance
./launchers/local_claude.sh
# Or add to your PATH for easier access
export PATH="$PWD/launchers:$PATH"
local_claude.sh

See launchers/local_claude.sh for more options.

🛠️ Management Commands

Starting Services

# Foreground (see logs in terminal)
./runMe.sh step-3.5-flash
# Background (detached mode)
./runMe.sh step-3.5-flash -d

Stopping Services

# Stop all services
sudo docker compose down
# Stop and remove volumes (clean slate)
sudo docker compose down -v

Viewing Logs

# All services
sudo docker compose logs -f
# Specific service
sudo docker compose logs -f vllm-node
sudo docker compose logs -f litellm
# Last 100 lines
sudo docker compose logs --tail=100 vllm-node

Switching Models

# Stop current model
sudo docker compose down
# Start with different model
./runMe.sh qwen3-next-coder

Rebuilding After Configuration Changes

# Rebuild container image
./runMe.sh step-3.5-flash --build
# Or with docker compose directly
sudo MODEL=step-3.5-flash docker compose up --build

🔍 Health Checks

Check Service Status

# vLLM health
curl http://localhost:8000/health
# LiteLLM health
curl http://localhost:4000/health/liveliness
# List loaded models
curl -H "Authorization: Bearer sk-FAKE" http://localhost:8000/v1/models
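
Because a model can take a while to load, a script that depends on the server usually needs to poll a health endpoint instead of checking once. The sketch below does that with the standard library; `http_ok` and `wait_until_healthy` are hypothetical helpers, and the attempt count and delay are arbitrary assumptions to tune for your model size.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(url, probe=http_ok, attempts=60, delay=5.0,
                       sleep=time.sleep):
    """Poll until probe(url) succeeds; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

For example, `wait_until_healthy("http://localhost:8000/health")` returns the number of the first successful attempt, or None if the server never became healthy. The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.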

Monitor GPU Usage

# Real-time GPU monitoring
watch -n 1 nvidia-smi
# Docker container stats
sudo docker stats

Check Container Status

# List running containers
sudo docker compose ps
# Check specific container health
sudo docker compose ps vllm-node

🐛 Troubleshooting

Issue: Model Always Defaults to glm47-flash

Problem: Running MODEL="step-3.5-flash" sudo docker compose up still launches glm47-flash.

Cause: The sudo command doesn't preserve environment variables by default.

Solutions:

✅ Use the runMe.sh script (Recommended):
```
./runMe.sh step-3.5-flash
```

✅ Pass MODEL as an argument to sudo:

sudo MODEL=step-3.5-flash docker compose up

⚠️ Use sudo -E (preserves all env vars, if your sudoers policy allows it):

MODEL=step-3.5-flash sudo -E docker compose up
# Note: This passes your entire environment to the root process, which may be a security concern
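
A variable only reaches docker compose if it is present in the environment of the process that sudo starts. The sketch below imitates this without sudo: a child Python process stands in for docker compose, and removing MODEL from its environment stands in for sudo's environment reset. The fallback to glm47-flash when MODEL is unset is an assumption based on the behavior described in this issue.

```python
import os
import subprocess
import sys

# Stand-in for docker compose: report MODEL, or the default if it is unset.
CHILD = "import os; print(os.environ.get('MODEL', 'glm47-flash'))"


def launched_model(env):
    """Run the stand-in child with the given environment."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# The caller's shell has MODEL set.
caller_env = {**os.environ, "MODEL": "step-3.5-flash"}
# An environment reset drops variables that are not explicitly kept.
reset_env = {key: value for key, value in caller_env.items() if key != "MODEL"}

model_after_reset = launched_model(reset_env)
model_when_passed = launched_model(caller_env)
print(model_after_reset, model_when_passed)
```

The first result shows the silent fallback to the default, the second shows the model being honored once the variable actually arrives in the child's environment.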

Issue: Out of Memory (OOM) Errors

Symptoms: Container crashes with CUDA out of memory errors.

Solutions:

Switch to a smaller model or high-concurrency variant:

./runMe.sh step-3.5-flash-hcsw # Smaller context window

Reduce GPU memory utilization (edit model's .yml file):

command: |
 vllm serve ... --gpu-memory-utilization 0.85 # Was 0.96

Close other GPU-using processes:
```
nvidia-smi # Check what's using GPU
```
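
To see what lowering --gpu-memory-utilization buys, recall that the flag is the fraction of GPU memory the vLLM instance may use. The arithmetic below is a sketch for a hypothetical 80 GiB GPU; the card size and the `memory_budget_gib` helper are assumptions, so substitute the total reported by nvidia-smi.

```python
def memory_budget_gib(total_gib, utilization):
    """Memory vLLM may claim for a given --gpu-memory-utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gib * utilization


TOTAL_GIB = 80.0  # hypothetical GPU size, not taken from this project

before = memory_budget_gib(TOTAL_GIB, 0.96)
after = memory_budget_gib(TOTAL_GIB, 0.85)
headroom = before - after
print(f"{before:.1f} GiB -> {after:.1f} GiB, freeing {headroom:.1f} GiB")
```

The freed memory becomes headroom for other processes on the GPU, at the cost of a smaller budget for vLLM's own weights and KV cache, which can reduce the context length or concurrency it can sustain.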

Issue: Container Fails Health Check

Symptoms: Container marked as unhealthy, LiteLLM can't connect.

Diagnosis:

# Check container logs
sudo docker compose logs vllm-node
# Check if vLLM port is accessible
curl http://localhost:8000/health

Common causes:

Model download in progress (wait longer, check logs)
Insufficient GPU memory (see OOM solutions above)
Model configuration error (check model .yml file syntax)

Issue: Permission Denied

Symptoms: Docker commands fail with permission errors.

Solutions:

Add your user to docker group:

sudo usermod -aG docker $USER
newgrp docker # Activate immediately

Or continue using sudo:

./runMe.sh step-3.5-flash # Script handles sudo automatically (interactive terminal required if password prompt is needed)

Issue: Model Download is Slow

Cause: Models are downloaded from HuggingFace on first run.

Solution:

Be patient! Large models can be 10-50GB
Monitor progress in logs:
```
sudo docker compose logs -f vllm-node
```

Pre-download models:

huggingface-cli download stepfun-ai/Step-3.5-Flash-FP8

Issue: Port Already in Use

Symptoms: Error: "port is already allocated"

Solution:

# Check what's using the port
sudo lsof -i :8000
sudo lsof -i :4000
# Stop conflicting service or change ports in docker-compose.yml

📁 Project Structure

vllm-server/
├── README.md # This file
├── docker-compose.yml # Docker services definition
├── Dockerfile # Container image definition
├── runMe.sh # Simple model launcher script
├── run_vllm_agent.sh # Container entrypoint script
├── vars.env # Global environment variables
├── generate_litellm_config.py # LiteLLM config generator
├── litellm_config.template.yaml # LiteLLM template
│
├── models/ # Model configurations
│ ├── README.md # Model documentation
│ ├── glm47-flash.yml
│ ├── step-3.5-flash.yml
│ ├── step-3.5-flash-hcsw.yml
│ └── qwen3-next-coder.yml
│
├── scripts/ # Helper scripts
│ ├── gen_models_yml.sh # Model config builder
│ └── load_secrets_env.sh # Secrets loader
│
├── secrets/ # Sensitive credentials (gitignored)
│ ├── README.md
│ ├── HF_TOKEN # HuggingFace token
│ └── ANTHROPIC_API_KEY # Anthropic API key
│
└── launchers/ # Client launcher scripts
    ├── local_claude.sh # Claude Code launcher
    ├── local_codex.sh # Codex launcher
    └── open_code.sh # VS Code launcher

🔐 Security Notes

API Keys: The default API key is sk-FAKE - change this for production use
Secrets: Never commit secrets to git. Use the secrets/ directory (gitignored)
Network: Services are exposed on localhost by default. Configure firewall rules for external access
Sudo: The runMe.sh script uses sudo when needed. Review Docker group membership for sudo-less operation

🤝 Contributing

When adding a new model:

Create a new .yml file in models/ directory
Follow the existing format (see models/README.md)
Test with ./runMe.sh your-new-model
Update the models table in this README
Add model card details to models/README.md

📚 Further Reading

Model Configurations: models/README.md - Detailed model specs and tuning guide
vLLM Documentation: https://docs.vllm.ai/
LiteLLM Documentation: https://docs.litellm.ai/
Docker Compose Documentation: https://docs.docker.com/compose/

📜 License

See LICENSE file for details.

🆘 Support

Issues: Check logs first (sudo docker compose logs -f)
Documentation: See models/README.md for model-specific details
vLLM Docs: https://docs.vllm.ai/en/latest/
GPU Issues: Run nvidia-smi to check GPU status

Last Updated: February 2026
Compatible With: vLLM v1+, Docker Compose v2+
