LLM Router Server is a high-performance routing service for multi-model deployments. It manages and orchestrates multiple local Large Language Model (LLM) services, Embedding models, Reranking models, and other inference services through a unified interface.
- Unified Routing Management: Integrates multiple independent vLLM services, Embedding services, and Reranker services
- OpenAI-Compatible API: Provides fully OpenAI-compatible API endpoints (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`)
- Configuration-Based Deployment: Manage startup parameters, ports, GPU allocation, and other settings for multiple models through YAML configuration files
- Multi-Model Parallelism: Supports multiple model instances running simultaneously, each using independent processes and GPU resources
- Intelligent Load Balancing: Automatically selects the least-loaded instance based on real-time metrics (running requests, waiting requests, KV cache usage)
- High-Performance Forwarding: Asynchronous architecture built on FastAPI + Gunicorn + Uvloop
- Streaming Response Optimization: Optimizes streaming requests to ensure low latency and stable token output
- Multi-Model Service Deployment: Deploy multiple LLM models on single or multiple servers
- Model Load Balancing: Dynamically select different models based on business requirements
- Unified API Interface: Provide unified API endpoints for different models
- RAG Applications: Integrate Embedding and Reranking services to build complete Retrieval-Augmented Generation systems
- Each LLM model is launched through an independent process, using different ports and CUDA devices
- Supports dynamic configuration of model count, GPU memory allocation, number of concurrent requests, and other parameters
- Models are isolated from each other; a single model failure does not affect other services
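
The per-process isolation described above can be sketched as follows. This is a minimal illustration, not the project's `vllm_launcher.py`: the helper names `build_vllm_command` and `build_env` are hypothetical, and it assumes each instance is started with the `vllm serve` CLI and pinned to its GPU through the `CUDA_VISIBLE_DEVICES` environment variable.

```python
import os

# Example values mirroring the fields used in configs/config.yaml.
MODEL_CONFIG = {
    "model_tag": "Qwen/Qwen3-0.6B",
    "dtype": "float16",
    "max_model_len": 500,
    "gpu_memory_utilization": 0.35,
    "tensor_parallel_size": 1,
}
INSTANCE = {"id": "qwen3-1", "host": "localhost", "port": 8002, "cuda_device": 0}


def build_vllm_command(model_config, instance):
    """Build the argv list for one vLLM server process (hypothetical helper)."""
    return [
        "vllm", "serve", str(model_config["model_tag"]),
        "--host", str(instance["host"]),
        "--port", str(instance["port"]),
        "--dtype", str(model_config["dtype"]),
        "--max-model-len", str(model_config["max_model_len"]),
        "--gpu-memory-utilization", str(model_config["gpu_memory_utilization"]),
        "--tensor-parallel-size", str(model_config["tensor_parallel_size"]),
    ]


def build_env(instance, base_env=None):
    """Copy the environment and restrict the process to its configured GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(instance["cuda_device"])
    return env


command = build_vllm_command(MODEL_CONFIG, INSTANCE)
env = build_env(INSTANCE)
```

Passing `command` and `env` to `subprocess.Popen(command, env=env)` would start one independent process per instance, so a crash in one process leaves the others running.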
- Real-Time Metrics Monitoring: Continuously polls the vLLM `/metrics` endpoint of each instance to gather:
  - Number of running requests
  - Number of waiting requests
  - KV cache usage percentage
  - Total prompt and generation tokens
- Least-Load Selection: Automatically routes requests to the instance with the lowest load score
- Load Score Calculation: Combines multiple metrics with configurable weights:
  - Waiting requests weight: 10.0
  - Running requests weight: 3.0
  - KV cache usage weight: 100.0
- Health Monitoring: Tracks backend health status and applies cooldown periods for failed instances
- Inflight Request Tracking: Monitors in-flight requests to prevent overloading any single instance
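
As a rough illustration of the load-balancing logic above, the sketch below parses a Prometheus-style `/metrics` payload, computes a load score, and picks the least-loaded instance. Several things here are assumptions rather than confirmed details of the router: the score is taken to be a plain weighted sum, KV cache usage is treated as a fraction between 0 and 1, and the metric names (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) vary between vLLM versions. The function names are hypothetical.

```python
import re

# Weights as listed above.
WAITING_WEIGHT = 10.0
RUNNING_WEIGHT = 3.0
KV_CACHE_WEIGHT = 100.0

# Assumed metric names; check the /metrics output of your vLLM version.
METRIC_NAMES = {
    "running": "vllm:num_requests_running",
    "waiting": "vllm:num_requests_waiting",
    "kv_cache": "vllm:gpu_cache_usage_perc",
}

# Matches "name{labels} value" or "name value" sample lines.
_SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")

SAMPLE_METRICS = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="Qwen/Qwen3-0.6B"} 2.0
vllm:num_requests_waiting{model_name="Qwen/Qwen3-0.6B"} 1.0
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen3-0.6B"} 0.25
"""


def parse_metrics(text):
    """Extract the three load metrics from a Prometheus text payload."""
    wanted = {name: key for key, name in METRIC_NAMES.items()}
    values = {key: 0.0 for key in METRIC_NAMES}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = _SAMPLE_RE.match(line)
        if match and match.group(1) in wanted:
            values[wanted[match.group(1)]] = float(match.group(2))
    return values


def load_score(metrics):
    """Weighted sum of the metrics (assumed formula); lower means less loaded."""
    return (
        WAITING_WEIGHT * metrics["waiting"]
        + RUNNING_WEIGHT * metrics["running"]
        + KV_CACHE_WEIGHT * metrics["kv_cache"]
    )


def pick_least_loaded(metrics_by_instance):
    """Return the instance ID with the lowest load score."""
    return min(metrics_by_instance, key=lambda i: load_score(metrics_by_instance[i]))
```

With the sample payload, the score works out to 10.0 × 1 + 3.0 × 2 + 100.0 × 0.25 = 41.0. The large KV cache weight means an instance with a nearly full cache is avoided even when its queue is short.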
- Built-in Embedding server and Reranker server
- Supports multiple Embedding models (m3e-base, bge-m3, etc.)
- Supports multiple Reranking models (bge-reranker-large, etc.)
- Unified forwarding of `/v1/embeddings` requests
- Supports direct invocation using OpenAI Python SDK
- No need to modify existing code, just change the `base_url`
- Supports all standard parameters (temperature, top_p, max_tokens, etc.)
- Client Request: Client sends requests to Router Server via OpenAI SDK or HTTP client
- Route Resolution: Router looks up the corresponding backend service configuration based on the `model` parameter in the request
- Load-Based Instance Selection: For models with multiple instances:
  - Fetches real-time metrics from all instances
  - Calculates a load score for each instance
  - Selects the instance with the lowest load
  - Considers health status and cooldown periods
- Request Forwarding: Forwards the request to the selected vLLM or Embedding service instance
- Streaming Processing: Optimizes streaming responses to ensure low-latency transmission
- Health Tracking: Monitors request success/failure and updates instance health status
- Response Return: Returns the backend service response to the client as-is
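
The route resolution, instance selection, and health tracking steps in the flow above could look roughly like this. It is a sketch under stated assumptions, not the code in `router.py`: the `HealthTracker` class, the `choose_instance` function, the 30-second cooldown, and the fallback to all instances when every backend is cooling down are all hypothetical choices.

```python
import time


class HealthTracker:
    """Remembers recent failures and hides failed instances during a cooldown."""

    def __init__(self, cooldown_seconds=30.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failed_at = {}

    def record_failure(self, instance_id):
        self._failed_at[instance_id] = self._clock()

    def record_success(self, instance_id):
        self._failed_at.pop(instance_id, None)

    def is_available(self, instance_id):
        failed_at = self._failed_at.get(instance_id)
        if failed_at is None:
            return True
        return self._clock() - failed_at >= self.cooldown_seconds


def choose_instance(model, routes, scores, health):
    """Resolve `model` to its instances, then pick the healthy one with the lowest score."""
    if model not in routes:
        raise KeyError(f"unknown model: {model}")
    candidates = [i for i in routes[model] if health.is_available(i)]
    if not candidates:
        # Assumed policy: if every instance is cooling down, try them all anyway.
        candidates = list(routes[model])
    return min(candidates, key=lambda i: scores.get(i, 0.0))
```

A caller would invoke `record_failure` or `record_success` after each forwarded request, so that a failing instance stops receiving traffic until its cooldown expires.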
LLM-Router-Server/
├── configs/                                # Configuration directory
│   ├── config.yaml                         # Main configuration file (models, server settings)
│   └── gunicorn.conf.py                    # Gunicorn configuration
├── docker/                                 # Docker related files
│   ├── Dockerfile                          # Docker image build file
│   └── docker-compose.yaml                 # Docker Compose configuration
├── logs/                                   # Log directory
├── scripts/                                # Startup scripts directory
│   ├── start_all_models.py                 # Python script to start all models
│   └── start_all.sh                        # One-click startup script (models + router)
├── src/                                    # Main source code directory
│   ├── embedding_reranker/                 # Embedding and Reranker module
│   │   ├── __init__.py
│   │   ├── embedding_reranker_launcher.py  # Launcher
│   │   ├── schema.py                       # Data structure definitions
│   │   └── embedding_engine/               # Inference engine
│   │       ├── baseinferencer.py           # Base inference class
│   │       ├── embed_rerank.py             # Embedding/Rerank implementation
│   │       ├── generator.py                # Generator
│   │       └── optimize.py                 # Optimization tools
│   ├── llm_router/                         # LLM routing module
│   │   ├── __init__.py
│   │   ├── config_loader.py                # Configuration loader
│   │   ├── env.py                          # Environment variable management
│   │   ├── main.py                         # FastAPI application entry point
│   │   ├── router.py                       # Routing logic
│   │   └── vllm_launcher.py                # vLLM launcher
│   └── metrics/                            # Monitoring and metrics
│       └── basic_metrics.py                # Basic metrics collection
├── test/                                   # Test files directory
│   └── test_router_server.py               # Router server tests
├── requirements.txt                        # Python dependencies list
└── README.md                               # Project documentation
pip install -r requirements.txt
The main configuration file is located at configs/config.yaml and contains two main sections:
Configure one or more LLM models with multiple instances:

    LLM_engines:
      # Model with multiple instances
      Qwen3-0.6B:
        instances:
          # First instance
          - id: "qwen3-1"        # Instance ID
            host: "localhost"    # Service host
            port: 8002           # Service port
            cuda_device: 0       # CUDA device number
          # Second instance
          - id: "qwen3-2"        # Instance ID
            host: "localhost"    # Service host
            port: 8004           # Service port
            cuda_device: 0       # CUDA device number
        # Model configuration (shared by all instances)
        model_config:
          model_tag: "Qwen/Qwen3-0.6B"    # Model path or HuggingFace ID
          dtype: "float16"                # Data type
          max_model_len: 500              # Maximum sequence length
          gpu_memory_utilization: 0.35    # GPU memory utilization
          tensor_parallel_size: 1         # Tensor parallel size

    # Embedding and Reranking server configuration
    embedding_server:
      host: "localhost"
      port: 8005
      cuda_device: 1

      # Embedding model list
      embedding_models:
        m3e-base:
          model_name: "moka-ai/m3e-base"
          model_path: "./models/embedding_engine/model/embedding_model/m3e-base-model"
          tokenizer_path: "./models/embedding_engine/model/embedding_model/m3e-base-tokenizer"
          max_length: 512
          use_gpu: true
          use_float16: true
        bge-m3:
          model_name: "BAAI/bge-m3"
          model_path: "./models/embedding_engine/model/embedding_model/bge-m3-model"
          tokenizer_path: "./models/embedding_engine/model/embedding_model/bge-m3-tokenizer"
          max_length: 512
          use_gpu: true
          use_float16: true

      # Reranking model list
      reranking_models:
        bge-reranker-large:
          model_name: "BAAI/bge-reranker-large"
          model_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-model"
          tokenizer_path: "./models/embedding_engine/model/reranking_model/bge-reranker-large-tokenizer"
          max_length: 512
          use_gpu: true
          use_float16: true

LLM Engine Parameters:
Instance Configuration:
- `id`: Unique identifier for the instance
- `host`: Host address for the vLLM service to listen on
- `port`: Port for the vLLM service to listen on
- `cuda_device`: GPU device number to use
Model Configuration (shared by all instances):
- `model_tag`: Model file path or HuggingFace model ID
- `dtype`: Model precision type (`float16`, `bfloat16`, etc.)
- `max_model_len`: Maximum context length
- `gpu_memory_utilization`: GPU memory utilization (0.0-1.0)
- `tensor_parallel_size`: Tensor parallelism degree (multi-GPU inference)
Embedding Server Parameters:
- `host`, `port`: Server listening address and port
- `cuda_device`: GPU device to use
- `model_path`: Model weight file path
- `tokenizer_path`: Tokenizer file path
- `max_length`: Maximum sequence length
- `use_gpu`: Whether to use GPU
- `use_float16`: Whether to use FP16 precision
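
Because several instances can share one host and one GPU, it can help to sanity-check a configuration before launching anything. The sketch below is an illustrative helper, not part of the project: `validate_config` is a hypothetical name, it works on an already-parsed config dictionary, it assumes `gpu_memory_utilization` is a per-instance fraction of one GPU's memory, and it ignores `tensor_parallel_size` and the memory used by the embedding server.

```python
from collections import defaultdict

# Parsed form of the LLM_engines and embedding_server settings shown above.
EXAMPLE_CONFIG = {
    "LLM_engines": {
        "Qwen3-0.6B": {
            "instances": [
                {"id": "qwen3-1", "host": "localhost", "port": 8002, "cuda_device": 0},
                {"id": "qwen3-2", "host": "localhost", "port": 8004, "cuda_device": 0},
            ],
            "model_config": {"gpu_memory_utilization": 0.35},
        }
    },
    "embedding_server": {"host": "localhost", "port": 8005, "cuda_device": 1},
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    endpoint_owner = {}
    gpu_share = defaultdict(float)

    def claim(host, port, owner):
        key = (host, port)
        if key in endpoint_owner:
            problems.append(f"{owner} and {endpoint_owner[key]} both use {host}:{port}")
        else:
            endpoint_owner[key] = owner

    for engine in config.get("LLM_engines", {}).values():
        share = float(engine["model_config"]["gpu_memory_utilization"])
        for inst in engine["instances"]:
            claim(inst["host"], inst["port"], inst["id"])
            gpu_share[inst["cuda_device"]] += share

    embedding = config.get("embedding_server")
    if embedding:
        claim(embedding["host"], embedding["port"], "embedding_server")

    for device, total in sorted(gpu_share.items()):
        if total > 1.0:
            problems.append(
                f"cuda_device {device}: vLLM instances request {total:.2f} of GPU memory"
            )
    return problems
```

In the example configuration, the two instances on `cuda_device: 0` each request 0.35 of the GPU, 0.70 in total, so the check passes.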
Edit configs/gunicorn.conf.py:

    # gunicorn.conf.py
    import os

    # Bind address and port
    bind = "0.0.0.0:8947"

    # Number of workers (recommended: CPU core count)
    workers = 4

    # Worker class (using Uvicorn Worker for ASGI support)
    worker_class = "uvicorn.workers.UvicornWorker"

    # Timeout (0 means unlimited)
    timeout = 0

    # Log level
    loglevel = "info"

    # Access log output to stdout
    accesslog = "-"

    # Error log output to stderr
    errorlog = "-"

    # Whether to preload the application
    preload_app = False

Use the one-click startup script:
sh scripts/start_all.sh ./configs/config.yaml ./configs/gunicorn.conf.py
This script will execute in sequence:
- Start all configured vLLM model services
- Start Embedding and Reranker services (if configured)
- Start Router Server (using Gunicorn + multiple workers)
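
Model servers can take a while to load weights, so a startup sequence like the one above often benefits from a readiness check between steps. Whether `start_all.sh` performs such a check is not stated here; the sketch below is only an assumption about how one could be written, and `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8002)` could be called for each vLLM instance before the router is started. An open port only shows that the process is listening; it does not guarantee that the model has finished loading.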
Check all available models:
curl http://localhost:8947/v1/models

    from openai import OpenAI

    client = OpenAI(
        api_key="EMPTY",
        base_url="http://localhost:8947/v1"
    )

    # Non-streaming request
    # `model` must match a model name configured in config.yaml
    response = client.chat.completions.create(
        model="Qwen2.5-14B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Please introduce the advantages of Python."}
        ],
        temperature=0.7,
        max_tokens=500
    )
    print(response.choices[0].message.content)

    # Streaming request
    stream = client.chat.completions.create(
        model="Qwen2.5-14B-Instruct",
        messages=[
            {"role": "user", "content": "Write a poem about spring."}
        ],
        temperature=0.8,
        stream=True
    )
    for chunk in stream:
        # Skip chunks that carry no choices or no text delta
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)


    # Reuses the `client` created in the chat example
    response = client.embeddings.create(
        model="m3e-base",
        input=["This is the first text", "This is the second text"]
    )

    # Get embedding vectors
    embedding_1 = response.data[0].embedding
    embedding_2 = response.data[1].embedding
    print(f"Embedding dimension: {len(embedding_1)}")

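
A common next step with the two vectors returned above is to compare them. The helper below is a small, dependency-free sketch of cosine similarity; `cosine_similarity` is an illustrative name and not something the router provides.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1.0, 1.0]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention used here: similarity with a zero vector is 0
    return dot / (norm_a * norm_b)
```

Calling `cosine_similarity(embedding_1, embedding_2)` gives a value close to 1.0 for texts the model considers similar.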

    # Reuses the `client` created in the chat example
    documents = [
        "Machine learning is best learned through projects.",
        "Theory is essential for understanding machine learning.",
        "Practical tutorials are the best way to learn machine learning."
    ]

    response = client.embeddings.create(
        model="bge-reranker-large",
        input=documents,
        extra_body={"query": "How to learn machine learning?"}
    )

    # Get reranking scores
    for idx, item in enumerate(response.data):
        print(f"Document {idx}: Score {item.embedding}")

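
To turn the scores printed above into an ordered result list, something like the following could be used. The exact shape of `item.embedding` for a reranking model is not documented here, so this sketch assumes it is either a bare number or a one-element list, and that a higher score means a more relevant document. `to_score` and `rank_documents` are hypothetical helpers.

```python
def to_score(value):
    """Accept a bare number or a one-element sequence and return a float."""
    if isinstance(value, (int, float)):
        return float(value)
    if len(value) != 1:
        raise ValueError("expected a single relevance score")
    return float(value[0])


def rank_documents(documents, raw_scores, top_k=None):
    """Pair documents with scores and sort them from most to least relevant."""
    if len(documents) != len(raw_scores):
        raise ValueError("documents and scores must have the same length")
    scored = [(doc, to_score(raw)) for doc, raw in zip(documents, raw_scores)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]
```

With the response above, `rank_documents(documents, [item.embedding for item in response.data], top_k=2)` would return the two highest-scoring documents together with their scores.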
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completion (supports streaming) |
| `/v1/completions` | POST | Text completion (supports streaming) |
| `/v1/embeddings` | POST | Text embeddings / Reranking |
| `/v1/models` | GET | List all available models |
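
The endpoints in the table can also be called without the OpenAI SDK. The sketch below builds a plain JSON `POST` for `/v1/chat/completions` using only the standard library. The helper names are illustrative, and the payload shape follows the OpenAI chat format that the router is described as accepting.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, **params):
    """Build (but do not send) a chat completion request for the router."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For instance, `send(build_chat_request("http://localhost:8947/v1", "Qwen3-0.6B", [{"role": "user", "content": "Hello"}], max_tokens=32))` would return the parsed response once the router is running.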
- `LLM Router Streaming 問題紀錄與解法.md`: Streaming response optimization guide
- `LLM Router 吞吐優化.md`: Throughput optimization guide
This project is licensed under the MIT License.