LLMSystems/TensorrtServer

TensorRT Model Inference Server

A high-performance deep learning model inference server based on TensorRT, supporting fast inference for Embedding, Reranker, and NLI models.

English | 中文

Features

  • High-Performance Inference: Model inference optimized with NVIDIA TensorRT
  • Dynamic Batching: Supports automatic batch aggregation to improve inference efficiency
  • Multi-Model Support: Simultaneously supports Embedding, Reranker, and NLI models
  • RESTful API: Provides a standard HTTP API interface
  • OpenAI Compatible: Supports API calls in OpenAI SDK format
  • GPU Memory Optimization: Efficient GPU memory management

Supported Models

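The server currently supports three model types: Embedding, Reranker, and NLI. The examples in this document use BGE-M3, BGE-Reranker-Large, and XLM-RoBERTa-Large-XNLI, respectively.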

Installation & Setup

1. Clone the Repository

git clone https://github.com/FubonDS/TensorrtServer.git
cd TensorrtServer

2. Install Dependencies

Use the image defined in docker/docker-compose.yaml as the base environment, then install the Python dependencies:

pip install -r requirements.txt

Model Conversion

Model conversion consists of two steps: PyTorch → ONNX and ONNX → TensorRT.

Step 1: Convert to ONNX

The conversion scripts are located in the trt_convert/ directory, and all of them accept command-line arguments (via argparse).

Embedding Model (BGE-M3)

python trt_convert/embedding2onnx.py \
 --model_path ./embedding_engine/model/embedding_model/bge-m3-model \
 --tokenizer_path ./embedding_engine/model/embedding_model/bge-m3-tokenizer \
 --output_path ./embedding_models/model_dynamic/bge_m3_embedding_dynamic.onnx \
 --max_length 256

Parameters & Script Details

Parameter | Default | Description
--model_path | ./embedding_engine/model/embedding_model/bge-m3-model | Model path
--tokenizer_path | ./embedding_engine/model/embedding_model/bge-m3-tokenizer | Tokenizer path
--output_path | ./embedding_models/model_dynamic/bge_m3_embedding_dynamic.onnx | Output ONNX path
--max_length | 256 | Maximum sequence length

Script notes:

  • Loads the model with AutoModel and uses EmbeddingWrapper to extract the CLS token (last_hidden_state[:, 0, :]) as output
  • Inputs: input_ids, attention_mask (int32); Output: embeddings
  • Dynamic axis: batch_size (dimension 0), ONNX opset: 17

Reranker Model (BGE-Reranker-Large)

python trt_convert/rerank2onnx.py \
 --model_path ./reranking_model/bge-reranker-large-model \
 --tokenizer_path ./reranking_model/bge-reranker-large-tokenizer \
 --output_path ./model_dynamic/bge_reranker_large_dynamic.onnx \
 --max_length 256

Parameters & Script Details

Parameter | Default | Description
--model_path | ./reranking_model/bge-reranker-large-model | Model path
--tokenizer_path | ./reranking_model/bge-reranker-large-tokenizer | Tokenizer path
--output_path | ./model_dynamic/bge_reranker_large_dynamic.onnx | Output ONNX path
--max_length | 256 | Maximum sequence length

Script notes:

  • Loads the model with AutoModelForSequenceClassification; RerankerWrapper outputs logits.squeeze(-1) as relevance scores
  • Inputs: input_ids, attention_mask (int32); Output: scores
  • Dynamic axis: batch_size (dimension 0), ONNX opset: 17

NLI Model (XLM-RoBERTa-Large-XNLI)

python trt_convert/nli2onnx.py \
 --model_path joeddav/xlm-roberta-large-xnli \
 --tokenizer_path joeddav/xlm-roberta-large-xnli \
 --output_path ./model_dynamic_bs/nli_model_dynamic_bs.onnx \
 --max_length 256

Parameters & Script Details

Parameter | Default | Description
--model_path | joeddav/xlm-roberta-large-xnli | Model path (supports HuggingFace Hub)
--tokenizer_path | joeddav/xlm-roberta-large-xnli | Tokenizer path
--output_path | ./model_dynamic_bs/nli_model_dynamic_bs.onnx | Output ONNX path
--max_length | 256 | Maximum sequence length

Script notes:

  • Loads the model with AutoModelForSequenceClassification and directly outputs logits (3 NLI class scores)
  • Inputs: input_ids, attention_mask (int32); Output: logits
  • Dynamic axis: batch_size (dimension 0), ONNX opset: 17

Step 2: Convert ONNX to TensorRT

Run this step inside a TensorRT Docker container (see docker/docker-compose.yaml), using the trtexec tool.

Static Batch Size

trtexec \
 --onnx=./model/nli_model_dynamic_bs.onnx \
 --saveEngine=nli_model_bs8.trt \
 --fp16

Dynamic Batch Size (Recommended)

Dynamic batch size allows the model to accept batches of varying sizes at inference time. Examples for each model:

Embedding Model

trtexec \
 --onnx=./embedding_models/bge_m3_embedding_dynamic.onnx \
 --saveEngine=bge_m3_model_dynamic_bs.trt \
 --fp16 \
 --minShapes=input_ids:1x256,attention_mask:1x256 \
 --optShapes=input_ids:8x256,attention_mask:8x256 \
 --maxShapes=input_ids:32x256,attention_mask:32x256

Reranker Model

trtexec \
 --onnx=./reranker_models/bge_reranker_large_dynamic.onnx \
 --saveEngine=bge_reranker_large_dynamic_bs.trt \
 --fp16 \
 --minShapes=input_ids:1x256,attention_mask:1x256 \
 --optShapes=input_ids:8x256,attention_mask:8x256 \
 --maxShapes=input_ids:32x256,attention_mask:32x256

NLI Model

trtexec \
 --onnx=./nli_models/nli_model_dynamic_bs.onnx \
 --saveEngine=nli_model_dynamic_bs.trt \
 --fp16 \
 --minShapes=input_ids:1x256,attention_mask:1x256 \
 --optShapes=input_ids:8x256,attention_mask:8x256 \
 --maxShapes=input_ids:32x256,attention_mask:32x256

trtexec Parameter Reference

Parameter | Description
--onnx | Input ONNX model path
--saveEngine | Output TensorRT engine path
--fp16 | Enable FP16 precision to accelerate inference and reduce memory usage
--minShapes | Minimum input size for dynamic shapes: tensor_name:dim0xdim1
--optShapes | Optimal input size for dynamic shapes (affects TRT optimization focus)
--maxShapes | Maximum input size for dynamic shapes
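
The three shape flags share the same tensor_name:dim0xdim1 format, so they can be generated from a batch range and a sequence length. The following is a minimal sketch, not part of the repository: the helper names shape_spec and build_trtexec_cmd are hypothetical, and only the flags listed in the table above are used.

```python
import shlex


def shape_spec(batch, seq_len, inputs=("input_ids", "attention_mask")):
    """Format one shape argument, e.g. input_ids:8x256,attention_mask:8x256."""
    return ",".join(f"{name}:{batch}x{seq_len}" for name in inputs)


def build_trtexec_cmd(onnx_path, engine_path, min_bs=1, opt_bs=8, max_bs=32,
                      seq_len=256, fp16=True):
    """Return the trtexec argument list for a dynamic-batch engine build."""
    if not (1 <= min_bs <= opt_bs <= max_bs):
        raise ValueError("expected 1 <= min_bs <= opt_bs <= max_bs")
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    cmd += [
        f"--minShapes={shape_spec(min_bs, seq_len)}",
        f"--optShapes={shape_spec(opt_bs, seq_len)}",
        f"--maxShapes={shape_spec(max_bs, seq_len)}",
    ]
    return cmd


cmd = build_trtexec_cmd("./nli_models/nli_model_dynamic_bs.onnx",
                        "nli_model_dynamic_bs.trt")
# Print a copy-pasteable command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Generating the flags from one set of numbers keeps the min, opt, and max shapes consistent with each other and with the --max_length used during ONNX export.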

After conversion, set the .trt file path in the model_path field of configs/config.yaml.

Server Configuration File

Edit configs/config.yaml to configure your models:

nli_models:
  xlm-roberta-large-xnli:
    model_name: "xlm-roberta-large-xnli"
    model_path: "./model/nlimodels/trtmodels/nli_model_dynamic_bs.trt"
    tokenizer_path: "joeddav/xlm-roberta-large-xnli"
    reuse_dynamic_buffer: true
    cuda_graph_list:
      - 1
      - 3
      - 5
embedding_models:
  bge-m3:
    model_name: "bge-m3"
    model_path: "./model/embedding_models/trt_models/bge_m3_model_dynamic_bs.trt"
    tokenizer_path: "./model/embedding_models/bge-m3-tokenizer"
    reuse_dynamic_buffer: true
    cuda_graph_list:
      - 1
      - 3
      - 5
reranking_models:
  bge-reranker-large:
    model_name: "bge-reranker-large"
    model_path: "./model/reranker_models/trt_models/bge_reranker_large_dynamic_bs.trt"
    tokenizer_path: "./model/reranker_models/bge-reranker-large-tokenizer"
    reuse_dynamic_buffer: true
    cuda_graph_list:
      - 1
      - 3
      - 5

Configuration Reference

  • model_name: Model identifier name
  • model_path: TensorRT model file path
  • tokenizer_path: Tokenizer path (can be a local path or a Hugging Face model name)
  • reuse_dynamic_buffer: Whether to pre-allocate buffers during initialization for dynamic batch sizes, avoiding dynamic allocation on each inference call
  • cuda_graph_list: Batch sizes for which a CUDA Graph is captured in advance. Replaying the captured, fixed sequence of GPU kernel calls for these batch sizes reduces CPU-side kernel launch overhead
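
A TensorRT engine only accepts input shapes inside the min/max range it was built with, so the batch sizes in cuda_graph_list presumably need to fall within that range (1 to 32 in the trtexec examples). The sketch below checks a parsed config for this and for the required fields. It is an illustration only: check_config is a hypothetical helper, the config is written as a plain dict rather than loaded from YAML, and this README does not say whether the server performs a similar check.

```python
def check_config(config, min_bs=1, max_bs=32):
    """Return a list of problems; an empty list means the config looks consistent."""
    problems = []
    for group, models in config.items():
        for name, spec in models.items():
            # Fields described in the configuration reference.
            for key in ("model_name", "model_path", "tokenizer_path"):
                if not spec.get(key):
                    problems.append(f"{group}/{name}: missing {key}")
            # Assumed constraint: graph batch sizes must fit the engine's shape range.
            for bs in spec.get("cuda_graph_list", []):
                if not isinstance(bs, int) or not (min_bs <= bs <= max_bs):
                    problems.append(
                        f"{group}/{name}: cuda_graph_list entry {bs!r} "
                        f"outside [{min_bs}, {max_bs}]"
                    )
    return problems


# Same structure as configs/config.yaml, reduced to one model.
config = {
    "embedding_models": {
        "bge-m3": {
            "model_name": "bge-m3",
            "model_path": "./model/embedding_models/trt_models/bge_m3_model_dynamic_bs.trt",
            "tokenizer_path": "./model/embedding_models/bge-m3-tokenizer",
            "reuse_dynamic_buffer": True,
            "cuda_graph_list": [1, 3, 5],
        }
    }
}
print(check_config(config))
```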

Starting the Server

Using the Script

chmod +x start_tensorrt_server.sh
./start_tensorrt_server.sh

After startup, the API is available at http://{ip}:{port}. The examples below assume http://localhost:8887.

API Usage

The service provides two API formats: the native API and the OpenAI-compatible API.

List Available Models

curl http://localhost:8887/models

Response:

{
  "models": {
    "embedding_models": ["bge-m3"],
    "reranking_models": ["bge-reranker-large"],
    "nli_models": ["xlm-roberta-large-xnli"]
  }
}
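
A client that talks to several models may want to look up a model's category by name before choosing a payload format. A minimal sketch, assuming the response shape shown above; index_models is a hypothetical helper, and the sample dict stands in for the parsed JSON of a real /models call.

```python
def index_models(response):
    """Map each model name to its category, e.g. {'bge-m3': 'embedding_models'}."""
    return {
        name: category
        for category, names in response.get("models", {}).items()
        for name in names
    }


# Stand-in for requests.get("http://localhost:8887/models").json()
sample = {
    "models": {
        "embedding_models": ["bge-m3"],
        "reranking_models": ["bge-reranker-large"],
        "nli_models": ["xlm-roberta-large-xnli"],
    }
}
kinds = index_models(sample)
print(kinds["bge-m3"])
```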

Native API Format

1. Embedding Inference

import requests

# Native endpoint: POST /infer/<model_name>
url = "http://localhost:8887/infer/bge-m3"
payload = {"documents": ["This is a test text", "Another test text"]}

response = requests.post(url, json=payload)
print(response.json())
# Output:
# {
#   "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...]],
#   "elapsed_ms": 15.2
# }

2. Reranker Inference

import requests

url = "http://localhost:8887/infer/bge-reranker-large"
# One query is scored against every candidate document.
payload = {
    "query": "Theory in machine learning is important",
    "documents": [
        "Theory is very important for understanding machine learning",
        "Practical experience is also crucial in machine learning",
    ],
}

response = requests.post(url, json=payload)
print(response.json())
# Output:
# {
#   "scores": [9.5078125, 7.2421875],
#   "elapsed_ms": 5.78
# }

3. NLI (Natural Language Inference)

import requests

url = "http://localhost:8887/infer/xlm-roberta-large-xnli"
# Two lists of equal length, one premise per hypothesis.
payload = {
    "premises": ["The weather is nice today", "Cats are animals"],
    "hypotheses": ["Today is sunny", "Dogs are animals"],
}

response = requests.post(url, json=payload)
print(response.json())
# Output:
# {
#   "predictions": ["entailment", "neutral"],
#   "logits": [[2.18, -1.38, -0.72], [1.02, -0.42, -0.65]],
#   "elapsed_ms": 12.45
# }

OpenAI-Compatible API

Note: Only Embedding and Reranker models are supported; NLI models do not support this format.

1. Embedding API

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # any value works
    base_url="http://localhost:8887/v1",
)

text = "This is a test text"
response = client.embeddings.create(
    input=[text],
    model="bge-m3",
)
print(response.data[0].embedding)

2. Reranker API

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8887/v1",
)

documents = [
    "Machine learning is best learned through projects",
    "Theory is important for understanding machine learning",
]

# Reranking reuses the embeddings endpoint; the query travels in extra_body.
response = client.embeddings.create(
    model="bge-reranker-large",
    input=documents,
    extra_body={"query": "Theory is important for understanding machine learning"},
)

# Get reranking scores (returned in the embedding field of each item)
scores = [data.embedding for data in response.data]
print(scores)

Request Parameter Reference

Embedding Request

  • documents (required): A string or list of strings to encode (passed as input when using the OpenAI SDK)
  • model (optional): Model name; required when using the OpenAI-compatible API

Reranker Request

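  • query (required): Query string (passed through extra_body when using the OpenAI SDK)
  • documents (required): List of candidate documents (passed as input when using the OpenAI SDK)
  • model (optional): Model name; required when using the OpenAI-compatible API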

NLI Request

  • premises (required): List of premise sentences
  • hypotheses (required): List of hypothesis sentences
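
The NLI example suggests that premises and hypotheses are matched by position, so the two lists should have the same length. Building the payload from explicit pairs makes a length mismatch impossible. This is a small sketch under that assumption; build_nli_payload is a hypothetical helper, not part of the server or its client code.

```python
def build_nli_payload(pairs):
    """Turn (premise, hypothesis) pairs into the native NLI request body."""
    premises, hypotheses = [], []
    for premise, hypothesis in pairs:
        if not premise or not hypothesis:
            raise ValueError("premise and hypothesis must be non-empty strings")
        premises.append(premise)
        hypotheses.append(hypothesis)
    if not premises:
        raise ValueError("at least one pair is required")
    return {"premises": premises, "hypotheses": hypotheses}


payload = build_nli_payload([
    ("The weather is nice today", "Today is sunny"),
    ("Cats are animals", "Dogs are animals"),
])
# The payload can then be sent with requests.post(url, json=payload).
print(payload)
```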

Response Format Reference

Embedding Response

{
 "embeddings": [[0.1, 0.2, ...]], // list of embedding vectors
 "elapsed_ms": 15.2 // inference time (milliseconds)
}
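
Embedding vectors are usually compared with cosine similarity. The sketch below shows the calculation on a response of the shape above, using only the standard library. The vectors are toy values chosen for illustration, not real model output, and cosine_similarity is a helper defined here rather than something the server provides.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy response with 3-dimensional vectors; real embeddings are much longer.
response = {"embeddings": [[0.1, 0.2, 0.3], [0.3, 0.4, 0.5]], "elapsed_ms": 15.2}
first, second = response["embeddings"]
similarity = cosine_similarity(first, second)
print(round(similarity, 4))
```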

Reranker Response

{
 "scores": [9.5078125], // list of relevance scores
 "elapsed_ms": 5.78 // inference time (milliseconds)
}
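
The scores list appears to follow the order of the submitted documents, so ranking is a matter of pairing and sorting. The sketch below does that and optionally squashes the raw scores into the 0 to 1 range with a sigmoid, a common convention for cross-encoder rerankers. Both the index alignment and the sigmoid step are assumptions for illustration; the server itself returns the raw scores shown above.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank_documents(documents, scores, normalize=False):
    """Return (document, score) pairs sorted from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    if normalize:
        scores = [sigmoid(s) for s in scores]
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


documents = [
    "Practical experience is also crucial in machine learning",
    "Theory is very important for understanding machine learning",
]
scores = [7.2421875, 9.5078125]  # values taken from the example response
ranked = rank_documents(documents, scores, normalize=True)
for doc, score in ranked:
    print(f"{score:.4f}  {doc}")
```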

NLI Response

{
 "predictions": ["entailment"], // list of predicted labels
 "logits": [[2.18, -1.38, -0.72]], // raw scores
 "elapsed_ms": 12.45 // inference time (milliseconds)
}

NLI Label Descriptions

  • entailment: The premise supports the hypothesis
  • neutral: The premise neither supports nor contradicts the hypothesis
  • contradiction: The premise and hypothesis conflict
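
The logits field holds raw, unnormalized scores. If a confidence value is needed, a softmax turns each row into probabilities that sum to 1. The sketch below is a client-side illustration only. It does not assume any particular label order for the three logits (this README does not state one), so it reports the highest probability next to the label the server already returned in predictions.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Response of the shape shown above, with the example values.
response = {
    "predictions": ["entailment"],
    "logits": [[2.18, -1.38, -0.72]],
    "elapsed_ms": 12.45,
}
for label, row in zip(response["predictions"], response["logits"]):
    probs = softmax(row)
    print(label, round(max(probs), 3))
```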

Performance Benchmarks

The following results were obtained on an NVIDIA A100 GPU:

Embedding Inference Performance

[Figures: Embedding Performance Chart, Embedding Detailed Metrics, Embedding CUDA Graph Metrics]

Test Configuration:

  • Model: BGE-M3
  • Batch size: 1–64
  • Inference comparison: Torch vs. TensorRT
  • Server comparison: Sequential inference vs. Dynamic batching

Reranker Inference Performance

[Figures: Reranker Performance Chart, Reranker Detailed Metrics]

Test Configuration:

  • Model: BGE-Reranker-Large
  • Batch size: 1–64
  • Inference comparison: Torch vs. TensorRT
  • Server comparison: Sequential inference vs. Dynamic batching

NLI Inference Performance

[Figures: NLI Performance Chart, NLI Detailed Metrics]

Test Configuration:

  • Model: XLM-RoBERTa-Large-XNLI
  • Batch size: 1–64
  • Inference comparison: Torch vs. TensorRT
  • Server comparison: Sequential inference vs. Dynamic batching

Architecture Design

System Architecture Diagram

sequenceDiagram
 participant Client as Client
 participant API as FastAPI Service
 participant Worker as Worker Handler
 participant TRT as TensorRT Inferencer
 participant GPU as GPU (CUDA + TensorRT)
 Client->>API: HTTP Request (/infer, /v1/embeddings)
 API->>Worker: Dynamic Queue (payload, Future)
 Worker->>Worker: Collect Batch (max_batch=32, max_wait=10ms)
 Worker->>TRT: Call model.infer(all_docs)
 TRT->>TRT: Tokenizer → input_ids, attention_mask
 TRT->>TRT: Convert dtype → engine dtype
 TRT->>GPU: H2D (memcpy_htod_async)
 Note right of GPU: GPU buffer ready
 TRT->>GPU: set_input_shape / set_tensor_address
 TRT->>GPU: execute_async_v3()
 GPU-->>TRT: GPU inference complete
 Note over GPU: Engine executes inference on GPU
 TRT->>GPU: D2H (memcpy_dtoh_async)
 TRT->>TRT: Reshape and truncate to original length
 TRT-->>Worker: Predictions / logits
 Worker->>Worker: Split batch results
 Worker-->>API: Future.set_result()
 API-->>Client: Return JSON result
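
The batching step in the diagram (queue of payload and Future, collect up to max_batch=32 or until max_wait=10ms, run one inference, split the results) can be sketched with asyncio. This is an illustration of the pattern, not the repository's implementation: batch_worker, submit, and fake_infer are hypothetical names, and fake_infer stands in for the TensorRT call.

```python
import asyncio

MAX_BATCH = 32       # max_batch from the diagram
MAX_WAIT_S = 0.010   # max_wait from the diagram (10 ms)


async def batch_worker(queue, infer_fn):
    """Collect queued requests into batches and resolve their futures."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]              # block until the first request
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        try:
            results = infer_fn([payload for payload, _ in batch])
        except Exception as exc:                 # propagate failures to callers
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            continue
        for (_, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)        # split batch results


async def submit(queue, payload):
    """Enqueue one request and wait for its individual result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((payload, future))
    return await future


async def demo():
    queue = asyncio.Queue()
    batch_sizes = []

    def fake_infer(docs):                        # stand-in for model.infer(all_docs)
        batch_sizes.append(len(docs))
        return [len(doc) for doc in docs]

    worker = asyncio.create_task(batch_worker(queue, fake_infer))
    results = await asyncio.gather(*(submit(queue, "x" * n) for n in range(1, 6)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

The wait bound caps the extra latency a lone request pays, while the batch bound keeps the batch within the maximum shape the engine was built for.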

License

This project is licensed under the MIT License.
