tensorrt-llm

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

awesome hpc gpu cuda pytorch cublas triton blas llama cutlass cudnn gemm vlm tensorrt ptx tvm mlir llm tensorrt-llm deepseek

Updated Aug 2, 2025

huggingface / optimum-benchmark

Star 318

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of Optimum's hardware optimizations & quantization schemes.

benchmark pytorch openvino onnxruntime text-generation-inference neural-compressor tensorrt-llm

Updated Sep 25, 2025
Python

npuichigo / openai_trtllm

Star 215

OpenAI-compatible API for the TensorRT-LLM Triton backend

triton-inference-server openai-api llm langchain tensorrt-llm

Updated Aug 1, 2024
Rust

Deep learning deployment framework: supports tf/torch/trt/trtllm/vllm and other NN frameworks, with dynamic batching and streaming modes. It works with both Python and C++, offering scalability, extensibility, and high performance. It helps users quickly deploy models and serve them through HTTP/RPC interfaces.

tensorflow torch tensorrt serving triton-inference-server dynamic-batching vllm tensorrt-llm

Updated May 8, 2025
C++

NetEase-Media / grps_trtllm

Star 156

Higher-performance OpenAI LLM service than vLLM serve: a pure C++ high-performance OpenAI LLM service implemented with GRPS+TensorRT-LLM+Tokenizers.cpp, supporting chat and function calling, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.

openai multi-modal phi function-call qwq ai-agent llm llama-index chatglm internvideo tensorrt-llm qwen2 llama3 minicpm-v internvl qwen2-vl deepseek-r1 janus-pro olmocr qwen3

Updated May 14, 2025
Python

vossr / Chat-With-RTX-python-api

Star 64

Chat With RTX Python API

tensorrt llm llm-inference tensorrt-llm mistral-7b llama2-13b chat-with-rtx nvidia-chat-with-rtx

Updated May 11, 2025
Python

guidance-ai / llgtrt

Star 59

TensorRT-LLM server with Structured Outputs (JSON) built with Rust

json regex guidance cfg openai-api tensorrt-llm structured-generation

Updated Apr 25, 2025
Rust

argonne-lcf / LLM-Inference-Bench

Star 55

LLM-Inference-Bench

benchmark inference deepspeed llm llamacpp vllm tensorrt-llm

Updated Jul 18, 2025
Jupyter Notebook

modal-labs / stopwatch

Star 43

A tool for benchmarking LLMs on Modal

machine-learning llms vllm tensorrt-llm sglang

Updated Aug 29, 2025
Python

menloresearch / cortex.tensorrt-llm

Star 42

Cortex.Tensorrt-LLM is a C++ inference library that can be loaded by any server at runtime. It includes NVIDIA's TensorRT-LLM as a submodule for GPU-accelerated inference on NVIDIA GPUs.

nvidia jan tensorrt llm tensorrt-llm

Updated Sep 26, 2024
C++

fgblanch / OutlookLLM

Star 42

Add-in for the new Outlook that adds new LLM features (composition, summarization, Q&A). It uses a local LLM via NVIDIA TensorRT-LLM.

outlook-addin tensorrt-llm

Updated Jun 5, 2025
Python

CactusQ / TensorRT-LLM-Tutorial

Star 23

Getting started with TensorRT-LLM using BLOOM as a case study

jupyter-notebook deeplearning tensorrt tensorrt-inference llms llm-inference tensorrt-llm

Updated Mar 7, 2024
Jupyter Notebook

lix19937 / llm-deploy

Star 21

AI infra: LLM inference with tensorrt-llm and vllm

llm llm-inference tensorrt-llm

Updated Dec 17, 2024
Python

zRzRzRzRzRzRzR / lm-fly

Star 20

Accelerating large-model inference frameworks to make LLMs fly

mlx tgi openvino llm vllm llm-inference tensorrt-llm

Updated May 10, 2024
Python

EdVince / whisper-trtllm

Star 16

Whisper in TensorRT-LLM

cuda transformers openai whisper asr tensorrt huggingface tensorrt-llm

Updated Sep 21, 2023
C++

wcks13589 / LLM-Tutorial

Star 11

LLM tutorial materials, including but not limited to NVIDIA NeMo, TensorRT-LLM, Triton Inference Server, and NeMo Guardrails.

nemo nvidia-nemo llm nemo-guardrails tensorrt-llm

Updated Jun 26, 2025
Python

Delxrius / MiniMax-01

Star 5

MiniMax-01 is a simple implementation of the MiniMax algorithm, a widely used strategy for decision-making in two-player turn-based games like Tic-Tac-Toe. The algorithm aims to minimize the maximum possible loss for the player, making it a popular choice for developing AI opponents in various game scenarios.

chatbot minimax chat-api llm llm-inference flash-attention tensorrt-llm paged-attention deepseek hailuoai deepseek-v3 minimax-text-01 minimax-vl-01 minimax-01

Updated Oct 18, 2025
