wyjoutstanding
Stars
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
NVIDIA Linux open GPU kernel module source
KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)
SGLang is a high-performance serving framework for large language models and multimodal models.
A Datacenter Scale Distributed Inference Serving Framework
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
FlashInfer: Kernel Library for LLM Serving
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
FlagGems is an operator library for large language models implemented in the Triton Language.
How to optimize some algorithms in CUDA.
LaTeX Thesis Template for the University of Chinese Academy of Sciences
[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
Development repository for the Triton language and compiler
The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Efficient Deep Learning Systems course materials (HSE, YSDA)
Ongoing research training transformer models at scale
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training
The simplest, fastest repository for training/finetuning medium-sized GPTs.
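PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)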