Clear-Top/ai-inference-resources

Learning Guide: AI Inference Engineering

Purpose

A curated collection of resources for engineers working on AI inference systems, covering LLM serving, GPU kernel programming, attention mechanisms, quantization, distributed inference, and production deployment. Compiled from the AER Labs community.

How to read

Recommended reading order:

  1. Read "Tier 1" for all topics first (foundational concepts)
  2. Read "Tier 2" for all topics (intermediate depth)
  3. Read "Tier 3" for all topics (advanced / cutting-edge)

Table of contents


1. LLM Inference Fundamentals

Tier 1

Tier 2

Tier 3

2. Inference Engines & Serving Systems

Tier 1

Tier 2

Tier 3

3. Attention Mechanisms & Memory Optimization

Tier 1

Tier 2

Tier 3

4. Quantization & Model Compression

Tier 1

Tier 2

Tier 3

5. CUDA & GPU Kernel Programming

Tier 1

Tier 2

Tier 3

6. Structured Output & Guided Decoding

Tier 1

Tier 2

7. Distributed & Multi-GPU Inference

Tier 1

Tier 2

Tier 3

  • How To Scale Your Model - JAX Team. Comprehensive book covering TPU/GPU architecture, inter-device communication, and parallelism strategies for training and inference at scale.

8. Post-Training & Fine-Tuning

Tier 1

  • Post-training 101 - Han Fang, Karthik A Sankararaman. Hitchhiker's guide to LLM post-training covering RLHF, DPO, and modern alignment techniques.

Tier 2

9. Hardware Architecture & Co-Design

Tier 1

  • Domain-Specific Architectures - Fleetwood. Overview of domain-specific hardware design principles and their application to AI accelerator architectures.

Tier 2

10. State-Space Models & Alternative Architectures

Tier 2

11. Compiler & DSL Approaches

Tier 1

  • Helion: Python-Embedded DSL for ML Kernels - PyTorch. A Python-embedded domain-specific language for writing fast, scalable ML kernels with minimal boilerplate, lowering the barrier to custom kernel development.

Tier 2

  • AOTInductor: Ahead-of-Time Compilation for PyTorch - PyTorch. Official documentation for AOTInductor, enabling ahead-of-time compilation of PyTorch models for deployment without Python runtime dependency.

  • Helion Flex Attention Example - PyTorch. Reference implementation of flexible attention variants using Helion DSL, demonstrating how to write custom attention kernels with minimal code.

  • CUDA Tile IR - NVIDIA. MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns targeting NVIDIA tensor cores.

  • cuTile Python Samples - PeaBrane. Sample implementations using the cuTile programming model for writing parallel GPU kernels.

  • Intel ISPC: Implicit SPMD Program Compiler - Intel. Open-source compiler for high-performance SIMD programming on CPU and GPU using an implicit SPMD model.

12. Confidential & Secure Inference

Tier 2

13. AI Agents & LLM Tooling

Tier 1

  • AgentKernelArena - AMD AGI. End-to-end benchmarking environment for evaluating LLM-powered coding agents (Cursor, Claude Code, Codex, SWE-agent, GEAK) on CUDA kernel writing tasks.

Tier 2

14. Production Inference at Scale

Tier 2

Tier 3

15. Benchmarking & Profiling

Tier 1

  • Evaluation Guidebook - OpenEvals / HuggingFace. Comprehensive guide to evaluating AI models, covering evaluation methodologies, metrics, and best practices.

  • AI Hardware Benchmarking & Performance Analysis - Artificial Analysis. Comprehensive benchmarking of AI accelerator systems for LLM inference across chip configurations, inference software, and concurrent load scaling.

Tier 2

16. Courses & Comprehensive Guides

Tier 1

Tier 2

17. Tools & Libraries

Tier 1

Tier 2

18. Reference Collections

  • GPU Performance Engineering Resources - Wafer AI. Comprehensive tiered learning guide for GPU kernel programming and optimization, covering fundamentals through production deployment.

  • AER Labs Blog - AER Labs. Technical blog covering AI inference optimization, vLLM architecture, PagedAttention, KV cache systems, and LLM deployment strategies.


Contributing

Have a resource to share? Open a pull request or issue with the link, a brief description, and suggested category/tier placement.

License

MIT
