Best practices & guides on how to write distributed PyTorch training code
Updated Oct 22, 2025 · Python
Meta Llama 3 GenAI real-world use cases: an end-to-end implementation guide
Llama-style transformer in PyTorch with multi-node / multi-GPU training. Includes pretraining, fine-tuning, DPO, LoRA, and knowledge distillation. Scripts for dataset mixing and training from scratch.
🦾💻🌐 distributed training & serverless inference at scale on RunPod
Fast and easy distributed model training examples.
A script for training ConvNeXt V2 on the CIFAR-10 dataset with FSDP for distributed training.
Minimal yet high-performance code for pretraining LLMs. Attempts to implement some SOTA features. Implements training through DeepSpeed, Megatron-LM, and FSDP. WIP.
Simple and efficient implementation of the 671B DeepSeek V3 that is trainable with FSDP+EP on a minimum of 256x A100/H100 GPUs, targeted at the Hugging Face ecosystem.
Implementations of some popular approaches for efficient deep learning training and inference
Framework, Model & Kernel Optimizations for Distributed Deep Learning - Data Hack Summit
Dataloading for JAX
Scalable multimodal AI system combining FSDP, RLHF, and Inferentia optimization for customer insights generation.
Mini-FSDP for PyTorch. Minimal single-node Fully Sharded Data Parallel wrapper with param flattening, grad reduce-scatter, AMP, and tiny GPT/BERT training examples.
A foundational repository for setting up distributed training jobs using Kubeflow and PyTorch FSDP.
Training Qwen3 to solve Wordle using SFT and GRPO
Fully Sharded Data Parallel (FSDP) implementation of Transformer XL