A comprehensive, production-ready PyTorch project template with modular architecture, distributed training support, and modern tooling.
- 🧩 Modular Architecture: Registry-based component system for easy extensibility
- ⚙️ Configuration Management: Hierarchical config system with inheritance and CLI overrides
- 🚀 Distributed Training: Multi-node/multi-GPU training with DDP, FSDP, and DataParallel
- 📊 Experiment Tracking: MLflow and Weights & Biases integration with auto-visualization
- 🔧 Modern Tooling: uv package management, pre-commit hooks, Docker support
- 💾 Resume Training: Automatic checkpoint saving and loading with state preservation
- 🌐 Cross-Platform: Development on macOS (Apple Silicon MPS) and Linux with optimized builds
- 🐳 Development Environment: Devcontainer and Jupyter Lab integration
- ⚡ Performance Optimization: RAM caching, mixed precision, torch.compile support
- 📚 Auto Documentation: Sphinx-based API docs with live reloading
- 📱 Slack Notifications: Training completion and error notifications
- 🛡️ Error Handling: Robust error recovery and automatic retries
- Python: 3.11+
- Package Manager: uv
- CUDA: 12.8 (for GPU training)
- PyTorch: 2.7.1
Create a new project using this template:
```bash
# Option 1: Use as GitHub template (recommended)
# Click "Use this template" on GitHub

# Option 2: Clone and set up manually
git clone <your-repo-url>
cd your-project-name

# Option 3: Merge updates from this template
git remote add upstream https://github.com/mjun0812/PyTorch-Project-Template.git
git fetch upstream main
git merge --allow-unrelated-histories --squash upstream/main
```
```bash
# Copy environment template
cp template.env .env

# Edit .env with your API keys and settings
```
Example .env configuration:
```bash
# Slack notifications (optional)
# You can use either SLACK_TOKEN or SLACK_WEBHOOK_URL
SLACK_TOKEN="xoxb-your-token"
SLACK_CHANNEL="#notifications"
SLACK_USERNAME="Training Bot"

# Alternative: Webhook URL (simpler setup)
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/..."

# MLflow tracking
MLFLOW_TRACKING_URI="./result/mlruns"  # or remote URI

# Weights & Biases (optional)
WANDB_API_KEY="your-wandb-key"
```
Choose your preferred installation method:
```bash
# Install dependencies
uv sync

# Setup development environment
uv run pre-commit install

# Run training
uv run python train.py config/dummy.yaml
```
```bash
# Build container
./docker/build.sh

# Run training in container
./docker/run.sh python train.py config/dummy.yaml
```
Open the project in VS Code and use the devcontainer configuration for a consistent development environment.
Start with the dummy configuration to test your setup:
```bash
# Basic training with dummy dataset
python train.py config/dummy.yaml

# Override configuration from command line
python train.py config/dummy.yaml batch=32 gpu.use=0 optimizer.lr=0.001
```
This template uses hierarchical configuration with inheritance support:
```bash
# Use dot notation to modify nested values
python train.py config/dummy.yaml gpu.use=0,1 model.backbone.depth=50

# Multiple overrides
python train.py config/dummy.yaml batch=64 epoch=100 optimizer.lr=0.01

# View current configuration
python script/show_config.py config/dummy.yaml

# Batch edit configuration files
python script/edit_configs.py config/dummy.yaml "optimizer.lr=0.01,batch=64"
```
Configuration hierarchy (lowest to highest precedence; illustrated by the sketch after the list):
- Dataclass defaults (`src/config/config.py`)
- Base configs (`config/__base__/`)
- Experiment configs (`config/*.yaml`) with `__base__` inheritance
- CLI overrides
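Conceptually, each layer overrides the one before it. Purely as an illustration of that precedence (not the template's actual merge code, and with made-up keys and values), the layering behaves like nested dict merges:

```python
def deep_merge(low: dict, high: dict) -> dict:
    """Values in `high` take precedence; nested dicts are merged recursively."""
    out = dict(low)
    for key, value in high.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

defaults   = {"batch": 16, "epoch": 100, "optimizer": {"lr": 0.1}}  # dataclass defaults
base_cfg   = {"batch": 32}                                          # config/__base__/
experiment = {"optimizer": {"lr": 0.01}}                            # config/my_experiment.yaml
cli        = {"batch": 64}                                          # e.g. `batch=64` on the CLI

cfg = deep_merge(deep_merge(deep_merge(defaults, base_cfg), experiment), cli)
assert cfg == {"batch": 64, "epoch": 100, "optimizer": {"lr": 0.01}}
```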
```bash
# Launch Jupyter Lab for experimentation
./script/run_notebook.sh

# Start MLflow UI for experiment tracking
./script/run_mlflow.sh

# View all registered components
python script/show_registers.py

# View model architecture
python script/show_model.py

# Visualize learning rate schedules
python script/show_scheduler.py

# View data transformation pipeline
python script/show_transform.py

# Clean up orphaned result directories
python script/clean_result.py

# Aggregate MLflow results to CSV
python script/aggregate_mlflow.py all

# Start documentation server (auto-reloads on changes)
./script/run_docs.sh
```
Scale your training across multiple GPUs and nodes:
```bash
# Use torchrun for DDP training (recommended)
./torchrun.sh 4 train.py config/dummy.yaml gpu.use="0,1,2,3"

# Alternative: DataParallel (not recommended for production)
python train.py config/dummy.yaml gpu.use="0,1,2,3" gpu.multi_strategy="dp"
```
```bash
# Master node (node 0)
./multinode.sh 2 4 12345 0 master-ip:12345 train.py config/dummy.yaml gpu.use="0,1,2,3"

# Worker nodes (node 1+)
./multinode.sh 2 4 12345 1 master-ip:12345 train.py config/dummy.yaml gpu.use="0,1,2,3"
```
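For reference, `torchrun` (and the multi-node script) launches one process per GPU; inside each process the training code typically initializes the process group and wraps the model in `DistributedDataParallel`. A minimal sketch of that standard pattern (not the template's exact runner code):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and MASTER_ADDR for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Each process holds one replica; gradients are all-reduced across GPUs.
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```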
For very large models that don't fit in GPU memory:
```bash
python train.py config/dummy.yaml gpu.multi_strategy="fsdp" gpu.fsdp.min_num_params=100000000
```

Training results are automatically saved to:
```
result/[dataset_name]/[date]_[model_name]_[tag]/
├── config.yaml   # Complete configuration used
├── models/       # Model checkpoints (latest.pth, best.pth, epoch_N.pth)
├── optimizers/   # Optimizer states
└── schedulers/   # Scheduler states
```
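If you need to inspect a saved checkpoint outside the training and test entry points, it can be loaded directly with standard PyTorch. A minimal sketch (the exact contents and key names depend on how the checkpoint was saved, so treat the `"model"` key below as an assumption):

```python
import torch

# Path follows the result layout above.
ckpt_path = "result/dataset_name/20240108_ResNet_experiment/models/best.pth"
state = torch.load(ckpt_path, map_location="cpu")

# The file may hold a plain state_dict or a dict that wraps one; adjust as needed.
state_dict = state.get("model", state) if isinstance(state, dict) else state
# model.load_state_dict(state_dict)  # with `model` built from the same config
```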
Resume interrupted training using saved checkpoints:
```bash
# Resume from automatically saved checkpoint
python train.py result/dataset_name/20240108_ResNet_experiment/config.yaml

# Resume and extend training
python train.py result/dataset_name/20240108_ResNet_experiment/config.yaml epoch=200

# Resume with different configuration
python train.py result/dataset_name/20240108_ResNet_experiment/config.yaml gpu.use=1 batch=64
```
Run evaluation separately from training:
```bash
# Evaluate using saved model configuration
python test.py result/dataset_name/20240108_ResNet_experiment/config.yaml

# Evaluate with different GPU
python test.py result/dataset_name/20240108_ResNet_experiment/config.yaml gpu.use=1
```
Speed up training by caching datasets in RAM:
```bash
python train.py config/dummy.yaml use_ram_cache=true ram_cache_size_gb=16
```
Implement caching in your custom dataset:
```python
def __getitem__(self, idx):
    if self.cache is not None and idx in self.cache:
        data = self.cache.get(idx)
    else:
        data = self.load_data(idx)  # Your data loading logic
        if self.cache is not None:
            self.cache.set(idx, data)
    return data
```
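The snippet above assumes a cache object exposing `get`/`set` and membership checks. The template ships its own RAM cache; purely for illustration, a hypothetical minimal stand-in could look like this:

```python
import sys

class SimpleRamCache:
    """Hypothetical in-memory cache with a rough size limit (illustration only)."""

    def __init__(self, max_size_gb: float = 16.0):
        self._store: dict = {}
        self._max_bytes = int(max_size_gb * 1024**3)
        self._used_bytes = 0

    def __contains__(self, key) -> bool:
        return key in self._store

    def get(self, key):
        return self._store[key]

    def set(self, key, value) -> None:
        # Rough size estimate; tensors and nested objects need a better measure.
        size = sys.getsizeof(value)
        if self._used_bytes + size <= self._max_bytes:
            self._store[key] = value
            self._used_bytes += size
```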
```bash
# Enable automatic mixed precision with fp16
python train.py config/dummy.yaml use_amp=true amp_dtype="fp16"

# Use bfloat16 for newer hardware (A100, H100)
python train.py config/dummy.yaml use_amp=true amp_dtype="bf16"
```
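These flags correspond to PyTorch's automatic mixed precision. For reference, the underlying pattern looks roughly like the sketch below (a generic loop that assumes the model returns a loss; not the template's exact runner code):

```python
import torch

def train_one_epoch_amp(model, optimizer, dataloader, dtype=torch.float16):
    # Gradient scaling is mainly needed for fp16; bf16 usually works without it.
    scaler = torch.amp.GradScaler("cuda")
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        # Run the forward pass in reduced precision.
        with torch.autocast(device_type="cuda", dtype=dtype):
            loss = model(inputs.cuda(), targets.cuda())
        scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)         # unscale gradients, then step the optimizer
        scaler.update()
```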
```bash
# Enable PyTorch 2.0 compilation for speedup
python train.py config/dummy.yaml use_compile=true compile_backend="inductor"

# Alternative backends
python train.py config/dummy.yaml use_compile=true compile_backend="aot_eager"
```
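These options correspond to `torch.compile`. As a generic illustration (the toy model here is made up):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
# Compile once; the first iterations are slower while kernels are generated.
compiled = torch.compile(model, backend="inductor")
out = compiled(torch.randn(4, 128))
```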
Get notified about training progress and errors:
```bash
# Training will automatically send notifications on completion/error
python train.py config/dummy.yaml

# Manual notification testing
uv run --frozen pytest tests/test_slack_notification.py -v
```
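If you want to send an ad-hoc message with the same `SLACK_WEBHOOK_URL` from `.env` (independently of the template's own notifier, whose API may differ), the standard incoming-webhook call looks like this:

```python
import json
import os
import urllib.request

def notify_slack(text: str) -> None:
    # Slack incoming webhooks accept a JSON payload with a "text" field.
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

notify_slack("Training finished ✅")
```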
```
src/
├── config/        # Configuration management with inheritance
├── dataloaders/   # Dataset and DataLoader implementations
├── models/        # Model definitions and backbones
│   ├── backbone/  # Pre-trained backbones (ResNet, Swin, etc.)
│   ├── layers/    # Custom layers and building blocks
│   └── losses/    # Loss function implementations
├── optimizer/     # Optimizer builders (including ScheduleFree)
├── scheduler/     # Learning rate schedulers
├── transform/     # Data preprocessing and augmentation
├── evaluator/     # Metrics and evaluation
├── runner/        # Training and testing loops
└── utils/         # Utilities (logger, registry, torch utils)
```

```
config/
├── __base__/      # Base configuration templates
└── *.yaml         # Experiment configurations
```

```
script/            # Utility scripts
├── run_*.sh       # Service startup scripts
├── show_*.py      # Visualization tools
└── aggregate_*.py # Result aggregation tools
```
Components are registered using decorators for dynamic instantiation:
```python
from src.models import MODEL_REGISTRY

@MODEL_REGISTRY.register()
class MyModel(BaseModel):
    def __init__(self, ...):
        super().__init__()
        # Model implementation

# Custom name registration
@MODEL_REGISTRY.register("custom_name")
class AnotherModel(BaseModel):
    pass
```
Available registries:
- `MODEL_REGISTRY`: Model architectures
- `DATASET_REGISTRY`: Dataset implementations
- `TRANSFORM_REGISTRY`: Data transformations
- `OPTIMIZER_REGISTRY`: Optimizers
- `LR_SCHEDULER_REGISTRY`: Learning rate schedulers
- `EVALUATOR_REGISTRY`: Evaluation metrics
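For intuition, the registry pattern boils down to a name-to-class mapping that configs can reference by string. A hypothetical minimal version (the template's actual `Registry` in `src/utils` may offer more, such as building instances directly from config):

```python
class Registry:
    """Hypothetical minimal registry; illustrates the pattern, not the template's class."""

    def __init__(self, name: str):
        self.name = name
        self._classes: dict[str, type] = {}

    def register(self, name: str | None = None):
        def decorator(cls: type) -> type:
            self._classes[name or cls.__name__] = cls
            return cls
        return decorator

    def get(self, name: str) -> type:
        return self._classes[name]

MODEL_REGISTRY = Registry("model")

@MODEL_REGISTRY.register()
class MyModel:
    pass

model_cls = MODEL_REGISTRY.get("MyModel")  # dynamic lookup by the name used in a config
```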
The configuration system supports inheritance and modular composition:
```yaml
# config/my_experiment.yaml
__base__: "__base__/config.yaml"

# Override specific values
batch: 64
optimizer:
  lr: 0.001

# Import specific sections
transform:
  __import__: "__base__/transform/imagenet.yaml"
```
The template includes comprehensive error handling; a minimal sketch of the overall pattern follows the list below:
- Automatic Slack notifications for training completion and errors
- Graceful error recovery with detailed logging
- Checkpoint preservation even during failures
- Distributed training fault tolerance
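As a rough illustration of how these behaviors fit together (a generic sketch, not the template's actual runner; the `trainer`, `save_checkpoint`, and `notify` arguments are hypothetical placeholders):

```python
import traceback

def run_training(trainer, save_checkpoint, notify):
    """Generic sketch: always preserve state and notify, whatever happens."""
    try:
        trainer.fit()
        notify("Training completed ✅")
    except Exception:
        save_checkpoint(tag="on_failure")  # keep the latest state recoverable
        notify(f"Training failed ❌:\n{traceback.format_exc()}")
        raise                              # re-raise so the process exits non-zero
```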
```bash
# Run all tests
uv run --frozen pytest

# Run specific test modules
uv run --frozen pytest tests/test_modules.py
uv run --frozen pytest tests/test_slack_notification.py -v

# Run with verbose output
uv run --frozen pytest -v
```
```bash
# Format code
uv run --frozen ruff format .

# Check code quality
uv run --frozen ruff check .

# Fix auto-fixable issues
uv run --frozen ruff check . --fix
```
```bash
# Start documentation server with live reload
./script/run_docs.sh
```

```bash
# Build development image
./docker/build.sh

# Run commands in container
./docker/run.sh python train.py config/dummy.yaml
./docker/run.sh bash  # Interactive shell
```