Mechanical Sympathy is All You Need
Built and scaled systems that can handle 5k->250k RPS w/o breaking a sweat.
Got into model serving and inference; enjoyed solving cold starts, intelligent routing, and GPU cluster utilization. Did a bit of RAG & agents infra. Currently in ML infra: training, inference, comms collectives, storage, compiler backends, custom kernel optimizations, and researching novel techniques.
High-agency individual deep in agentic-engineering mode. AI tools have enabled me to touch infra end to end, from user-facing APIs & infra to tensors to metal. Always looking to maximize my learning curve.
- 🐨 WombatKV - KV blocks survive restarts, save prefill FLOPs - Object-storage-native KV cache for inference.
- ⚡ myelon - HFT-grade LMAX-Disruptor multiprocess IPC over SHM & mmap. 240 ns P99 · 5.58 M ops/s · 92.6 GB/s.
- YALI - Ultra-low-latency GPU comms collective. Outperforms NVIDIA NCCL P2P by 1.2-2.4x.
- 🪢 GPU Kernel Batcher - Batching identical GEMMs into one cuBLAS call - 90%+ fewer launches, 22% faster FP16 workloads.
- ⏲️ Metered Compute - 5 reference architectures for reliably metering sync and async compute.
- Inference Assayer - Fast, deterministic, compiler-driven simulator lab for analyzing inference performance across models <> hardware.
Inspired by