Got into model serving and inference, and enjoyed solving cold starts, intelligent routing, and GPU cluster utilization. Did a bit of RAG & agents infra. Currently in ML infra: training, inference, comms collectives, storage, compiler backends, custom kernel optimizations, and researching novel techniques.

High-agency individual deep in agentic-engineering mode. AI tools have enabled me to touch end-to-end infra, from user-facing APIs & infra to tensors to metal. Always looking to maximize my learning curve 📈

ʕ•ᴥ•ʔ venkat.systems

Highlights

Projects

🐨 WombatKV - Object-storage-native KV cache for inference. KV blocks survive restarts and save prefill FLOPs.
⚡ myelon - HFT-grade LMAX-Disruptor multiprocess IPC over SHM & mmap. 240 ns P99 · 5.58 M ops/s · 92.6 GB/s.
🐘 YALI - Ultra-low-latency GPU comms collective. Outperforms NVIDIA NCCL P2P by 1.2-2.4x.
🪢 GPU Kernel Batcher - Batches identical GEMMs into one cuBLAS call. 90%+ fewer launches, 22% faster FP16 workloads.
⏲️ Metered Compute - 5 reference architectures for reliably metering sync and async compute.
🔍 Inference Assayer - Compiler-driven, deterministic, fast simulator lab for analyzing inference performance across models and hardware.

Technologies

Home: Codex Claude CLI macOS pi.dev Tailscale tmux AutoResearch

Languages: Rust Go Python Java CUDA English Markdown (does it matter anymore?)

Inference: vLLM SGLang TensorRT-LLM Transformers

Infra: K8s Helm Argo Docker NVIDIA Dynamo vLLM AIBrix

Accelerators: PyTorch Triton CUTLASS cuBLAS Mojo ThunderKittens

Storage: MySQL PostgreSQL Redis S3 SlateDB

Middleware: Kafka Apache Iggy NATS Redpanda ZeroMQ RabbitMQ

Cloud: AWS GCP Terraform Ansible

Build: Earthly Makefile Bash Bazel

Writings

Hashnode Medium Blogger

Acknowledgements

Inspired by

Pinned

wombatkv

Object-storage-native KV cache for LLM inference & RL. Cross-restart, cross-conversation, cross-engine via shared S3 bucket.

Rust · 13 stars · 1 fork

myelon

Ultra-low-latency, high-throughput multiprocess transport over SHM and mmap. LMAX-Disruptor-style cross-process ring substrate.

Rust · 11 stars · 1 fork

yali

Speed-of-Light SW efficiency by using ultra low-latency primitives for comms collectives

Cuda · 13 stars

ai-dynamo/dynamo

A Datacenter Scale Distributed Inference Serving Framework

Rust · 7.3k stars · 1.2k forks

sgl-project/sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

Python · 29k stars · 6.5k forks

vllm-project/aibrix

Cost-efficient and pluggable Infrastructure components for GenAI inference

Go · 4.9k stars · 600 forks

Venkat Raman Venkat2811
