
Venkat Raman Venkat2811

🎯
Focusing
staff engineer, oss, distributed systems, low latency, inference

Organizations

@wso2-incubator

Venkat2811 /README.md

Mechanical Sympathy is All You Need

Hi 👋, I'm Venkat!

Twitter LinkedIn GitHub

Built systems and scaled them from 5k to 250k RPS without breaking a sweat.

Got into model serving and inference; enjoyed solving cold start, intelligent routing, and GPU cluster utilization. Did a bit of RAG & agents infra. Currently in ML infra: training, inference, comms collectives, storage, compiler backends, custom kernel optimizations, and researching novel techniques.

High-agency individual deep in agentic-engineering mode. AI tools have enabled me to work on infra end to end, from user-facing APIs to tensors to metal. Always looking to maximize my learning curve 📈

ʕ•ᴥ•ʔ venkat.systems


Highlights


Projects

  • 🐨 WombatKV - KV blocks survive restarts, save prefill flops - Object-storage-native KV cache for Inference.
  • ⚑ myelon - HFT-grade LMAX-Disruptor multiprocess IPC over SHM & mmap. 240 ns P99 · 5.58 M ops/s · 92.6 GB/s.
  • 🐘 YALI - Ultra-low-latency GPU comms collective. Outperforms NVIDIA NCCL P2P by 1.2 - 2.4x.
  • 🪢 GPU Kernel Batcher - Batching identical GEMMs into one cuBLAS call - 90%+ fewer launches, 22% faster FP16 workloads.
  • ⏲️ Metered Compute - 5 reference architectures for reliably metering sync and async compute.
  • πŸ” Inference Assayer - Compiler driven models <> HWs inference perf analyzing deterministic fast simulator lab.

Technologies


Writings

Hashnode Medium Blogger

Acknowledgements

Inspired by



Pinned

  1. wombatkv (Public)

    Object-storage-native KV cache for LLM inference & RL. Cross-restart, cross-conversation, cross-engine via shared S3 bucket.

    Rust · 13 stars · 1 fork

  2. myelon (Public)

    Ultra-low-latency, high-throughput multiprocess transport over SHM and mmap. LMAX-Disruptor-style cross-process ring substrate.

    Rust · 11 stars · 1 fork

  3. yali (Public)

    Speed-of-Light SW efficiency by using ultra-low-latency primitives for comms collectives.

    Cuda · 13 stars

  4. ai-dynamo/dynamo (Public)

    A Datacenter Scale Distributed Inference Serving Framework

    Rust · 7.3k stars · 1.2k forks

  5. sgl-project/sglang (Public)

    SGLang is a high-performance serving framework for large language models and multimodal models.

    Python · 29k stars · 6.5k forks

  6. vllm-project/aibrix (Public)

    Cost-efficient and pluggable Infrastructure components for GenAI inference

    Go · 4.9k stars · 600 forks
