prefix-caching

Here are 10 public repositories matching this topic...

ruipeterpan / marconi

Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]

llm-inference mamba-state-space-models prefix-caching hybrid-llm

Updated Mar 5, 2025
Python

rithulkamesh / continuum

Star 5

Unified execution runtime for LLM and ML programs.

machine-learning deep-learning transformers pytorch agents execution-engine kv-cache dataflow-graph llm generative-ai ai-runtime workflow-optimization program-optimization prefix-caching agent-runtime llm-runtime compiler-runtime cross-call-caching

Updated May 1, 2026
C++

theogravity / dual-rtx-6000-blackwell-Gemma-4-31B-IT-NVFP4

Star 5

Optimized vLLM setup, run in Docker, for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.

docker amd cuda gemma blackwell vllm llm-inference am5 speculative-decoding fp4 prefix-caching multi-token-prediction nvfp4 rtx-6000 gemma4 tensor-parallel

Updated May 10, 2026
Shell

AnubhabBanerjee / swarmkv

Star 3

C++ inference runtime for llama.cpp that shares a single document KV-cache prefill across multiple analytical branches via snapshot fan-out. Eliminates redundant GPU compute and dramatically reduces TTFT in DAG-based multi-agent pipelines.

multi-agent dag rag kv-cache llm-inference prefix-caching

Updated Jun 10, 2026
C++

zxuhan / llm-router

Star 2

Cache-aware router for OpenAI-compatible LLM servers, in Go. Per-worker radix trees route each request to the worker holding its KV prefix. Validated on 4x A100 + vLLM and Apple Silicon + llama.cpp.

go inference load-balancer kv-cache llm llama-cpp vllm prefix-caching

Updated May 24, 2026
Go
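
The routing idea in the llm-router description above (send each request to the worker that already holds the longest matching KV prefix) can be illustrated with a minimal sketch. This is not the repository's code: the class names `PrefixTrie` and `CacheAwareRouter`, the uncompressed token-level trie standing in for a radix tree, and the request-count tie-breaker are all assumptions made for illustration, and the sketch is in Python rather than Go.

```python
from typing import Dict, List, Sequence


class PrefixTrie:
    """Token-level trie of the prompts one worker has served (hypothetical)."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, tokens: Sequence[int]) -> None:
        # Walk down the trie, creating child nodes as needed.
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens: Sequence[int]) -> int:
        # Length of the longest stored prefix shared with `tokens`.
        node, depth = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            depth += 1
        return depth


class CacheAwareRouter:
    """Keeps one trie per worker and routes by longest cached prefix."""

    def __init__(self, workers: List[str]) -> None:
        self.tries: Dict[str, PrefixTrie] = {w: PrefixTrie() for w in workers}
        self.load: Dict[str, int] = {w: 0 for w in workers}

    def route(self, tokens: Sequence[int]) -> str:
        # Prefer the longest prefix match; break ties by fewest routed requests.
        best = max(
            self.tries,
            key=lambda w: (self.tries[w].match_len(tokens), -self.load[w]),
        )
        # Assume the chosen worker now caches this prompt's prefix.
        self.tries[best].insert(tokens)
        self.load[best] += 1
        return best


router = CacheAwareRouter(["w0", "w1"])
first = router.route([1, 2, 3])      # no cache anywhere: goes to the first worker
second = router.route([9, 9])        # still no match: goes to the less loaded worker
third = router.route([1, 2, 3, 4])   # shares a 3-token prefix with the first request
```

In this sketch the third request lands on the same worker as the first, because that worker's trie already contains the shared prefix. A real router would also need eviction tracking and a load threshold so that a popular prefix does not overload one worker.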

developertogo / velo-core

Star 2

A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.

metal gpu-acceleration systems-programming apple-silicon openai-api tensor-parallelism llm-inference speculative-decoding paged-attention continuous-batching prefix-caching disaggregated-serving

Updated Jun 13, 2026
Rust

dr-gareth-roberts / context-engineering

Star 2

Context engineering toolkit for LLMs — pack, cache, debug, red-team, and orchestrate context windows. Council of Experts, adversarial testing, immune system, context compiler, drift detection, multi-agent entanglement. TypeScript + Python.

python typescript ai multi-agent rag llm prompt-engineering llm-tools context-window prefix-caching context-engineering adversarial-testing token-budget council-of-experts context-packing

Updated Jun 12, 2026
Python

superAttention / llm-prefix-cache-analysis

Star 0

Benchmarking LLM prefix-cache eviction policies against Tree-Constrained Belady on ShareGPT traces.

radix-tree lru-cache page-replacement-algorithm kv-cache llm-serving vllm llm-inference sglang prefix-caching

Updated May 2, 2026
HTML

derLogik / qwen3-prefix-cache-bench

Star 0

Reproducible Qwen3-1.7B prefix cache benchmark on RTX 3070 Laptop (8GB). Hand-written reference inference loop + vLLM v1 comparison.

cuda gpu-performance kv-cache vllm llm-inference qwen prefix-caching

Updated May 17, 2026

rohanarcot / ECUA-OSWorld-OpenCUA

Star 0

Edge-optimized OpenCUA-7B computer-use agent evaluated on OSWorld, exploring systematic vLLM inference optimizations across CPU and GPU, including precision tuning, image history management, speculative decoding, and prefix caching.

quantization agents multimodal inference-optimization edge-ai vllm speculative-decoding gui-agents prefix-caching osworld opencua

Updated Dec 18, 2025
Python
