Product @ Pre6 AI · I build production-grade AI agent systems.
Product by title, builder by craft — I design AI products and ship the engineering behind them: multi-agent orchestration, agent evaluation, and AI safety infrastructure.
I care about the unglamorous half of AI products: the part that decides whether they survive contact with real users. Most demos route a single LLM call. Production systems need orchestration, evaluation, safety gates, and observability. Closing that gap is what I build for.
- Multi-agent orchestration — supervisor/specialist architectures with typed state, tool binding, and streaming traces.
- Agent reliability — measurable, auditable evaluation of agent runs across reliability, safety, latency, and cost.
- LLM safety — scanning retrieval context for prompt injection, secret leakage, PII, and exfiltration before it reaches a model.
- Developer tooling — sharp CLIs that turn fuzzy engineering signals into decisions teams can act on.
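
To make the LLM safety bullet concrete, here is a minimal sketch of a deterministic context scanner. The three regex rules, the `Finding` and `ScanResult` types, and the `scan_chunk` function are hypothetical illustrations, not the rules or API of any project listed here; a real detector needs much broader coverage.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately tiny rule set: (rule name, pattern, action).
RULES = [
    ("prompt_injection",
     re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
     "quarantine"),
    ("secret", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "redact"),
    ("pii_email", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "redact"),
]

# Higher number wins when several rules fire on the same chunk.
SEVERITY = {"allow": 0, "redact": 1, "quarantine": 2}


@dataclass(frozen=True)
class Finding:
    rule: str
    span: tuple  # (start, end) offsets in the original text


@dataclass(frozen=True)
class ScanResult:
    decision: str    # "allow", "redact", or "quarantine"
    text: str        # text that is safe to forward; empty when quarantined
    findings: tuple  # every Finding, in rule order


def scan_chunk(text: str) -> ScanResult:
    """Scan one retrieved chunk before it is placed in a model prompt."""
    findings = []
    decision = "allow"
    safe_text = text
    for name, pattern, action in RULES:
        hits = [Finding(name, m.span()) for m in pattern.finditer(text)]
        if not hits:
            continue
        findings.extend(hits)
        if action == "redact":
            safe_text = pattern.sub(f"[REDACTED:{name}]", safe_text)
        if SEVERITY[action] > SEVERITY[decision]:
            decision = action
    if decision == "quarantine":
        safe_text = ""  # never forward a quarantined chunk
    return ScanResult(decision, safe_text, tuple(findings))
```

Because the scanner is pure regex matching, the same chunk always yields the same decision, which makes the gate easy to test and audit without calling a model.
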
| Project | What it is | Stack | Links |
|---|---|---|---|
| nabla | A reverse-mode autograd engine you can watch think — the algorithm behind PyTorch/JAX, from scratch, with an interactive visualizer that animates backprop through the computation graph. Gradient-checked to 1e-10. | Python · CI | Live Demo · Code |
| mosaic | A byte-pair-encoding tokenizer you can see — train a real BPE on your own text and watch any string break into a mosaic of tokens. Zero-dependency, lossless round-trips. | Python · CI | Live Studio · Code |
| winnow | Budget-aware context compression for RAG and agents — BM25 relevance + MMR diversity packs the highest-signal context into a token budget. Deterministic, zero runtime deps, no API keys, with a reproducible benchmark. | Python · CI | Live Demo · Code |
| warren | From-scratch HNSW approximate-nearest-neighbor index — the graph algorithm behind vector databases. Recall@10 of 0.99+ while scanning ~5% of the database, measured against exact search. | Python · NumPy · CI | Live Demo · Code |
| stencil | Constrained decoding — compiles a JSON Schema to a DFA and masks an LLM's tokens so invalid output is impossible. 100% valid by construction vs ~0% unconstrained. | Python · CI | Live Demo · Code |
| mend | Repairs malformed JSON from LLMs into valid JSON — fences, single quotes, trailing commas, truncated output. Recovers 16/16 real-world defects vs stdlib's 0. | Python · CI | Live Demo · Code |
| gemma4-multi-agent | Multi-agent system — a Supervisor routes work across 4 specialist agents with live reasoning traces and sandboxed tool execution. | Python · LangGraph · Gemini · Streamlit | Code |
| agent-evals-lab | Evaluation workbench for agent reliability — typed scoring engine, policy rules, regression detection, and a trace-inspection dashboard. | TypeScript · React · CI | Live Demo · Code |
| verdict | Adversarial LLM red-teaming platform — runs PAIR, Crescendo, and injection attacks against any model, then reports attack-success-rate metrics with per-category breakdowns and HTML reports. | Python · CI | Code |
| rag-safety-gateway | AI security gateway that scans RAG context for prompt injection, secrets, PII, and exfiltration risk, producing deterministic allow/redact/quarantine decisions. | TypeScript · React · CI | Live Demo · Code |
| hermes | Test-time compute scaling engine — gives any LLM o1-style reasoning search via Process Reward Models, MCTS, and beam search. | Python · CI | Code |
Every featured project ships with tests, CI, and documentation — clone, run, and review the design in minutes.
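
As an illustration of the idea behind the nabla row, the following is a minimal sketch of reverse-mode autodiff on scalars with a finite-difference gradient check. It is not nabla's code: the `Value` class, the test function `f`, and the `numeric_grad` helper are assumptions made for this example, and it supports only addition, multiplication, and `tanh`.

```python
import math


class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes this node's grad to its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            self.grad += (1.0 - t * t) * out.grad

        out._backward_fn = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule
        # from the output back toward the inputs.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


def f(x, w, b):
    return (x * w + b).tanh()


def numeric_grad(fn, args, i, eps=1e-6):
    """Central finite difference with respect to argument i."""
    hi, lo = list(args), list(args)
    hi[i] += eps
    lo[i] -= eps
    return (fn(*map(Value, hi)).data - fn(*map(Value, lo)).data) / (2 * eps)


inputs = [0.5, -1.5, 0.25]
x, w, b = map(Value, inputs)
f(x, w, b).backward()
analytic = [x.grad, w.grad, b.grad]
numeric = [numeric_grad(f, inputs, i) for i in range(3)]
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Comparing `analytic` against `numeric` is the usual way to gain confidence in a hand-written backward pass; the tolerance that is achievable depends on the step size and on floating-point rounding.
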
repo-pulse — one of my CLIs, generating a real engineering-health report with no keys or config.
- Typed contracts first → domain models before logic, so behavior is auditable
- Deterministic by default → scoring and decisions reproducible without a live model
- Measurable, then pretty → evals and telemetry before dashboards
- Reviewable in 60 seconds → clone, run, understand; no API keys to start
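
The first two principles above can be sketched in a few lines: frozen dataclasses act as the typed contract, and a pure function scores a recorded run without touching a live model. Everything here is hypothetical and chosen only for illustration, including the `AgentRun` fields, the equal weighting, and the default budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """Hypothetical record of one finished agent run."""
    steps_total: int
    steps_failed: int
    policy_violations: int
    latency_ms: float
    cost_usd: float


@dataclass(frozen=True)
class Scorecard:
    reliability: float
    safety: float
    latency: float
    cost: float

    @property
    def overall(self) -> float:
        # Equal weights are an assumption, not a recommendation.
        total = self.reliability + self.safety + self.latency + self.cost
        return round(total / 4, 4)


def _budget_score(actual: float, budget: float) -> float:
    """1.0 at or under budget, falling linearly to 0.0 at twice the budget."""
    if actual <= budget:
        return 1.0
    return max(0.0, 1.0 - (actual - budget) / budget)


def score_run(run: AgentRun,
              latency_budget_ms: float = 2000.0,
              cost_budget_usd: float = 0.05) -> Scorecard:
    """Pure function: the same run and budgets always give the same card."""
    if run.steps_total:
        reliability = 1.0 - run.steps_failed / run.steps_total
    else:
        reliability = 0.0
    safety = 1.0 if run.policy_violations == 0 else 0.0
    return Scorecard(
        reliability=reliability,
        safety=safety,
        latency=_budget_score(run.latency_ms, latency_budget_ms),
        cost=_budget_score(run.cost_usd, cost_budget_usd),
    )
```

Since `score_run` has no hidden state, a stored run can be re-scored later to check for regressions, and two scorecards can be compared with plain equality.
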
Python · TypeScript · LangGraph · LangChain · React · Streamlit · Google Gemini · OpenAI · pytest · Vitest · GitHub Actions · uv
Open to conversations on AI agent engineering, evals, and LLM safety.