Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Danultimate/traceforge

Repository files navigation

TraceForge

Agent runtime tracing + LLM-mock replay for Python. Pip install. Async-first. Self-contained reports.

PyPI version Python versions License: Apache 2.0 CI Replay

TraceForge HTML report

⚠️ Pre-release (v0.2). Tracer, replay, instrumentors, cost tracking, and the pytest plugin are implemented end-to-end and tested. APIs are stabilising — don't depend on this in production yet.

TraceForge records every LLM call, tool invocation, error, and state transition your agent makes into a typed span. The output is a replayable run.jsonl artifact plus a self-contained HTML report you can open in any browser — no server, no SaaS, no SDK lock-in. Replay mode re-executes the agent with cached LLM responses (or cached tool outputs) so you can verify the execution path without burning API calls.

pip install "traceforge-llm[anthropic]" # or [openai], [all]
traceforge init && python agent.py

Why TraceForge

TraceForge LangSmith Langfuse OpenLLMetry print()
Pip install, no account, no server partial (self-host)
Records LLM I/O + tool I/O + state per span partial
Replay with cached LLM responses (llm-mock)
Dry-run replay with cached tool outputs
Self-contained HTML report (no CDN, no server)
Auto-cost tracking per-span + per-run partial
First-class pytest plugin with snapshot testing
Auto-patches your SDK clients opt-in n/a
Cloud storage / hosted dashboard via vendor

Where TraceForge fits: when you need a local, file-based, replayable record of what your agent did — for debugging, CI regression tests, or post-hoc analysis — without sending your traces to anyone else's database. Auto-patching frameworks like LangSmith give you a UI; OpenLLMetry gives you OTel pipes. TraceForge gives you a JSONL you can git diff, an HTML you can email, and a tracer.replay() you can run offline.


60-second quickstart

1. Install and scaffold.

pip install "traceforge-llm[anthropic]"
traceforge init

traceforge init writes traceforge.yaml, a working agent.py example, and a .gitignore entry.

2. Wrap your agent.

import asyncio
from anthropic import AsyncAnthropic
from traceforge import Tracer
from traceforge.integrations.anthropic import AnthropicInstrumentor
tracer = Tracer()
async def main():
 async with tracer.run() as run:
 client = AnthropicInstrumentor(run).instrument(AsyncAnthropic())
 response = await client.messages.create(
 model="claude-haiku-4-5-20251001",
 max_tokens=256,
 messages=[{"role": "user", "content": "What is 2 + 2?"}],
 )
 print(response.content[0].text)
 run.trace.print_summary()
asyncio.run(main())

3. Run.

export ANTHROPIC_API_KEY=sk-ant-...
python agent.py

You get a Rich-formatted summary on stdout, plus a directory at .traceforge/runs/<ulid>-<run-name>/ containing manifest.json, run.jsonl, and a self-contained report.html.

Library-only API (no instrumentor)
async with tracer.run() as run:
 run.record_llm_call(
 provider="anthropic",
 model="claude-haiku-4-5",
 messages=[...],
 response="...",
 input_tokens=12, output_tokens=4, latency_ms=180,
 )
 run.record_tool_call("search", tool_input={"q": "..."}, tool_output={"hits": 3})
 run.custom("phase.done", metadata={"step": 1})
run.trace.print_summary()

Manual recording is the lower-level API the instrumentors are built on. Useful when you don't want TraceForge anywhere near the SDK call.

Decorator sugar
@tracer.trace
async def my_agent(query, _run=None):
 _run.record_tool_call("search", {"q": query}, {"hits": 3})
 return "done"
await my_agent("hello") # auto-saves trace to .traceforge/runs/
trace = tracer.last()

Reports

tracer.run() writes three files per run to .traceforge/runs/<ulid>-<run-name>/:

.traceforge/runs/01KS8E...-true-elk/
├── report.html ← open this
├── run.jsonl ← replayable artifact (one span per line + manifest)
└── manifest.json ← aggregate counts + cost + token totals

Terminal: run.trace.print_summary() prints a Rich panel + span tree:

Terminal report

HTML: self-contained, dark theme, no CDN. Open report.html in any browser (or via traceforge open <run-name>):

  • Stat strip at the top: duration, span count, LLM calls, tool calls, token totals, cost, errors
  • Span cards with type-coded left border (indigo LLM, cyan tool, slate custom, red errors)
  • Collapsible payloads for system prompts, message arrays, and tool I/O — kept folded by default so the page stays scannable
  • Per-span cost rendered inline for every LLM call

A live example sits at docs/example-report.html — open it directly to see the layout.


Replay

# llm-mock: LLM responses served from cache, tools execute live
result = await tracer.replay(trace, agent_fn, mode="llm-mock")
# dry-run: both LLM responses AND tool outputs served from cache, no network
result = await tracer.replay(trace, agent_fn, mode="dry-run")
result.print()
# Similarity: 100% · Status: ALIGNED

The replay engine builds two interceptors keyed by SHA-256 of the original messages / tool inputs, and hands them to your agent function. Your agent consults the interceptor before calling out:

async def my_agent(query, _run=None, _mock_llm=None, _mock_tool=None):
 cached = _mock_llm.get(messages) if _mock_llm else None
 if cached is not None:
 return cached
 return await client.messages.create(...)

The shipped instrumentors handle this for you — just pass mock_interceptor=_mock_llm when instrumenting:

client = AnthropicInstrumentor(_run, mock_interceptor=_mock_llm).instrument(AsyncAnthropic())

Similarity scoring. ReplayResult.similarity_score is the ratio of matching span types between original and replayed traces. Below 0.4 the replay is marked DIVERGED. See docs/replay-faq.md for why replays diverge and how to fix it.


Cost tracking

Every LLM span gets a USD cost estimate attached automatically, looked up from a built-in pricing table for the major Anthropic and OpenAI models (with longest-prefix matching, so versioned IDs like claude-haiku-4-5-20251001 resolve correctly). Aggregates flow into manifest.total_cost_usd.

async with tracer.run() as run:
 await client.messages.create(...) # instrumentor records cost
print(run.trace.manifest.total_cost_usd) # → 0.0042

Override the table for negotiated contract pricing or private models:

from traceforge import Tracer
from traceforge.pricing import ModelPrice
tracer = Tracer(pricing={
 "my-internal-model": ModelPrice(input_per_million=0.5, output_per_million=1.5),
 "claude-opus-4-7": ModelPrice(input_per_million=12.0, output_per_million=60.0),
})

Unknown models cost 0 and emit a one-shot warning so the trace still saves.


Pytest plugin

pip install traceforge-llm auto-registers a pytest plugin (via pytest11 entry point). Three fixtures appear in any test suite:

import pytest
@pytest.mark.asyncio
async def test_agent_runs_under_budget(tracer, tf_assert, tf_snapshot):
 async with tracer.run() as run:
 await my_agent("hello", _run=run)
 tf_assert(
 run.trace,
 has_span="search",
 llm_calls=1,
 max_cost_usd=0.01,
 max_tokens=2000,
 )
 # Golden-trace snapshot — fails if the span-type sequence drifts.
 tf_snapshot.assert_match(run.trace, "agent_v1")
Fixture What it gives you
tracer A non-auto-saving Tracer per test (no .traceforge/ cruft)
tf_assert(trace, ...) One-line common assertions: has_span, has_span_type, no_errors, llm_calls, tool_calls, max_cost_usd, max_tokens, min_spans
tf_snapshot.assert_match(trace, name) Records the trace on first run, then asserts span-type-sequence similarity ≥ 0.8 on every subsequent run

Snapshots live in tests/__tf_snapshots__/<name>.jsonl by default — override with --tf-snapshot-dir. Commit them like you commit any other golden file. Refresh after intentional changes with:

pytest --tf-update-snapshots

Instrumentors

Provider Class Wraps
Anthropic traceforge.integrations.anthropic.AnthropicInstrumentor client.messages.create
OpenAI traceforge.integrations.openai.OpenAIInstrumentor client.chat.completions.create
LangChain traceforge.integrations.langchain.LangChainInstrumentor manual record_chain_step / record_llm_step

LangChain is intentionally manual — auto-patching is fragile across versions, so TraceForge ships a bridge helper you call from your callback handler.


CLI

Command Purpose
traceforge init Scaffold traceforge.yaml, agent.py example, .gitignore entry
traceforge list Table of local runs (newest first)
traceforge open <id> Open a run's HTML report in your browser
traceforge show <id> Print a run summary to the terminal

<id> accepts a ULID prefix or the human-readable run name (brave-salmon).


Non-goals

  • No auto-patching by default. Instrumentors are opt-in. Your code stays explicit about what's being traced.
  • No time-travel debugging. TraceForge records and replays; it does not pause your agent mid-flight.
  • No cloud storage. Traces live in .traceforge/runs/. Bring your own object store if you want central retention.
  • No built-in eval scoring. TraceForge captures the run; pair it with evalkit (or your own scorer) for grading.

Status

Feature Status
Async + sync context manager, @tracer.trace decorator ✅ shipped
Anthropic / OpenAI instrumentors ✅ shipped
LangChain bridge (manual) ✅ shipped
File store: manifest.json + run.jsonl + report.html ✅ shipped
Self-contained HTML report (no CDN) ✅ shipped
LLM-mock replay ✅ shipped
Dry-run replay (tool cache) ✅ shipped
Cost tracking (per-span + manifest total) ✅ shipped (v0.2)
Custom pricing tables ✅ shipped (v0.2)
Pytest plugin (tracer, tf_assert) ✅ shipped (v0.2)
Trace snapshot testing (tf_snapshot) ✅ shipped (v0.2)
Streaming + tool-use in instrumentors deferred
traceforge diff (span-level diff) deferred
Slim mode (--slim) deferred
LangGraph auto-instrumentation manual only
Cloud storage backends non-goal

Track progress and propose features via GitHub Issues.


Docs

License

Apache 2.0.

Packages

Contributors

Languages

AltStyle によって変換されたページ (->オリジナル) /