Releases: antoinezambelli/forge

v0.7.5: reasoning replay is a bounded policy, default none

12 Jun 01:47

@antoinezambelli antoinezambelli

v0.7.5

5d01dfd

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified

Learn about vigilant mode.

v0.7.5: reasoning replay is a bounded policy, default none Latest

Latest

Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new reasoning_replay knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is none. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.

Added

reasoning_replay {full, keep-last, none} on WorkflowRunner(reasoning_replay=...) and the proxy (--reasoning-replay). full replays every captured reasoning block (the historical behavior), keep-last only the most recent, none keeps reasoning out of backend-facing history entirely. Serialization-only: reasoning is still captured and still surfaces in on_message and internal history. In OpenAI-compatible proxy responses, keep-last exposes current reasoning as reasoning_content rather than assistant content, so clients that preserve reasoning fields can replay just the latest block. See ADR-017.
Reasoning-replay eval grid (eval_results_v0.7.5.jsonl, a new eval generation): the full 8–14B lineup re-swept across all three policies ×ばつ both ablations ×ばつ native/prompt — ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry :keep-last / :full tags (untagged = none), the dashboard gains a Reasoning Replay filter, the report a --reasoning-replay filter, and a dedicated reasoning-replay view compares policies per config. A wire-level counter (reasoning_wire) validates each policy's on-wire behavior (none → exactly 0 replayed reasoning across every run).
Anthropic extended thinking — AnthropicClient(thinking=...) — request-side extended-thinking config (e.g. {"type": "adaptive"}). When set, a forced tool_choice is suppressed (the API requires auto with thinking on) and max_tokens is raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking — all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite; Haiku does not support adaptive thinking and stays non-thinking.
Anthropic prompt caching — AnthropicClient(prompt_caching=True) — marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema). TokenUsage gains generic cache_creation_input_tokens / cache_read_input_tokens counters, and eval cost accounting prices cache writes (×ばつ) and reads (×ばつ) at their actual rates.

Changed

Captured reasoning is no longer replayed to the backend by default. Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to reasoning_replay="full"); the default is now "none". On the published eval suite, none is statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where none and keep-last are indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration: --reasoning-replay full (proxy) or WorkflowRunner(reasoning_replay="full") restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only under full — forge does not synthesize signed Anthropic thinking blocks.

Assets 2

v0.7.4: malformed-args → tool-error channel + 32GB eval tier

03 Jun 05:48

@antoinezambelli antoinezambelli

v0.7.4

bd99f4d

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified

Learn about vigilant mode.

v0.7.4: malformed-args → tool-error channel + 32GB eval tier

[0.7.4] — 2026年06月03日

Malformed tool-call arguments now self-correct on the tool-error channel, and the eval suite gains its first model-size upgrade — a 32GB tier (Qwen3.5 / 3.6 27–35B, Nemotron-3 Nano, Mistral-Small-3.2) surfaced in the dashboard alongside the existing 8–14B lineup.

Added

Proxy --max-tool-errors (default 2) — bounds consecutive tool-argument errors per request, mirroring the WorkflowRunner budget. Threaded through ProxyServer and the HTTP handler.
32GB model tier in the published eval and dashboard: Mistral-Small-3.2 24B, Qwen3.5 27B / 35B-A3B, Qwen3.6 27B / 35B-A3B, Nemotron-3 Nano 30B-A3B (moved Unpublished → Current in the Model Registry).
Eval-generation tracking in the dashboard. Results gathered against different code states fold into a single view, deduped to the newest generation per config. Runs not yet re-swept (e.g. the Anthropic ablation) are carried forward and superscript-badged with a commit/date legend; Retired-tier models are carried forward but hidden behind a Show retired toggle.

Changed

Malformed tool-call arguments ride the tool-error channel. A model that emits a structurally valid call whose arguments are unparseable or not an object is now corrected via a tool-error result (role="tool", anchored to its tool_call_id) draining max_tool_errors, uniformly across all OpenAI-shape clients and all three integration modes (WorkflowRunner, proxy, Guardrails facade). This supersedes 0.7.3's "malformed args drive a retry nudge" behavior. The change is a native-mode conditioning bet — a small model plausibly self-corrects better on the channel it was pretrained on than via a trailing user nudge; in prompt mode the tool role is downgraded to a user message, so behavior there is unchanged. See ADR-016.
Guardrails.check() gains action="tool_error" for tool-call faults (unknown tool, malformed args) so middleware loops account for them on the tool channel. No consumers depended on the prior action vocabulary.
ToolCall / TextResponse are now plain dataclasses (args: Any); arg-shape validation moved to ResponseValidator. Attribute access and keyword construction are unchanged — but the pydantic .model_* API on these two exported types is gone, and construction no longer raises on a non-dict args. Only affects callers that serialized these objects via pydantic or relied on construction-time validation.

Fixed

Non-object tool args no longer crash the parser. Previously arguments decoding to a list / scalar / null raised at ToolCall construction; it is now caught at validation and routed to the tool-error channel. StepTracker.check_prerequisites additionally guards against a non-dict args reaching a direct dispatch.

Assets 2

v0.7.3: Native-first proxy

01 Jun 06:35

@antoinezambelli antoinezambelli

v0.7.3

92cbdc3

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified

Learn about vigilant mode.

v0.7.3: Native-first proxy

[0.7.3] — 2026年06月01日

Native-first proxy. With native function calling now well-supported across modern local models, the proxy defaults to — and is optimized for — native tool calling, forwarding the client's OpenAI tools / messages to the backend verbatim. Prompt-injection remains available as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template, but it is no longer the default path. This release also folds in the OpenAI-compatible client and several proxy / eval fixes that landed on main since 0.7.2.

Added

OpenAICompatClient for arbitrary OpenAI-compatible endpoints. #89 (thanks @lucasgerads).
--backend-timeout proxy option — configurable backend response timeout (default 300s). #91.
--backend-capability {native,prompt} proxy flag — native (default) forwards the client's tools / messages verbatim to a function-calling-capable backend; prompt opts into prompt-injection for non-FC llama.cpp / llamafile backends. Declared once at startup and frozen — never probed or switched mid-stream.
Effective backend_timeout logged at proxy startup.

Changed

BREAKING — --mode {native,prompt} renamed to --backend-capability {native,prompt} (and ProxyServer(mode=...) → ProxyServer(backend_capability=...)). --mode collided with the proxy's managed / external deployment mode; the new name states what it controls — the backend's tool-calling protocol — and reflects that the choice is declared once and frozen, never probed at runtime. There is no deprecation alias (--mode was introduced in 0.7.1). Migration: --mode native → drop it (native is the default) or --backend-capability native; --mode prompt → --backend-capability prompt.
Native function calling is now transparent passthrough — the proxy forwards the client's OpenAI tool / message payloads to the backend verbatim instead of round-tripping them through forge's internal ToolSpec representation, which dropped schema detail.
vLLM model identity consolidated to a single source of truth (the wire model_path and the registry model key are now set together). #75.
The prompt capability is now rejected loudly for ollama / vllm / anthropic backends — previously it was silently ignored for ollama.
stream_options is excluded from proxy passthrough. #94 (thanks @alexandergunnarson).

Fixed

Consistent malformed-tool-call / unexpected-response handling across the OpenAI-shape clients — malformed model tool args drive a retry (TextResponse) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so a broken provider envelope fails loud.
Guardrails.record() no longer drops tool args for prerequisite tracking. #72 (thanks @hobostay).
Deprecated asyncio API replaced; proxy server input validation added. #71 (thanks @hobostay).
Proxy input hardening, non-blocking Ollama stop, client shutdown, and loud arg decode. #86.
Dead code and a fragile variable reference cleaned up in LlamafileClient. #73 (thanks @hobostay).

Removed

Runtime auto function-calling mode in LlamafileClient — the proxy never used it, and its mid-request probe-and-switch behavior is replaced by the declared-and-frozen --backend-capability.

Contributors

lucasgerads, alexandergunnarson, and hobostay

Assets 2

v0.7.2: vLLM backend support

24 May 20:17

@antoinezambelli antoinezambelli

v0.7.2

0f3e6fd

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified

Learn about vigilant mode.

v0.7.2: vLLM backend support

[0.7.2] — 2026年05月24日

vLLM backend support — serve AWQ/GPTQ and other vLLM-hosted models behind forge's guardrails, in both proxy modes and via WorkflowRunner.

Added

vLLM backend (VLLMClient). OpenAI-compatible client for a vLLM server, consuming vLLM's server-side tool_calls and reasoning (vLLM 0.21) fields. Native function calling only — vLLM parses tools server-side via --enable-auto-tool-choice --tool-call-parser, so there is no prompt-injection mode. Exported from forge and forge.clients.
vLLM in managed + external proxy modes. --backend vllm --model-path <dir|hf-repo-id> launches and manages a vLLM server; --backend-url <url> --backend vllm proxies an externally managed one. setup_backend() / ServerManager gain a model_path parameter (the vLLM identity, distinct from gguf_path).
vLLM served-model-name discovery in external mode. vLLM validates the request model field against its --served-model-name and 404s on a mismatch (unlike llama.cpp, which ignores the field). The proxy discovers the served name from /v1/models instead of sending a placeholder. #74 (thanks @srinathh).
vLLM section in Backend Setup covering the server flags and VLLMClient usage.

Changed

Proxy managed mode now delegates to setup_backend() instead of reimplementing the server-start/budget dance, so every managed backend (including vLLM) shares one path. No public API change — ProxyServer and the forge.proxy CLI keep their v0.7.1 signatures, with model_path / --model-path and the vllm backend added.
External mode fails fast when a backend reports no context length and no --budget-tokens is set, instead of silently falling back to an 8192-token budget that could truncate context. Anthropic-protocol downstreams are unaffected.

Known limitations

The vLLM backend is unit-validated but was not exercised against a live vLLM server in this release cycle. Its client and server-management code carry full unit coverage, and the proxy's protocol translation is verified end-to-end against llama.cpp (the proxy layer is backend-agnostic). scripts/integration_test_proxy.py --vllm-url <url> runs the full request battery against a real vLLM server when one is available.

Contributors

@srinathh

srinathh

Assets 2

v0.7.1: Anthropic Messages API support for Claude Code + proxy hardening

24 May 09:29

@antoinezambelli antoinezambelli

v0.7.1

f442399

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified

Learn about vigilant mode.

v0.7.1: Anthropic Messages API support for Claude Code + proxy hardening

[0.7.1] — 2026年05月24日

Proxy hardening: forge now works with Claude Code. First PyPI release to include the Docker, model-pass-through, and token-usage work that landed on main after v0.7.0.

Added

Anthropic Messages API on the proxy (POST /v1/messages). Point Claude Code — or any Anthropic-protocol client — at a forge-guarded model. Two downstream shapes: Path 2 (default, --backend-protocol openai) translates Anthropic ↔ OpenAI for local llama.cpp / Ollama and emits Anthropic SSE back; Path 1 (--backend-protocol anthropic, external mode) forwards to an Anthropic-shape downstream (LiteLLM, the Anthropic API, a self-hosted proxy), passing unknown fields through verbatim. Adds a base_url kwarg on AnthropicClient. See the new "Using forge with Claude Code" section in the User Guide.
--mode {native,prompt} proxy flag — run prompt-injected function-calling through the proxy for OpenAI-compatible backends that lack a native tool-calling template, not just native FC. Closes #53.
Real token-usage reporting through the proxy — responses carry actual prompt/completion counts (previously hardcoded zeros), in both OpenAI (usage.prompt_tokens/...) and Anthropic (usage.input_tokens/output_tokens) shapes, streaming and non-streaming. #81 (thanks @mhajder).
Per-request model-name pass-through for external backends — the proxy honors the inbound model against external OpenAI-compatible backends. #80 (thanks @mhajder).
Dockerfile for running the proxy as a container. #79 (thanks @mhajder).

Changed

last_usage unified on slot-keyed {slot_id: TokenUsage} across all clients. AnthropicClient previously stored a flat {input_tokens, output_tokens} dict; it now uses the slot-0 convention LlamafileClient / OllamaClient already follow, so usage extraction has one contract.
Inbound model rides the proxy's passthrough/extras channel rather than the sampling map — a cleaner replacement for the #80 mechanism that keeps model out of sampling.

Fixed

Proxy no longer hard-imports the optional anthropic SDK at load. A plain forge-guardrails install (without the [anthropic] extra) can now start the proxy for local / OpenAI-shape backends; the SDK is imported lazily and only required for --backend-protocol anthropic.
Proxy router tolerates query strings. Requests like Claude Code's POST /v1/messages?beta=true route correctly instead of returning 404.
eval_runner token accounting for local backends — was silently counting zero tokens because it read the flat last_usage keys; now reads the slot-keyed TokenUsage (fixed by the unification above).

Known limitations

cache_control is not preserved on Path 2. OpenAI Chat Completions has no analog, so prompt-cache hints are dropped when the downstream is a local OpenAI-shape backend. Path 1 (Anthropic-shape downstream) preserves cache_control on clean turns. See ADR-015.
Prompt-mode multi-turn tool convergence is model-dependent. Some models reliably consume prompt-injected tool results across turns; others re-call the same tool. Native FC is the more robust default for heavy multi-turn tool use (e.g. Claude Code).

Contributors

@mhajder

mhajder

Assets 2

v0.7.0: lineup refresh + tool-error channel + docs + MODEL_REGISTRY

22 May 06:34

@antoinezambelli antoinezambelli

v0.7.0

655e1f6

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified

Learn about vigilant mode.

v0.7.0: lineup refresh + tool-error channel + docs + MODEL_REGISTRY

[0.7.0] — 2026年05月22日

Added

Granite 4.1 8B + Gemma-4-E4B + phi-4 — added to the eval lineup. Granite 4.1 mirrors the IBM greedy-decoding convention pending formal published sampling guidance; phi-4 has no formal sampling recommendation and falls through to backend defaults.
_PROMPT_ONLY_MODELS in batch_eval — skips native FC for models lacking training for the OpenAI tool_calls schema (currently: phi-4, verified via curl 2026年05月14日).
_NO_RECOMMENDED_SAMPLING_MODELS in batch_eval — runs recommended_sampling=False for models without formal sampling guidance from any official source, so the eval doesn't raise UnsupportedModelError on them.
MODEL_REGISTRY.md — new doc enumerating every model forge knows about, classified as Current (in v0.7.0 eval), Retired (cut from current eval), or Unpublished (sampling params staged, no published eval). Sampling values, source links, identity-key conventions.
Versioned eval datasets — committed dataset files renamed to eval_results_vX.Y.Z.jsonl. Prior versions kept in LFS for reproducibility.
report.py --html + --markdown flags surfaced in README and EVAL_GUIDE examples.

Changed

Step enforcement + prerequisite violations surface on the tool channel. Previously, WorkflowRunner emitted these as trailing role="user" nudges after the assistant tool_call. v0.7.0 emits one role="tool" message per blocked call with [StepEnforcementError] / [PrereqError] prefixes — the canonical "tool call failed, try again" wire shape OpenAI-tool-trained models are pretrained on. Surfaced by v4 forge-code dogfooding (gpt-oss-120b reliably exhausted prerequisite-violation budget under the old shape).
Unknown-tool retry on the tool channel. Same refactor applied to ResponseValidator unknown-tool path: [UnknownToolError] tool-error reply instead of a user nudge.
Eval lineup refresh — cut Llama 3.1 8B, Mistral 7B v0.3, Mistral Nemo 12B, Granite 4.0 (h-micro / h-tiny). All scored bare <30% on the v0.6.0 dataset — too weak to be informative, superseded by Ministral-3 / Granite 4.1 / phi-4. Sampling defaults retained in sampling_defaults.py for backward compatibility (see MODEL_REGISTRY Retired tier).
Eval dataset — eval_results_v0.7.0.jsonl (96,200 rows, 74 cells; rig-01). Apples-to-apples delta on 21 common configs vs v0.6.0: +0.7pt overall, -1.2pt advanced_reasoning — both within CI. Published-leaderboard floor lifts +16.9pt via composition (weak-model cuts).
Dashboard + markdown views regenerated against v0.7.0 dataset. Top of leaderboard reshuffled: Ministral-3 14B Reasoning Q4 LS/N now #1 at 84.5% (was Ministral-3 8B Instruct Q8 LS/P at 86.5% in v0.6.0; now #3 at 84.4%).
MODEL_GUIDE rewrite — trimmed to opinions + rationale (333 → 145 lines). Full leaderboard, OG-18 100% list, hard suite top-5, models-to-avoid tables moved to the dashboard / markdown views. Sampling-parameters and "backend matters" sections retained. Native-vs-prompt heuristic corrected: not workload-driven, sensitivity is per-family.
ARCHITECTURE rebuild — cut signature restating (1701 → 165 lines); the doc now covers design principles, surface modes, guardrail rationale, compaction priority rationale, respond-tool rationale, sampling opt-in semantics. Source is authoritative for class signatures; WORKFLOW.md owns the diagrams; ADRs own past decisions.
BACKEND_SETUP rewrite — cut model-pick prose, Windows-specific install steps, Ollama Modelfile tutorial, llamafile distribution explainer, per-backend "run the eval" subsections, VRAM tables (360 → 135 lines). Per-backend section now: boot command + flag table + curl smoke-test + forge client snippet. Added Anthropic section using pip install "forge-guardrails[anthropic]".
README opener — leads with the contract (any tools, any order; structure opt-in via required_steps/prerequisites/terminal_tool) before the eval pitch. New "What forge isn't" (not an agent orchestrator, not a coding harness) preempts the conflations that surfaced on HN. Three-ways list reordered with proxy first (most popular entry point). Quick Start swapped from Ollama to llama-server.

Fixed

WorkflowRunner docstring + tree — added missing retry_nudge kwarg, cancel_event parameter on run(), PREREQUISITE_NUDGE + CONTEXT_WARNING message types, MaxIterationsError / PrerequisiteError / StepEnforcementError / WorkflowCancelledError in Raises lists across docs.
CompactStrategy + ContextManager signatures in docs — trigger_tokens → budget_tokens (the strategy owns its own threshold logic now); compact_threshold → context_thresholds + on_context_threshold callbacks.
LlamafileClient constructor docs — added missing sampling kwargs (top_p, top_k, min_p, repeat_penalty, presence_penalty), chat_template_kwargs, slot_id.
MODEL_FAMILIES in report.py — added entries for granite-4.1-8b (Q4/Q8) and phi-4-Q4_K_M so cross-backend rollups in by-backend.md group these new models correctly.
WORKFLOW.md agentic-loop flowchart — node names + edges updated to reflect the tool-error wire shape (STEP_TOOL_ERROR, PREREQ_TOOL_ERROR, UNKNOWN_TOOL_ERROR); compaction-priority table fixed (step_nudge and prerequisite_nudge are role=tool, retry_nudge remains role=user).
Stale bfcl/ reference removed from WORKFLOW.md module diagram (directory was removed pre-v0.7.0; ADR-009 retained as historical artifact).

Known limitations

Anthropic numbers not re-measured in v0.7.0. The Anthropic ablation matrix (~272ドル to run) was not re-executed for v0.7.0. Numbers cited in any v0.7.0 doc are from the v0.6.0 dataset (eval_results_v0.6.0.jsonl). Tool-error-channel changes affect frontier models' wire on guardrail-fire paths too, but expected movement is small.

Assets 2

Releases: antoinezambelli/forge

v0.7.5: reasoning replay is a bounded policy, default none

Added

Changed

Uh oh!

v0.7.4: malformed-args → tool-error channel + 32GB eval tier

[0.7.4] — 2026年06月03日

Added

Changed

Fixed

Uh oh!

v0.7.3: Native-first proxy

[0.7.3] — 2026年06月01日

Added

Changed

Fixed

Removed

Contributors

Uh oh!

v0.7.2: vLLM backend support

[0.7.2] — 2026年05月24日

Added

Changed

Known limitations

Contributors

Uh oh!

v0.7.1: Anthropic Messages API support for Claude Code + proxy hardening

[0.7.1] — 2026年05月24日

Added

Changed

Fixed

Known limitations

Contributors

Uh oh!

v0.7.0: lineup refresh + tool-error channel + docs + MODEL_REGISTRY

[0.7.0] — 2026年05月22日

Added

Changed

Fixed

Known limitations

Uh oh!