Releases: antoinezambelli/forge
v0.7.5: reasoning replay is a bounded policy, default none
5d01dfd Reasoning replay is now a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn. The new reasoning_replay knob bounds that — and after a full re-sweep of the published eval grid showed that dropping replayed reasoning is quality-free and token-cheaper, the default is none. The release also re-baselines the Claude eval tier with extended thinking enabled and adds Anthropic prompt caching with cache-aware cost accounting.
Added
- `reasoning_replay` `{full, keep-last, none}` on `WorkflowRunner(reasoning_replay=...)` and the proxy (`--reasoning-replay`). `full` replays every captured reasoning block (the historical behavior), `keep-last` replays only the most recent, and `none` keeps reasoning out of backend-facing history entirely. The knob is serialization-only: reasoning is still captured and still surfaces in `on_message` and internal history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` rather than assistant `content`, so clients that preserve reasoning fields can replay just the latest block. See ADR-017.
- Reasoning-replay eval grid (`eval_results_v0.7.5.jsonl`, a new eval generation): the full 8–14B lineup re-swept across all three policies × both ablations × native/prompt, ~170k runs. The policy is part of the eval resume key and a first-class report/dashboard dimension: row labels carry `:keep-last` / `:full` tags (untagged = `none`), the dashboard gains a Reasoning Replay filter, the report gains a `--reasoning-replay` filter, and a dedicated reasoning-replay view compares policies per config. A wire-level counter (`reasoning_wire`) validates each policy's on-wire behavior (`none` → exactly 0 replayed reasoning across every run).
- Anthropic extended thinking: `AnthropicClient(thinking=...)` sets a request-side extended-thinking config (e.g. `{"type": "adaptive"}`). When set, a forced `tool_choice` is suppressed (the API requires `auto` with thinking on) and `max_tokens` is raised to fit the thinking budget. The Claude eval baseline now runs Sonnet and Opus with adaptive thinking; all prior Claude rows had thinking off, the wrong baseline for a reasoning-flavored suite. Haiku does not support adaptive thinking and stays non-thinking.
- Anthropic prompt caching: `AnthropicClient(prompt_caching=True)` marks a static ephemeral cache breakpoint over the tool definitions + system prompt (byte-identical every turn, so it read-hits from turn 2 onward instead of re-billing the re-sent schema). `TokenUsage` gains generic `cache_creation_input_tokens` / `cache_read_input_tokens` counters, and eval cost accounting prices cache writes and reads at their actual rates.
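As a rough illustration of cache-aware cost accounting, the sketch below prices the four counters separately. Only the two `cache_*` counter names come from the notes above; the `UsageSketch` and `Rates` types, the rate values, and the assumption that the three input counters are disjoint are all hypothetical and not forge's actual accounting code.

```python
from dataclasses import dataclass


@dataclass
class UsageSketch:
    """Hypothetical stand-in for a usage record; not forge's TokenUsage."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


@dataclass
class Rates:
    """Hypothetical price card: USD per million tokens plus cache multipliers."""
    input_per_mtok: float
    output_per_mtok: float
    cache_write_multiplier: float  # cache writes billed at this multiple of input
    cache_read_multiplier: float   # cache reads billed at this multiple of input


def cost_usd(usage: UsageSketch, rates: Rates) -> float:
    """Price uncached input, cache writes, cache reads, and output separately.

    Assumes the three input counters do not overlap.
    """
    per_in = rates.input_per_mtok / 1_000_000
    per_out = rates.output_per_mtok / 1_000_000
    return (
        usage.input_tokens * per_in
        + usage.cache_creation_input_tokens * per_in * rates.cache_write_multiplier
        + usage.cache_read_input_tokens * per_in * rates.cache_read_multiplier
        + usage.output_tokens * per_out
    )
```

Keeping the multipliers as inputs rather than constants means the same function can price any provider's cache tiers once the real rates are filled in.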
Changed
- Captured reasoning is no longer replayed to the backend by default. Pre-0.7.5 behavior replayed every captured reasoning block (equivalent to `reasoning_replay="full"`); the default is now `"none"`. On the published eval suite, `none` is statistically indistinguishable from replay-all in aggregate while saving the replayed tokens every turn; no per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where `none` and `keep-last` are indistinguishable from each other). The knob is inert for models that emit no reasoning. Migration: `--reasoning-replay full` (proxy) or `WorkflowRunner(reasoning_replay="full")` restores the historical behavior. Anthropic-protocol proxy responses emit reasoning text only under `full`; forge does not synthesize signed Anthropic thinking blocks.
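The three policies can be pictured as a filter applied only when history is serialized for the backend. The sketch below is a minimal illustration under assumptions: messages are plain dicts, captured reasoning lives under a hypothetical `"reasoning"` key, and `apply_reasoning_replay` is an invented helper name rather than forge's implementation.

```python
from typing import Any

POLICIES = ("full", "keep-last", "none")


def apply_reasoning_replay(
    history: list[dict[str, Any]], policy: str = "none"
) -> list[dict[str, Any]]:
    """Return a backend-facing copy of history with reasoning bounded by policy.

    The input list is never mutated, mirroring the serialization-only contract:
    internal history keeps every captured reasoning block.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown reasoning_replay policy: {policy!r}")

    # Index of the most recent message that carries reasoning (-1 if none does).
    last_idx = max(
        (i for i, msg in enumerate(history) if msg.get("reasoning")), default=-1
    )

    wire: list[dict[str, Any]] = []
    for i, msg in enumerate(history):
        out = dict(msg)  # shallow copy so the caller's message is untouched
        keep = policy == "full" or (policy == "keep-last" and i == last_idx)
        if not keep:
            out.pop("reasoning", None)
        wire.append(out)
    return wire
```

Under `none`, no message in the returned list carries reasoning, which is the property the `reasoning_wire` counter is described as checking.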
v0.7.4: malformed-args → tool-error channel + 32GB eval tier
bd99f4d [0.7.4] - 2026-06-03
Malformed tool-call arguments now self-correct on the tool-error channel, and the eval suite gains its first model-size upgrade — a 32GB tier (Qwen3.5 / 3.6 27–35B, Nemotron-3 Nano, Mistral-Small-3.2) surfaced in the dashboard alongside the existing 8–14B lineup.
Added
- Proxy `--max-tool-errors` (default 2): bounds consecutive tool-argument errors per request, mirroring the `WorkflowRunner` budget. Threaded through `ProxyServer` and the HTTP handler.
- 32GB model tier in the published eval and dashboard: Mistral-Small-3.2 24B, Qwen3.5 27B / 35B-A3B, Qwen3.6 27B / 35B-A3B, Nemotron-3 Nano 30B-A3B (moved Unpublished → Current in the Model Registry).
- Eval-generation tracking in the dashboard. Results gathered against different code states fold into a single view, deduped to the newest generation per config. Runs not yet re-swept (e.g. the Anthropic ablation) are carried forward and superscript-badged with a commit/date legend; Retired-tier models are carried forward but hidden behind a `Show retired` toggle.
Changed
- Malformed tool-call arguments ride the tool-error channel. A model that emits a structurally valid call whose `arguments` are unparseable or not an object is now corrected via a tool-error result (`role="tool"`, anchored to its `tool_call_id`) draining `max_tool_errors`, uniformly across all OpenAI-shape clients and all three integration modes (`WorkflowRunner`, proxy, `Guardrails` facade). This supersedes 0.7.3's "malformed args drive a retry nudge" behavior. The change is a native-mode conditioning bet: a small model plausibly self-corrects better on the channel it was pretrained on than via a trailing user nudge. In prompt mode the tool role is downgraded to a user message, so behavior there is unchanged. See ADR-016.
- `Guardrails.check()` gains `action="tool_error"` for tool-call faults (unknown tool, malformed args) so middleware loops account for them on the tool channel. No consumers depended on the prior action vocabulary.
- `ToolCall` / `TextResponse` are now plain dataclasses (`args: Any`); arg-shape validation moved to `ResponseValidator`. Attribute access and keyword construction are unchanged, but the pydantic `.model_*` API on these two exported types is gone, and construction no longer raises on a non-dict `args`. This only affects callers that serialized these objects via pydantic or relied on construction-time validation.
Fixed
- Non-object tool args no longer crash the parser. Previously, `arguments` decoding to a list / scalar / `null` raised at `ToolCall` construction; it is now caught at validation and routed to the tool-error channel. `StepTracker.check_prerequisites` additionally guards against a non-dict `args` reaching a direct dispatch.
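The malformed-args path described in this release can be sketched as a validation step that returns either the decoded object or a tool-role error message anchored to the call. This is an illustrative sketch only: `parse_tool_args`, the `[ToolArgumentError]` prefix, and the message wording are hypothetical and are not taken from forge's `ResponseValidator`.

```python
import json
from typing import Any, Optional


def _tool_error(tool_call_id: str, detail: str) -> dict[str, str]:
    """Build an OpenAI-shape tool message that reports a failed call."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        # Hypothetical prefix, modeled on the bracketed error tags in these notes.
        "content": f"[ToolArgumentError] {detail}",
    }


def parse_tool_args(
    tool_call_id: str, raw_arguments: str
) -> tuple[Optional[dict[str, Any]], Optional[dict[str, str]]]:
    """Return (args, None) on success or (None, tool_error_message) on failure."""
    try:
        decoded = json.loads(raw_arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        return None, _tool_error(tool_call_id, f"arguments are not valid JSON: {exc}")
    if not isinstance(decoded, dict):
        # Lists, scalars, and null parse fine but are not a JSON object.
        kind = type(decoded).__name__
        return None, _tool_error(
            tool_call_id, f"arguments must be a JSON object, got {kind}"
        )
    return decoded, None
```

A caller would append the returned error message to history and count it against its consecutive-error budget (the role `--max-tool-errors` plays in the proxy) before asking the model again.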
v0.7.3: Native-first proxy
92cbdc3 [0.7.3] - 2026-06-01
Native-first proxy. With native function calling now well-supported across modern local models, the proxy defaults to — and is optimized for — native tool calling, forwarding the client's OpenAI tools / messages to the backend verbatim. Prompt-injection remains available as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template, but it is no longer the default path. This release also folds in the OpenAI-compatible client and several proxy / eval fixes that landed on main since 0.7.2.
Added
- `OpenAICompatClient` for arbitrary OpenAI-compatible endpoints. #89 (thanks @lucasgerads).
- `--backend-timeout` proxy option: configurable backend response timeout (default 300s). #91.
- `--backend-capability {native,prompt}` proxy flag: `native` (default) forwards the client's tools / messages verbatim to a function-calling-capable backend; `prompt` opts into prompt-injection for non-FC llama.cpp / llamafile backends. Declared once at startup and frozen, never probed or switched mid-stream.
- Effective `backend_timeout` logged at proxy startup.
Changed
- BREAKING: `--mode {native,prompt}` renamed to `--backend-capability {native,prompt}` (and `ProxyServer(mode=...)` → `ProxyServer(backend_capability=...)`). `--mode` collided with the proxy's managed / external deployment mode; the new name states what it controls (the backend's tool-calling protocol) and reflects that the choice is declared once and frozen, never probed at runtime. There is no deprecation alias (`--mode` was introduced in 0.7.1). Migration: `--mode native` → drop it (native is the default) or `--backend-capability native`; `--mode prompt` → `--backend-capability prompt`.
- Native function calling is now transparent passthrough: the proxy forwards the client's OpenAI tool / message payloads to the backend verbatim instead of round-tripping them through forge's internal `ToolSpec` representation, which dropped schema detail.
- vLLM model identity consolidated to a single source of truth (the wire `model_path` and the registry `model` key are now set together). #75.
- The `prompt` capability is now rejected loudly for ollama / vllm / anthropic backends; previously it was silently ignored for ollama.
- `stream_options` is excluded from proxy passthrough. #94 (thanks @alexandergunnarson).
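The declared-and-frozen capability check can be sketched as a single startup validation. Only the accept/reject split (prompt-injection for llama.cpp / llamafile, loud rejection for ollama / vllm / anthropic) comes from the notes above; the backend identifier strings and the function name are hypothetical.

```python
# Hypothetical backend identifiers; the real CLI values may differ.
NATIVE_ONLY_BACKENDS = {"ollama", "vllm", "anthropic"}
CAPABILITIES = ("native", "prompt")


def validate_backend_capability(backend: str, capability: str = "native") -> str:
    """Validate the capability once at startup; callers treat the result as frozen."""
    if capability not in CAPABILITIES:
        raise ValueError(f"unknown backend capability: {capability!r}")
    if capability == "prompt" and backend in NATIVE_ONLY_BACKENDS:
        # Fail loudly instead of silently ignoring the flag.
        raise ValueError(
            f"backend {backend!r} does not support the 'prompt' capability"
        )
    return capability
```

Raising at startup, rather than falling back to native, keeps a misconfigured deployment from running with a protocol the operator did not ask for.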
Fixed
- Consistent malformed-tool-call / unexpected-response handling across the OpenAI-shape clients: malformed model tool args drive a retry (`TextResponse`) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so a broken provider envelope fails loud.
- `Guardrails.record()` no longer drops tool args for prerequisite tracking. #72 (thanks @hobostay).
- Deprecated asyncio API replaced; proxy server input validation added. #71 (thanks @hobostay).
- Proxy input hardening, non-blocking Ollama stop, client shutdown, and loud arg decode. #86.
- Dead code and a fragile variable reference cleaned up in `LlamafileClient`. #73 (thanks @hobostay).
Removed
- Runtime `auto` function-calling mode in `LlamafileClient`: the proxy never used it, and its mid-request probe-and-switch behavior is replaced by the declared-and-frozen `--backend-capability`.
v0.7.2: vLLM backend support
0f3e6fd [0.7.2] - 2026-05-24
vLLM backend support — serve AWQ/GPTQ and other vLLM-hosted models behind forge's guardrails, in both proxy modes and via WorkflowRunner.
Added
- vLLM backend (`VLLMClient`). OpenAI-compatible client for a vLLM server, consuming vLLM's server-side `tool_calls` and `reasoning` (vLLM 0.21) fields. Native function calling only: vLLM parses tools server-side via `--enable-auto-tool-choice --tool-call-parser`, so there is no prompt-injection mode. Exported from `forge` and `forge.clients`.
- vLLM in managed + external proxy modes. `--backend vllm --model-path <dir|hf-repo-id>` launches and manages a vLLM server; `--backend-url <url> --backend vllm` proxies an externally managed one. `setup_backend()` / `ServerManager` gain a `model_path` parameter (the vLLM identity, distinct from `gguf_path`).
- vLLM served-model-name discovery in external mode. vLLM validates the request `model` field against its `--served-model-name` and 404s on a mismatch (unlike llama.cpp, which ignores the field). The proxy discovers the served name from `/v1/models` instead of sending a placeholder. #74 (thanks @srinathh).
- vLLM section in Backend Setup covering the server flags and `VLLMClient` usage.
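Served-model-name discovery amounts to reading the OpenAI-style model list and using a listed `id` as the request `model`. The sketch below is a minimal illustration with the standard library; the function names are hypothetical, and taking the first listed model is an assumption, not necessarily the rule the proxy applies.

```python
import json
from urllib.request import urlopen


def pick_served_model(models_payload: dict) -> str:
    """Extract a served model name from an OpenAI-style /v1/models payload.

    Assumption: the first listed model is the one to address.
    """
    data = models_payload.get("data") or []
    if not data:
        raise RuntimeError("backend lists no served models")
    return data[0]["id"]


def discover_served_model(base_url: str, timeout: float = 10.0) -> str:
    """Fetch /v1/models from the backend and return the served model name."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urlopen(url, timeout=timeout) as resp:
        return pick_served_model(json.load(resp))
```

Splitting the parse from the fetch keeps the selection rule testable without a live server.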
Changed
- Proxy managed mode now delegates to `setup_backend()` instead of reimplementing the server-start/budget dance, so every managed backend (including vLLM) shares one path. No public API change: `ProxyServer` and the `forge.proxy` CLI keep their v0.7.1 signatures, with `model_path` / `--model-path` and the `vllm` backend added.
- External mode fails fast when a backend reports no context length and no `--budget-tokens` is set, instead of silently falling back to an 8192-token budget that could truncate context. Anthropic-protocol downstreams are unaffected.
Known limitations
- The vLLM backend is unit-validated but was not exercised against a live vLLM server in this release cycle. Its client and server-management code carry full unit coverage, and the proxy's protocol translation is verified end-to-end against llama.cpp (the proxy layer is backend-agnostic). `scripts/integration_test_proxy.py --vllm-url <url>` runs the full request battery against a real vLLM server when one is available.
v0.7.1: Anthropic Messages API support for Claude Code + proxy hardening
f442399 [0.7.1] - 2026-05-24
Proxy hardening: forge now works with Claude Code. First PyPI release to include the Docker, model-pass-through, and token-usage work that landed on main after v0.7.0.
Added
- Anthropic Messages API on the proxy (`POST /v1/messages`). Point Claude Code, or any Anthropic-protocol client, at a forge-guarded model. Two downstream shapes: Path 2 (default, `--backend-protocol openai`) translates Anthropic ↔ OpenAI for local llama.cpp / Ollama and emits Anthropic SSE back; Path 1 (`--backend-protocol anthropic`, external mode) forwards to an Anthropic-shape downstream (LiteLLM, the Anthropic API, a self-hosted proxy), passing unknown fields through verbatim. Adds a `base_url` kwarg on `AnthropicClient`. See the new "Using forge with Claude Code" section in the User Guide.
- `--mode {native,prompt}` proxy flag: run prompt-injected function-calling through the proxy for OpenAI-compatible backends that lack a native tool-calling template, not just native FC. Closes #53.
- Real token-usage reporting through the proxy: responses carry actual prompt/completion counts (previously hardcoded zeros), in both OpenAI (`usage.prompt_tokens` / ...) and Anthropic (`usage.input_tokens` / `output_tokens`) shapes, streaming and non-streaming. #81 (thanks @mhajder).
- Per-request model-name pass-through for external backends: the proxy honors the inbound `model` against external OpenAI-compatible backends. #80 (thanks @mhajder).
- Dockerfile for running the proxy as a container. #79 (thanks @mhajder).
Changed
- `last_usage` unified on slot-keyed `{slot_id: TokenUsage}` across all clients. `AnthropicClient` previously stored a flat `{input_tokens, output_tokens}` dict; it now uses the slot-0 convention `LlamafileClient` / `OllamaClient` already follow, so usage extraction has one contract.
- Inbound `model` rides the proxy's passthrough/extras channel rather than the sampling map, a cleaner replacement for the #80 mechanism that keeps `model` out of `sampling`.
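To make the slot-keyed contract concrete, the sketch below contrasts a reader written for the slot-keyed shape with one that still expects flat keys. The `TokenUsage` class here is a hypothetical stand-in with assumed field names, not forge's actual type.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Stand-in for forge's TokenUsage; the field names are assumptions."""
    input_tokens: int = 0
    output_tokens: int = 0


def total_tokens(last_usage: dict[int, TokenUsage]) -> int:
    """Sum tokens across every slot in a slot-keyed usage mapping."""
    return sum(u.input_tokens + u.output_tokens for u in last_usage.values())


def total_tokens_flat(last_usage: dict) -> int:
    """Reader for the old flat shape.

    Against a slot-keyed mapping neither key exists, so this quietly returns 0.
    """
    return last_usage.get("input_tokens", 0) + last_usage.get("output_tokens", 0)


# Single-slot clients report under slot 0.
slot_keyed = {0: TokenUsage(input_tokens=120, output_tokens=30)}
```

Here `total_tokens(slot_keyed)` counts 150 tokens while `total_tokens_flat(slot_keyed)` counts none, which is why a single contract across clients matters for accounting.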
Fixed
- Proxy no longer hard-imports the optional `anthropic` SDK at load. A plain `forge-guardrails` install (without the `[anthropic]` extra) can now start the proxy for local / OpenAI-shape backends; the SDK is imported lazily and only required for `--backend-protocol anthropic`.
- Proxy router tolerates query strings. Requests like Claude Code's `POST /v1/messages?beta=true` route correctly instead of returning 404.
- `eval_runner` token accounting for local backends: it was silently counting zero tokens because it read the flat `last_usage` keys; it now reads the slot-keyed `TokenUsage` (fixed by the unification above).
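Query-string-tolerant routing comes down to matching on the path component of the request target rather than the raw string. The sketch below shows that idea with the standard library; the route table and handler names are hypothetical and do not reflect the proxy's actual router.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical route table: (method, path) -> handler name.
ROUTES = {
    ("POST", "/v1/messages"): "anthropic_messages",
    ("POST", "/v1/chat/completions"): "openai_chat",
}


def route(method: str, target: str) -> Optional[str]:
    """Resolve a handler by method and path, ignoring any query string."""
    path = urlsplit(target).path
    return ROUTES.get((method.upper(), path))
```

With this shape, `route("POST", "/v1/messages?beta=true")` resolves to the same handler as the bare path, and an unknown path still returns `None` so the caller can answer 404.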
Known limitations
- `cache_control` is not preserved on Path 2. OpenAI Chat Completions has no analog, so prompt-cache hints are dropped when the downstream is a local OpenAI-shape backend. Path 1 (Anthropic-shape downstream) preserves `cache_control` on clean turns. See ADR-015.
- Prompt-mode multi-turn tool convergence is model-dependent. Some models reliably consume prompt-injected tool results across turns; others re-call the same tool. Native FC is the more robust default for heavy multi-turn tool use (e.g. Claude Code).
v0.7.0: lineup refresh + tool-error channel + docs + MODEL_REGISTRY
655e1f6 [0.7.0] - 2026-05-22
Added
- Granite 4.1 8B + Gemma-4-E4B + phi-4 — added to the eval lineup. Granite 4.1 mirrors the IBM greedy-decoding convention pending formal published sampling guidance; phi-4 has no formal sampling recommendation and falls through to backend defaults.
- `_PROMPT_ONLY_MODELS` in `batch_eval`: skips native FC for models lacking training for the OpenAI `tool_calls` schema (currently: phi-4, verified via curl 2026-05-14).
- `_NO_RECOMMENDED_SAMPLING_MODELS` in `batch_eval`: runs `recommended_sampling=False` for models without formal sampling guidance from any official source, so the eval doesn't raise `UnsupportedModelError` on them.
- `MODEL_REGISTRY.md`: new doc enumerating every model forge knows about, classified as Current (in v0.7.0 eval), Retired (cut from current eval), or Unpublished (sampling params staged, no published eval). Includes sampling values, source links, and identity-key conventions.
- Versioned eval datasets: committed dataset files renamed to `eval_results_vX.Y.Z.jsonl`. Prior versions kept in LFS for reproducibility.
- `report.py` `--html` + `--markdown` flags surfaced in README and EVAL_GUIDE examples.
Changed
- Step enforcement + prerequisite violations surface on the tool channel. Previously, `WorkflowRunner` emitted these as trailing `role="user"` nudges after the assistant `tool_call`. v0.7.0 emits one `role="tool"` message per blocked call with `[StepEnforcementError]` / `[PrereqError]` prefixes, the canonical "tool call failed, try again" wire shape OpenAI-tool-trained models are pretrained on. Surfaced by v4 forge-code dogfooding (gpt-oss-120b reliably exhausted prerequisite-violation budget under the old shape).
- Unknown-tool retry on the tool channel. Same refactor applied to the `ResponseValidator` unknown-tool path: `[UnknownToolError]` tool-error reply instead of a user nudge.
- Eval lineup refresh: cut Llama 3.1 8B, Mistral 7B v0.3, Mistral Nemo 12B, Granite 4.0 (h-micro / h-tiny). All scored bare <30% on the v0.6.0 dataset, too weak to be informative, and are superseded by Ministral-3 / Granite 4.1 / phi-4. Sampling defaults retained in `sampling_defaults.py` for backward compatibility (see MODEL_REGISTRY Retired tier).
- Eval dataset: `eval_results_v0.7.0.jsonl` (96,200 rows, 74 cells; rig-01). Apples-to-apples delta on 21 common configs vs v0.6.0: +0.7pt overall, -1.2pt `advanced_reasoning`, both within CI. Published-leaderboard floor lifts +16.9pt via composition (weak-model cuts).
- Dashboard + markdown views regenerated against v0.7.0 dataset. Top of leaderboard reshuffled: Ministral-3 14B Reasoning Q4 LS/N now #1 at 84.5% (was Ministral-3 8B Instruct Q8 LS/P at 86.5% in v0.6.0; now #3 at 84.4%).
- MODEL_GUIDE rewrite — trimmed to opinions + rationale (333 → 145 lines). Full leaderboard, OG-18 100% list, hard suite top-5, models-to-avoid tables moved to the dashboard / markdown views. Sampling-parameters and "backend matters" sections retained. Native-vs-prompt heuristic corrected: not workload-driven, sensitivity is per-family.
- ARCHITECTURE rebuild — cut signature restating (1701 → 165 lines); the doc now covers design principles, surface modes, guardrail rationale, compaction priority rationale, respond-tool rationale, sampling opt-in semantics. Source is authoritative for class signatures; WORKFLOW.md owns the diagrams; ADRs own past decisions.
- BACKEND_SETUP rewrite: cut model-pick prose, Windows-specific install steps, Ollama Modelfile tutorial, llamafile distribution explainer, per-backend "run the eval" subsections, VRAM tables (360 → 135 lines). Per-backend section now: boot command + flag table + curl smoke-test + forge client snippet. Added Anthropic section using `pip install "forge-guardrails[anthropic]"`.
- README opener: leads with the contract (any tools, any order; structure opt-in via `required_steps` / `prerequisites` / `terminal_tool`) before the eval pitch. New "What forge isn't" (not an agent orchestrator, not a coding harness) preempts the conflations that surfaced on HN. Three-ways list reordered with proxy first (most popular entry point). Quick Start swapped from Ollama to llama-server.
Fixed
- WorkflowRunner docstring + tree: added missing `retry_nudge` kwarg, `cancel_event` parameter on `run()`, `PREREQUISITE_NUDGE` + `CONTEXT_WARNING` message types, `MaxIterationsError` / `PrerequisiteError` / `StepEnforcementError` / `WorkflowCancelledError` in Raises lists across docs.
- CompactStrategy + ContextManager signatures in docs: `trigger_tokens` → `budget_tokens` (the strategy owns its own threshold logic now); `compact_threshold` → `context_thresholds` + `on_context_threshold` callbacks.
- `LlamafileClient` constructor docs: added missing sampling kwargs (`top_p`, `top_k`, `min_p`, `repeat_penalty`, `presence_penalty`), `chat_template_kwargs`, `slot_id`.
- MODEL_FAMILIES in `report.py`: added entries for `granite-4.1-8b` (Q4/Q8) and `phi-4-Q4_K_M` so cross-backend rollups in `by-backend.md` group these new models correctly.
- WORKFLOW.md agentic-loop flowchart: node names + edges updated to reflect the tool-error wire shape (`STEP_TOOL_ERROR`, `PREREQ_TOOL_ERROR`, `UNKNOWN_TOOL_ERROR`); compaction-priority table fixed (`step_nudge` and `prerequisite_nudge` are `role=tool`, `retry_nudge` remains `role=user`).
- Stale `bfcl/` reference removed from WORKFLOW.md module diagram (directory was removed pre-v0.7.0; ADR-009 retained as historical artifact).
Known limitations
- Anthropic numbers not re-measured in v0.7.0. The Anthropic ablation matrix (~$272 to run) was not re-executed for v0.7.0. Numbers cited in any v0.7.0 doc are from the v0.6.0 dataset (`eval_results_v0.6.0.jsonl`). Tool-error-channel changes affect frontier models' wire on guardrail-fire paths too, but expected movement is small.