fix: recover stale outbound.db journals after container kills; classify hot-journal poll races (#2516, #2640) #2750
Open
sturdy4days wants to merge 1 commit into
...fy hot-journal poll races (nanocoai#2516, nanocoai#2640)

Two related failure modes of the host's READONLY outbound.db handles, both reported with correct diagnoses in the issues:

- nanocoai#2516: a SIGKILLed container (ceiling/claim-stuck kill, restart) can strand a hot outbound.db-journal. A readonly handle cannot perform hot-journal recovery, so every delivery poll errors until the NEXT container spawn opens the file read-write, potentially hours away. Fix: recoverOutboundJournal() (brief read-write open → SQLite rollback + journal deletion), wired as killContainer's onExit on both host-sweep kill paths and the bare restart kill. The respawn paths don't need it: the fresh container recovers the journal itself.
- nanocoai#2640: with the container alive, a readonly poll landing inside a live DELETE-mode commit window throws SQLITE_READONLY_ROLLBACK ("attempt to write a readonly database"); busy_timeout doesn't apply. Self-healing within one poll tick, but it logged at ERROR (63 occurrences in one field log). Fix: isReadonlyRollbackError() classifies the SQLITE_READONLY_* family as the transient race at the drainSession read and in both poll catch-alls, downgrading to debug.

Tests: a hot-journal simulation (spilled uncommitted transaction, byte-restored after close) proving the readonly handle fails and the read-write open recovers; 5 discriminator cases for the error classifier. Existing kill-signature assertions updated for the new onExit argument.

Credit to the reporters of nanocoai#2516 and nanocoai#2640: both fixes follow the approaches proposed and field-tested in those issues.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
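The #2516 recovery flow described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: `recoverJournalSketch`, `killThenRecover`, and the `RecoveryDeps` interface are hypothetical names, and file-system and SQLite access are injected so the control flow can be read without committing to a particular driver. It also assumes the read-write connection issues at least one read, since SQLite checks for a hot journal when a connection first reads the file, not when the handle is created.

```typescript
// Hypothetical sketch, not the PR's implementation. File-system and SQLite
// access are injected so the control flow is visible without a driver.
interface RecoveryDeps {
  journalExists(journalPath: string): boolean; // e.g. fs.existsSync
  // Assumed thin wrapper over a driver's read-write open.
  openReadWrite(dbPath: string): { readOnce(): void; close(): void };
}

// Returns true when a journal was present and a recovery open was attempted.
function recoverJournalSketch(dbPath: string, deps: RecoveryDeps): boolean {
  // In DELETE journal mode SQLite names the rollback journal "<db>-journal".
  const journalPath = `${dbPath}-journal`;
  if (!deps.journalExists(journalPath)) return false; // nothing stranded

  const db = deps.openReadWrite(dbPath);
  try {
    // The first read lets a read-write connection detect the hot journal,
    // roll it back, and delete it; a readonly connection cannot do this.
    db.readOnce();
  } finally {
    db.close(); // keep the read-write handle as brief as possible
  }
  return true;
}

// Hypothetical wiring: recovery runs only from the kill's onExit callback,
// i.e. after the container's exit has been confirmed.
function killThenRecover(
  kill: (onExit: () => void) => void,
  dbPath: string,
  deps: RecoveryDeps,
): void {
  kill(() => {
    recoverJournalSketch(dbPath, deps);
  });
}
```

Running recovery from `onExit` rather than right after the kill signal matters because a read-write open while the container is still mid-commit could roll back a transaction that is still live.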
sturdy4days requested review from gabi-simons and gavrielc as code owners June 12, 2026 14:49
This was referenced Jun 13, 2026
Fixes #2516, fixes #2640: two related failure modes of the host's READONLY outbound.db handles. Both issues diagnosed the mechanism correctly and proposed the fix shapes adopted here; credit to their reporters.

#2516: stale journal after a container SIGKILL
A ceiling/claim-stuck kill (or a bare restart kill) can land mid-transaction, stranding a hot outbound.db-journal. The host's readonly delivery handle cannot perform hot-journal recovery, so every poll errors until the next container spawn opens the file read-write, potentially hours away on a quiet session.

Fix: recoverOutboundJournal(agentGroupId, sessionId) in session-manager.ts. If a journal exists, a brief read-write open triggers SQLite's rollback + journal deletion. Wired as killContainer's onExit (so it only ever runs after the container's exit is confirmed) on:

- host-sweep.ts kill paths (absolute-ceiling, claim-stuck)
- container-restart.ts (the respawn path doesn't need it: the fresh container recovers the journal itself)

#2640: transient hot-journal race on live polls
With the container alive and committing (journal_mode=DELETE), a readonly poll landing inside the brief commit window throws SQLITE_READONLY_ROLLBACK ("attempt to write a readonly database"); busy_timeout doesn't apply because it isn't BUSY. The race self-heals on the next poll tick, but it logged at ERROR: 63 occurrences in one of our field logs.

Fix: isReadonlyRollbackError() classifies the SQLITE_READONLY_* family as the transient race at the drainSession read (early-return, next tick retries) and in both poll catch-alls, downgrading to debug.

Tests
- Hot-journal simulation: a spilled uncommitted transaction's db + journal bytes are captured mid-transaction and byte-restored after close, exactly what a SIGKILL mid-write leaves on disk. The test proves the readonly handle fails on it and the read-write open recovers (rollback applied, journal deleted, readonly reads work again).
- 5 discriminator cases for the error classifier (SQLITE_BUSY / plain errors / non-errors rejected).
- Existing kill-signature assertions updated for the new onExit argument.

Full suite: 461/461 passing. Both fixes have been running on a production install (the #2516 no-claims gap and the #2640 error noise were both reproduced there first).
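For reviewers, the #2640 classification and log downgrade could look something like the sketch below. It is a guess at the shape, not the PR's code: it assumes the SQLite driver attaches a string `code` (for example "SQLITE_READONLY_ROLLBACK") to thrown errors, that a prefix match is an acceptable way to cover the SQLITE_READONLY_* family, and `pollOnce` and `PollLogger` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical sketch of the classifier; the PR's isReadonlyRollbackError()
// may differ. Assumes the driver sets a string `code` on thrown errors.
function isReadonlyRollbackError(err: unknown): boolean {
  if (!(err instanceof Error)) return false; // non-errors are rejected
  const code = (err as Error & { code?: unknown }).code;
  // Prefix match covers the SQLITE_READONLY_* family (and, by assumption,
  // the bare SQLITE_READONLY code); SQLITE_BUSY and plain errors fall through.
  return typeof code === "string" && code.startsWith("SQLITE_READONLY");
}

type PollLogger = { debug(msg: string): void; error(msg: string): void };

// Hypothetical poll wrapper: the transient race is logged at debug and left
// for the next tick; anything else is still reported at error.
function pollOnce(
  read: () => void,
  log: PollLogger,
): "ok" | "retry" | "failed" {
  try {
    read();
    return "ok";
  } catch (err) {
    if (isReadonlyRollbackError(err)) {
      log.debug("outbound.db read raced a commit; retrying next tick");
      return "retry";
    }
    log.error(`outbound.db poll failed: ${String(err)}`);
    return "failed";
  }
}
```

Keying on the error code rather than the message text keeps the check independent of SQLite's wording, and requiring an `Error` instance means a stray string or null thrown elsewhere is never silently downgraded.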
🤖 Generated with Claude Code