fix: recover stale outbound.db journals after container kills; classify hot-journal poll races (#2516, #2640) #2750

Open

sturdy4days wants to merge 1 commit into nanocoai:main from sturdy4days:fix/outbound-journal-recovery

sturdy4days commented Jun 12, 2026

Fixes #2516, fixes #2640 — two related failure modes of the host's READONLY outbound.db handles. Both issues diagnosed the mechanism correctly and proposed the fix shapes adopted here; credit to their reporters.

#2516 — stale journal after a container SIGKILL

A ceiling/claim-stuck kill (or a bare restart kill) can land mid-transaction, stranding a hot outbound.db-journal. The host's readonly delivery handle cannot perform hot-journal recovery, so every poll errors until the next container spawn opens the file read-write — potentially hours away on a quiet session.

Fix: recoverOutboundJournal(agentGroupId, sessionId) in session-manager.ts — if a journal exists, a brief read-write open triggers SQLite's rollback + journal deletion. Wired as killContainer's onExit (so it only ever runs after the container's exit is confirmed) on:

- both host-sweep.ts kill paths (absolute-ceiling, claim-stuck)
- the bare kill in container-restart.ts (the respawn path doesn't need it, because the fresh container recovers the journal itself)
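
For readers who want the shape of the recovery step, here is a minimal sketch. It is not the code in session-manager.ts. The function name `recoverJournalIfPresent`, the `ReadWriteHandle` interface and the injected `openReadWrite` opener are all hypothetical, introduced only so the sketch stays independent of any particular SQLite binding. It also assumes the opener's `forceRead()` runs a trivial query, since SQLite detects and rolls back a hot journal when a connection first reads the database, not when the file handle is created.

```typescript
import * as fs from "node:fs";

// Hypothetical minimal handle: just enough surface to force a read and close.
interface ReadWriteHandle {
  forceRead(): void; // assumed to run a trivial query, e.g. against sqlite_master
  close(): void;
}

// Hypothetical opener; a real one would wrap the project's SQLite binding.
type OpenReadWrite = (dbPath: string) => ReadWriteHandle;

/**
 * If `<dbPath>-journal` exists, open the database read-write, force one read
 * so SQLite can roll back and delete the journal, then close the handle.
 * Returns true when a recovery attempt was made, false when no journal exists.
 */
export function recoverJournalIfPresent(
  dbPath: string,
  openReadWrite: OpenReadWrite,
): boolean {
  const journalPath = `${dbPath}-journal`;
  if (!fs.existsSync(journalPath)) return false; // nothing stranded

  const db = openReadWrite(dbPath);
  try {
    db.forceRead();
  } finally {
    db.close(); // do not keep a host-side writer open on the container's database
  }
  return true;
}
```

Calling this only from an exit callback keeps the read-write open from racing a container that is still writing, which matches the reason given above for wiring the fix as killContainer's onExit.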

#2640 — transient hot-journal race on live polls

With the container alive and committing (journal_mode=DELETE), a readonly poll landing inside the brief commit window throws SQLITE_READONLY_ROLLBACK ("attempt to write a readonly database"); busy_timeout doesn't apply because the error is not SQLITE_BUSY. The race self-heals on the next poll tick, but it was logged at ERROR: 63 occurrences in one of our field logs.

Fix: isReadonlyRollbackError() classifies the SQLITE_READONLY_* family as the transient race, both at the drainSession read (early return; the next tick retries) and in both poll catch-alls, and downgrades the log level to debug.
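
A classifier of this kind can be sketched as follows. This is an illustration, not the PR's implementation. The name `isReadonlyRollbackErrorSketch` is hypothetical, and the sketch assumes the SQLite binding attaches a string `code` property (such as `"SQLITE_READONLY_ROLLBACK"`) to the errors it throws. Treating plain `SQLITE_READONLY` as part of the family is also an assumption.

```typescript
/**
 * Returns true when `err` is an Error carrying a SQLITE_READONLY-family code.
 * Assumption: the SQLite binding sets a string `code` on thrown errors.
 */
export function isReadonlyRollbackErrorSketch(err: unknown): boolean {
  if (!(err instanceof Error)) return false; // strings, null, etc. are rejected
  const code = (err as { code?: unknown }).code;
  if (typeof code !== "string") return false; // plain errors have no code
  // Assumption: plain SQLITE_READONLY counts alongside the extended codes.
  return code === "SQLITE_READONLY" || code.startsWith("SQLITE_READONLY_");
}
```

Matching on the code prefix instead of the message text keeps SQLITE_BUSY and unrelated failures at ERROR, so only the self-healing race is downgraded.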

Tests

- A genuine hot-journal simulation: a spilled uncommitted transaction's db + journal bytes are captured mid-transaction and byte-restored after close, which is exactly what a SIGKILL mid-write leaves on disk. The test proves the readonly handle fails on it and the read-write open recovers (rollback applied, journal deleted, readonly reads work again).
- 5 discriminator cases for the error classifier (READONLY family matched, SQLITE_BUSY/plain errors/non-errors rejected).
- Existing kill-signature assertions updated for the new onExit argument.

Full suite: 461/461 passing. Both fixes have been running on a production install (the #2516 no-claims gap and the #2640 error noise were both reproduced there first).

🤖 Generated with Claude Code

 fix: recover stale outbound.db journals after container kills; classi...

3c5c58f

...fy hot-journal poll races (nanocoai#2516, nanocoai#2640)
Two related failure modes of the host's READONLY outbound.db handles,
both reported with correct diagnoses in the issues:
- nanocoai#2516: a SIGKILLed container (ceiling/claim-stuck kill, restart) can
 strand a hot outbound.db-journal. A readonly handle cannot perform
 hot-journal recovery, so every delivery poll errors until the NEXT
 container spawn opens the file read-write — potentially hours away.
 Fix: recoverOutboundJournal() (brief read-write open → SQLite rollback
 + journal deletion), wired as killContainer's onExit on both
 host-sweep kill paths and the bare restart kill. The respawn paths
 don't need it — the fresh container recovers the journal itself.
- nanocoai#2640: with the container alive, a readonly poll landing inside a
 live DELETE-mode commit window throws SQLITE_READONLY_ROLLBACK
 ("attempt to write a readonly database"); busy_timeout doesn't apply.
 Self-healing within one poll tick, but it logged at ERROR (63
 occurrences in one field log). Fix: isReadonlyRollbackError()
 classifies the SQLITE_READONLY_* family as the transient race at the
 drainSession read and in both poll catch-alls, downgrading to debug.
Tests: a hot-journal simulation (spilled uncommitted transaction,
byte-restored after close) proving the readonly handle fails and the
read-write open recovers; 5 discriminator cases for the error
classifier. Existing kill-signature assertions updated for the new
onExit argument.
Credit to the reporters of nanocoai#2516 and nanocoai#2640 — both fixes follow the
approaches proposed and field-tested in those issues.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>