fix: recover stale outbound.db journals after container kills; classify hot-journal poll races (#2516, #2640) #2750

Open: sturdy4days wants to merge 1 commit into nanocoai:main from sturdy4days:fix/outbound-journal-recovery

sturdy4days commented Jun 12, 2026

Fixes #2516, fixes #2640 — two related failure modes of the host's READONLY outbound.db handles. Both issues diagnosed the mechanism correctly and proposed the fix shapes adopted here; credit to their reporters.

#2516 — stale journal after a container SIGKILL

A ceiling/claim-stuck kill (or a bare restart kill) can land mid-transaction, stranding a hot outbound.db-journal. The host's readonly delivery handle cannot perform hot-journal recovery, so every poll errors until the next container spawn opens the file read-write — potentially hours away on a quiet session.

Fix: recoverOutboundJournal(agentGroupId, sessionId) in session-manager.ts — if a journal exists, a brief read-write open triggers SQLite's rollback + journal deletion. Wired as killContainer's onExit (so it only ever runs after the container's exit is confirmed) on:

  • both host-sweep.ts kill paths (absolute-ceiling, claim-stuck)
  • the bare kill in container-restart.ts (the respawn path doesn't need it — the fresh container recovers the journal itself)
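The recovery step and its wiring can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the SQLite binding and the real `killContainer` signature are not shown in this description, so the read-write open is injected through a hypothetical `RecoveryDeps` object, and `recoverJournalAt`, `RwHandle`, `KillableProcess`, and `killWithOnExit` are all assumed names. The sketch relies on one SQLite behavior: a hot journal is rolled back when a read-write connection first reads the database, so the handle performs one cheap read before closing.

```typescript
// Hypothetical minimal view of a read-write SQLite handle.
interface RwHandle {
  touch(): void; // run any cheap read, e.g. SELECT count(*) FROM sqlite_master
  close(): void;
}

// Injected dependencies (assumed names), so the sketch is binding-agnostic.
interface RecoveryDeps {
  journalExists(journalPath: string): boolean; // e.g. fs.existsSync
  openReadWrite(dbPath: string): RwHandle; // a read-write open in your binding
}

// Returns true if a recovery open was attempted, false if no journal was found.
function recoverJournalAt(dbPath: string, deps: RecoveryDeps): boolean {
  const journalPath = `${dbPath}-journal`; // rollback journal sits next to the db
  if (!deps.journalExists(journalPath)) return false; // nothing stranded
  const handle = deps.openReadWrite(dbPath);
  try {
    handle.touch(); // the first read is where SQLite replays a hot journal
  } finally {
    handle.close(); // never keep the read-write handle open
  }
  return true;
}

// Hypothetical stand-in for a kill helper with an onExit hook.
interface KillableProcess {
  kill(signal: string): void;
  once(event: "exit", listener: () => void): void;
}

function killWithOnExit(proc: KillableProcess, onExit?: () => void): void {
  // Register first so the hook cannot miss a fast exit; it only fires once
  // the process has actually exited, never while it may still be writing.
  if (onExit) proc.once("exit", onExit);
  proc.kill("SIGKILL");
}
```

Running recovery only from the exit hook matters because a read-write open while the container is still mid-transaction would contend with a live writer rather than clean up after a dead one.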

#2640 — transient hot-journal race on live polls

With the container alive and committing (journal_mode=DELETE), a readonly poll that lands inside the brief commit window throws SQLITE_READONLY_ROLLBACK ("attempt to write a readonly database"). busy_timeout doesn't help because the error is not SQLITE_BUSY. The race self-heals on the next poll tick, but it was logged at ERROR: 63 occurrences in one of our field logs.

Fix: isReadonlyRollbackError() classifies the SQLITE_READONLY_* family as this transient race. The classifier is applied at the drainSession read (early return; the next tick retries) and in both poll catch-alls, and matching errors are logged at debug instead of ERROR.
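One plausible shape for the classifier and a poll catch-all is sketched below. It is a guess at the structure, not the code in this PR. It assumes the SQLite binding attaches a string `code` such as `"SQLITE_READONLY_ROLLBACK"` to thrown errors (better-sqlite3's `SqliteError` does this); `PollLogger` and `pollOnce` are hypothetical names used only for illustration.

```typescript
// Assumption: thrown SQLite errors are Error instances carrying a string
// `code` like "SQLITE_READONLY_ROLLBACK". Adjust for bindings that differ.
function isReadonlyRollbackError(err: unknown): boolean {
  if (!(err instanceof Error)) return false; // non-errors are never the race
  const code = (err as Error & { code?: unknown }).code;
  if (typeof code !== "string") return false; // plain errors have no code
  // Match the base code and every extended SQLITE_READONLY_* code.
  return code === "SQLITE_READONLY" || code.startsWith("SQLITE_READONLY_");
}

// Hypothetical logger shape.
interface PollLogger {
  debug(msg: string): void;
  error(msg: string): void;
}

// Hypothetical poll wrapper: the transient race is logged at debug and left
// for the next tick to retry; everything else still logs at error.
function pollOnce(
  read: () => void,
  log: PollLogger,
): "ok" | "transient" | "failed" {
  try {
    read();
    return "ok";
  } catch (err) {
    if (isReadonlyRollbackError(err)) {
      log.debug("outbound.db mid-commit; retrying on next tick");
      return "transient";
    }
    log.error(`outbound poll failed: ${String(err)}`);
    return "failed";
  }
}
```

Matching on the code rather than the message text keeps SQLITE_BUSY and unrelated failures at ERROR, which mirrors the discriminator cases listed under Tests.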

Tests

  • A genuine hot-journal simulation: a spilled uncommitted transaction's db + journal bytes are captured mid-transaction and byte-restored after close — exactly what a SIGKILL mid-write leaves on disk. The test proves the readonly handle fails on it and the read-write open recovers (rollback applied, journal deleted, readonly reads work again).
  • 5 discriminator cases for the error classifier (READONLY family matched, SQLITE_BUSY/plain errors/non-errors rejected).
  • Existing kill-signature assertions updated for the new onExit argument.

Full suite: 461/461 passing. Both fixes have been running on a production install (the #2516 no-claims gap and the #2640 error noise were both reproduced there first).

🤖 Generated with Claude Code

...fy hot-journal poll races (nanocoai#2516, nanocoai#2640)
Two related failure modes of the host's READONLY outbound.db handles,
both reported with correct diagnoses in the issues:
- nanocoai#2516: a SIGKILLed container (ceiling/claim-stuck kill, restart) can
 strand a hot outbound.db-journal. A readonly handle cannot perform
 hot-journal recovery, so every delivery poll errors until the NEXT
 container spawn opens the file read-write — potentially hours away.
 Fix: recoverOutboundJournal() (brief read-write open → SQLite rollback
 + journal deletion), wired as killContainer's onExit on both
 host-sweep kill paths and the bare restart kill. The respawn paths
 don't need it — the fresh container recovers the journal itself.
- nanocoai#2640: with the container alive, a readonly poll landing inside a
 live DELETE-mode commit window throws SQLITE_READONLY_ROLLBACK
 ("attempt to write a readonly database"); busy_timeout doesn't apply.
 Self-healing within one poll tick, but it logged at ERROR (63
 occurrences in one field log). Fix: isReadonlyRollbackError()
 classifies the SQLITE_READONLY_* family as the transient race at the
 drainSession read and in both poll catch-alls, downgrading to debug.
Tests: a hot-journal simulation (spilled uncommitted transaction,
byte-restored after close) proving the readonly handle fails and the
read-write open recovers; 5 discriminator cases for the error
classifier. Existing kill-signature assertions updated for the new
onExit argument.
Credit to the reporters of nanocoai#2516 and nanocoai#2640 — both fixes follow the
approaches proposed and field-tested in those issues.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Reviewers

gavrielc (code owner): awaiting requested review

gabi-simons (code owner): awaiting requested review
