-
Notifications
You must be signed in to change notification settings - Fork 229
feat: add KB mutation locks and atomic state writes#86
Conversation
@KylinMountain
KylinMountain
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: KB mutation locks & atomic state writes
The lock + atomic-write primitives are well built (clean reentrancy accounting, correct os.replace pattern, good tests). My concern is mostly where the locks are applied — a couple of high-impact gaps make the protection ineffective or actively disruptive in the most common usage. Findings ranked by severity.
🔴 1. watch holds the exclusive lock for its entire lifetime
watch is decorated @_with_kb_lock(exclusive=True), but watch_directory() blocks indefinitely (while observer.is_alive(): observer.join(...)). So the exclusive lock is held for the whole watch session — every other command, including read-only list/status (shared lock, blocked by the exclusive holder), is locked out until watch is killed.
→ Take the lock per-ingestion inside on_new_files (around each add_single_file), not around the whole command.
🔴 2. chat bypasses the lock protocol entirely
chat has no _with_kb_lock decorator, yet its REPL mutates the KB via /add (add_single_file, chat.py:495/500) and /lint (run_lint, chat.py:752). Those functions hold no internal lock — locking lives only in the CLI decorator. So a /add inside an open chat session races freely against a concurrent openkb add (which did take the exclusive lock): the two ingests interleave and one's concept-page / hashes.json writes silently clobber the other's. The exclusive lock on CLI add is worthless against this path.
→ Either lock the chat mutation operations, or push locking down into add_single_file/run_lint.
🟠 3. register_kb global-config read-modify-write is unguarded (wrong lock altitude)
register_kb does load_global_config() → append → save_global_config() on the shared ~/.config/openkb/global.yaml. The new locks are per-KB (.openkb/ingest.lock), and init/use aren't decorated at all. atomic_write_text makes each write un-torn but does not serialize the load→append→save sequence. Two concurrent init/use in different KBs → both read the same known_kbs, each appends its own path, last writer wins → the other registration (and default_kb) is silently lost. This is exactly the lost-update the PR title implies it prevents, but the lock is scoped to the wrong file. Consider a global-config lock (or pushing the lock inside register_kb).
🟠 4. lint holds the exclusive lock across a multi-minute LLM run
lint is exclusive, but without --fix it's read-only, and run_lint invokes an LLM knowledge-lint agent that can run for minutes — all while holding the write lock. Every list/status from another process blocks for that whole window. Use a shared lock for read-only lint, or take the write lock only around the --fix link rewrite and release before the LLM phase.
🟡 5. atomic_write_* never fsyncs the parent directory after os.replace
The temp file is fsync'd, but the directory entry (the rename) is not. On ext4/xfs a crash right after os.replace can lose the rename while the data survives — a reader after reboot sees the old content even though the call "succeeded". Atomicity holds; durability of the commit doesn't. Add an os.open(parent, O_DIRECTORY) + os.fsync after replace. (atomic_write_binary has the same gap.)
🟡 6. mkstemp silently tightens file permissions to 0600
tempfile.mkstemp creates the temp at 0600 and os.replace carries that onto the destination, ignoring umask. A config.yaml that was 0644 becomes owner-only after the first save_config/save_global_config rewrite, and every new file is 0600 regardless of umask — a behavior change from the old path.open("w"). chmod the temp to match the existing file's mode (or to 0o666 & ~umask) before replace.
🟡 7. Shared read lock is serialized within a process
The per-path RLock is held for the whole lock section, so two threads in the same process can never hold LOCK_SH concurrently — first-acquire from a second thread blocks on the RLock before reaching the OS shared lock. Cross-process readers are fine, so impact is limited today, but the "shared read lock" doesn't deliver shared semantics in-process (relevant if chat/agent ever reads concurrently across threads).
🟡 8. fcntl.flock is silently ineffective on network filesystems
KBs are often kept on NFS / Dropbox / iCloud Drive. flock may be a no-op there; the code records the lock as held and proceeds with no detection, so two hosts can mutate the same KB concurrently and corrupt state. Worth documenting the limitation, or detecting non-local backing stores.
⚪ 9. Dead code + divergent KB-detection predicate
maybe_kb_ingest_lock, maybe_kb_read_lock, and atomic_write_binary are defined but referenced nowhere. The decorator reimplements "lock only if a KB exists" inline using a different predicate (_find_kb_dir(...) is None) than maybe_* ((kb_dir/'.openkb').is_dir()). A future caller reaching for the conveniently-named maybe_* would get different behavior (no cwd-walk / global-default resolution) and could lock the wrong directory. Suggest deleting the unused helpers and keeping one source of truth.
⚪ 10. hashes.json format flip on first write
atomic_write_json appends a trailing newline, so _persist writes {...}\n, but init seeds the file via json.dumps({}) with no newline. The byte format flips on the first add() (spurious diff / checksum mismatch). Route the init seed through atomic_write_json too. (Minor related note: failed writes under SIGKILL/ENOSPC can leave hidden .<name>.*.tmp orphans in .openkb/ that nothing sweeps.)
#1 and #2 are the ones I'd block on — they make the locking ineffective (chat) or actively disruptive (watch) in normal use. The rest are incremental hardening.
cnndabbler
commented
Jun 9, 2026
Tested this together with #15 (SQLite registry) on current main: they merge cleanly (both touch state.py/cli.py but in non-overlapping regions) and the full suite passes — 705 passed on the combined branch. Also ran it through a real 165-doc ingest. The atomic writes + ingest.lock fixed a hashes.json/index.md read-modify-write race I hit when two ingest processes touched the same KB. 👍
5c4cfbc to
563f3a9
Compare
gwokhou
commented
Jun 14, 2026
@KylinMountain Thanks for the review. I updated the PR as follows:
- Fixed:
watchno longer holds the lock for the full lifetime. - Fixed: chat
/add,/lint,/save,/status,/list, and query logging now use
the lock protocol. - Fixed:
register_kbglobal config RMW is now locked. - Fixed:
lintno longer holds an exclusive lock across the LLM run. - Fixed: atomic replace now fsyncs the parent directory.
- Fixed: atomic writes now preserve/derive file mode correctly.
- Fixed: in-process shared reads can now run concurrently.
- Documented:
fcntl.flockis advisory/local-FS only; network/synced FS is not
guaranteed. - Addressed by scope: removed the unused speculative helpers from this PR (will use in future PRs).
- Fixed:
hashes.jsoninit now uses the same atomic JSON writer.
Verified with:
uv run pytest tests/test_locks.py tests/test_config.py tests/test_cli.py tests/ test_lint_cli.py
Tested this together with #15 (SQLite registry) on current
main: they merge cleanly (both touchstate.py/cli.pybut in non-overlapping regions) and the full suite passes — 705 passed on the combined branch. Also ran it through a real 165-doc ingest. The atomic writes +ingest.lockfixed ahashes.json/index.mdread-modify-write race I hit when two ingest processes touched the same KB. 👍
@cnndabbler Thanks for testing this with #15 and a real 165-doc ingest. Very helpful validation. I’ve rebased and updated this PR while keeping the scope limited to the locking/atomic-write boundary, so #15 should still be able to layer on top cleanly.
KylinMountain
commented
Jun 14, 2026
Thanks for the quick turnaround, @gwokhou — all 10 review points are addressed and the lock / atomic-write implementation looks solid. Since #96 (the doc_name collision fix) just landed on main, your branch picked up a small conflict in openkb/cli.py, so I merged latest main in and resolved it for you (_registry_path now sits alongside your locks imports, and #96's add_single_file changes live in your renamed _add_single_file_locked). Full suite green —728 passed. 🙏
I'll go ahead and merge this PR — really appreciate your contribution!
@KylinMountain
KylinMountain
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Uh oh!
There was an error while loading. Please reload this page.
Summary
This PR adds cooperative KB-level locking and atomic state/config writes.
It is a reliability hardening change for the current CLI/chat flows and a foundation for future concurrent ingestion / compilation work. The scope is intentionally limited to synchronization and safer persistence; concurrent ingestion and transactional recovery are left for follow-up PRs.
Why
Several OpenKB commands mutate shared KB files such as
.openkb/hashes.json,.openkb/config.yaml,wiki/,raw/, and generated reports. If two OpenKB processes modify the same KB at the same time, or if a process is interrupted while writing state, the KB can be left in a partially written or inconsistent state.This matters today for commands such as
openkb add,openkb remove,openkb recompile,openkb watch,openkb lint --fix, and matching chat slash commands.Changes
openkb.lockswith KB-level advisory locks:fcntl.flockprotocol.os.replace.hashes.jsonseeding duringopenkb initadd_single_filetakes the mutation lock, so CLIadd,watch, and chat/addare covered without locking the full watch lifetime.lint --fixtakes the mutation lock only while applying fixes.run_linttakes a shared read lock while reading/running lint and a mutation lock only when writing the report/log./save,/status,/list, and query logging use the appropriate locks.Notes
The lock is advisory and cooperative: it protects OpenKB commands that use these helpers. It does not make arbitrary external edits to the KB directory safe, and it does not guarantee cross-host safety on networked or synced filesystems where
fcntl.flockmay be unavailable or inconsistent.Testing
uv run pytest tests/test_locks.py tests/test_config.py tests/test_cli.py tests/test_lint_cli.py