feat: add KB mutation locks and atomic state writes#86

Merged

KylinMountain merged 4 commits into

VectifyAI:main from

gwokhou:pr/kb-mutation-locking

Jun 14, 2026

Merged

feat: add KB mutation locks and atomic state writes #86
KylinMountain merged 4 commits into
VectifyAI:main from
gwokhou:pr/kb-mutation-locking

Conversation

@gwokhou

@gwokhou gwokhou commented Jun 3, 2026 •

edited

Loading

Copy link

Copy Markdown

Contributor

Summary

This PR adds cooperative KB-level locking and atomic state/config writes.

It is a reliability hardening change for the current CLI/chat flows and a foundation for future concurrent ingestion / compilation work. The scope is intentionally limited to synchronization and safer persistence; concurrent ingestion and transactional recovery are left for follow-up PRs.

Why

Several OpenKB commands mutate shared KB files such as .openkb/hashes.json, .openkb/config.yaml, wiki/, raw/, and generated reports. If two OpenKB processes modify the same KB at the same time, or if a process is interrupted while writing state, the KB can be left in a partially written or inconsistent state.

This matters today for commands such as openkb add, openkb remove, openkb recompile, openkb watch, openkb lint --fix, and matching chat slash commands.

Changes

Add openkb.locks with KB-level advisory locks:
- exclusive locks for mutation sections
- shared locks for read-only sections
- reentrant same-thread lock handling
- in-process concurrent shared readers
- explicit rejection of read-to-write lock upgrades
Document the local-filesystem/advisory nature of the fcntl.flock protocol.
Add atomic write helpers for text and JSON writes.
Preserve target file permissions on atomic replace and fsync the parent directory after os.replace.
Use atomic writes for:
- KB config writes
- global config writes
- hash registry persistence
- initial hashes.json seeding during openkb init
Guard global config read-modify-write with a global config lock.
Apply KB locks at the operation boundary:
- add_single_file takes the mutation lock, so CLI add, watch, and chat /add are covered without locking the full watch lifetime.
- lint --fix takes the mutation lock only while applying fixes.
- run_lint takes a shared read lock while reading/running lint and a mutation lock only when writing the report/log.
- chat /save, /status, /list, and query logging use the appropriate locks.
Keep unused speculative helpers out of this PR.

Notes

The lock is advisory and cooperative: it protects OpenKB commands that use these helpers. It does not make arbitrary external edits to the KB directory safe, and it does not guarantee cross-host safety on networked or synced filesystems where fcntl.flock may be unavailable or inconsistent.

Testing

uv run pytest tests/test_locks.py tests/test_config.py tests/test_cli.py tests/test_lint_cli.py

KylinMountain

KylinMountain reviewed

Jun 8, 2026

View reviewed changes

@KylinMountain KylinMountain left a comment

Copy link

Copy Markdown

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: KB mutation locks & atomic state writes

The lock + atomic-write primitives are well built (clean reentrancy accounting, correct os.replace pattern, good tests). My concern is mostly where the locks are applied — a couple of high-impact gaps make the protection ineffective or actively disruptive in the most common usage. Findings ranked by severity.

🔴 1. `watch` holds the exclusive lock for its entire lifetime

watch is decorated @_with_kb_lock(exclusive=True), but watch_directory() blocks indefinitely (while observer.is_alive(): observer.join(...)). So the exclusive lock is held for the whole watch session — every other command, including read-only list/status (shared lock, blocked by the exclusive holder), is locked out until watch is killed.
→ Take the lock per-ingestion inside on_new_files (around each add_single_file), not around the whole command.

🔴 2. `chat` bypasses the lock protocol entirely

chat has no _with_kb_lock decorator, yet its REPL mutates the KB via /add (add_single_file, chat.py:495/500) and /lint (run_lint, chat.py:752). Those functions hold no internal lock — locking lives only in the CLI decorator. So a /add inside an open chat session races freely against a concurrent openkb add (which did take the exclusive lock): the two ingests interleave and one's concept-page / hashes.json writes silently clobber the other's. The exclusive lock on CLI add is worthless against this path.
→ Either lock the chat mutation operations, or push locking down into add_single_file/run_lint.

🟠 3. `register_kb` global-config read-modify-write is unguarded (wrong lock altitude)

register_kb does load_global_config() → append → save_global_config() on the shared ~/.config/openkb/global.yaml. The new locks are per-KB (.openkb/ingest.lock), and init/use aren't decorated at all. atomic_write_text makes each write un-torn but does not serialize the load→append→save sequence. Two concurrent init/use in different KBs → both read the same known_kbs, each appends its own path, last writer wins → the other registration (and default_kb) is silently lost. This is exactly the lost-update the PR title implies it prevents, but the lock is scoped to the wrong file. Consider a global-config lock (or pushing the lock inside register_kb).

🟠 4. `lint` holds the exclusive lock across a multi-minute LLM run

lint is exclusive, but without --fix it's read-only, and run_lint invokes an LLM knowledge-lint agent that can run for minutes — all while holding the write lock. Every list/status from another process blocks for that whole window. Use a shared lock for read-only lint, or take the write lock only around the --fix link rewrite and release before the LLM phase.

🟡 5. `atomic_write_*` never fsyncs the parent directory after `os.replace`

The temp file is fsync'd, but the directory entry (the rename) is not. On ext4/xfs a crash right after os.replace can lose the rename while the data survives — a reader after reboot sees the old content even though the call "succeeded". Atomicity holds; durability of the commit doesn't. Add an os.open(parent, O_DIRECTORY) + os.fsync after replace. (atomic_write_binary has the same gap.)

🟡 6. `mkstemp` silently tightens file permissions to 0600

tempfile.mkstemp creates the temp at 0600 and os.replace carries that onto the destination, ignoring umask. A config.yaml that was 0644 becomes owner-only after the first save_config/save_global_config rewrite, and every new file is 0600 regardless of umask — a behavior change from the old path.open("w"). chmod the temp to match the existing file's mode (or to 0o666 & ~umask) before replace.

🟡 7. Shared read lock is serialized within a process

The per-path RLock is held for the whole lock section, so two threads in the same process can never hold LOCK_SH concurrently — first-acquire from a second thread blocks on the RLock before reaching the OS shared lock. Cross-process readers are fine, so impact is limited today, but the "shared read lock" doesn't deliver shared semantics in-process (relevant if chat/agent ever reads concurrently across threads).

🟡 8. `fcntl.flock` is silently ineffective on network filesystems

KBs are often kept on NFS / Dropbox / iCloud Drive. flock may be a no-op there; the code records the lock as held and proceeds with no detection, so two hosts can mutate the same KB concurrently and corrupt state. Worth documenting the limitation, or detecting non-local backing stores.

⚪ 9. Dead code + divergent KB-detection predicate

maybe_kb_ingest_lock, maybe_kb_read_lock, and atomic_write_binary are defined but referenced nowhere. The decorator reimplements "lock only if a KB exists" inline using a different predicate (_find_kb_dir(...) is None) than maybe_* ((kb_dir/'.openkb').is_dir()). A future caller reaching for the conveniently-named maybe_* would get different behavior (no cwd-walk / global-default resolution) and could lock the wrong directory. Suggest deleting the unused helpers and keeping one source of truth.

⚪ 10. `hashes.json` format flip on first write

atomic_write_json appends a trailing newline, so _persist writes {...}\n, but init seeds the file via json.dumps({}) with no newline. The byte format flips on the first add() (spurious diff / checksum mismatch). Route the init seed through atomic_write_json too. (Minor related note: failed writes under SIGKILL/ENOSPC can leave hidden .<name>.*.tmp orphans in .openkb/ that nothing sweeps.)

#1 and #2 are the ones I'd block on — they make the locking ineffective (chat) or actively disruptive (watch) in normal use. The rest are incremental hardening.

@cnndabbler

cnndabbler commented Jun 9, 2026

Copy link

Copy Markdown

Contributor

Tested this together with #15 (SQLite registry) on current main: they merge cleanly (both touch state.py/cli.py but in non-overlapping regions) and the full suite passes — 705 passed on the combined branch. Also ran it through a real 165-doc ingest. The atomic writes + ingest.lock fixed a hashes.json/index.md read-modify-write race I hit when two ingest processes touched the same KB. 👍

@cnndabbler cnndabbler mentioned this pull request

Jun 9, 2026

Add SQLite-backed registry with JSON migration support #15

Open

gwokhou added 3 commits

June 14, 2026 16:42

@gwokhou


 feat: add KB lock primitives

d1891ce

@gwokhou


 chore: make KB state writes atomic

b083f79

@gwokhou


 feat: apply KB locks to CLI commands

563f3a9

@gwokhou gwokhou force-pushed the pr/kb-mutation-locking branch from 5c4cfbc to 563f3a9 Compare

June 14, 2026 08:42

@gwokhou

gwokhou commented Jun 14, 2026

Copy link

Copy Markdown

Contributor Author

@KylinMountain Thanks for the review. I updated the PR as follows:

Fixed: watch no longer holds the lock for the full lifetime.
Fixed: chat /add, /lint, /save, /status, /list, and query logging now use
the lock protocol.
Fixed: register_kb global config RMW is now locked.
Fixed: lint no longer holds an exclusive lock across the LLM run.
Fixed: atomic replace now fsyncs the parent directory.
Fixed: atomic writes now preserve/derive file mode correctly.
Fixed: in-process shared reads can now run concurrently.
Documented: fcntl.flock is advisory/local-FS only; network/synced FS is not
guaranteed.
Addressed by scope: removed the unused speculative helpers from this PR (will use in future PRs).
Fixed: hashes.json init now uses the same atomic JSON writer.

Verified with:
uv run pytest tests/test_locks.py tests/test_config.py tests/test_cli.py tests/ test_lint_cli.py

@gwokhou

gwokhou commented Jun 14, 2026 •

edited

Loading

Copy link

Copy Markdown

Contributor Author

Tested this together with #15 (SQLite registry) on current main: they merge cleanly (both touch state.py/cli.py but in non-overlapping regions) and the full suite passes — 705 passed on the combined branch. Also ran it through a real 165-doc ingest. The atomic writes + ingest.lock fixed a hashes.json/index.md read-modify-write race I hit when two ingest processes touched the same KB. 👍

@cnndabbler Thanks for testing this with #15 and a real 165-doc ingest. Very helpful validation. I’ve rebased and updated this PR while keeping the scope limited to the locking/atomic-write boundary, so #15 should still be able to layer on top cleanly.

@gwokhou gwokhou requested a review from KylinMountain

June 14, 2026 08:55

@KylinMountain


 Merge main into pr/kb-mutation-locking (resolve cli.py import conflict)

40a64fb

@KylinMountain

KylinMountain commented Jun 14, 2026

Copy link

Copy Markdown

Collaborator

Thanks for the quick turnaround, @gwokhou — all 10 review points are addressed and the lock / atomic-write implementation looks solid. Since #96 (the doc_name collision fix) just landed on main, your branch picked up a small conflict in openkb/cli.py, so I merged latest main in and resolved it for you (_registry_path now sits alongside your locks imports, and #96's add_single_file changes live in your renamed _add_single_file_locked). Full suite green —728 passed. 🙏

I'll go ahead and merge this PR — really appreciate your contribution!

KylinMountain

KylinMountain approved these changes

Jun 14, 2026

View reviewed changes

@KylinMountain KylinMountain left a comment

Copy link

Copy Markdown

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@KylinMountain KylinMountain merged commit 261fab4 into VectifyAI:main

Jun 14, 2026

Labels

None yet

3 participants

@gwokhou @cnndabbler @KylinMountain

Conversation

@gwokhou gwokhou commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Changes

Notes

Testing

Uh oh!

@KylinMountain KylinMountain left a comment

Choose a reason for hiding this comment

Review: KB mutation locks & atomic state writes

🔴 1. watch holds the exclusive lock for its entire lifetime

🔴 2. chat bypasses the lock protocol entirely

🟠 3. register_kb global-config read-modify-write is unguarded (wrong lock altitude)

🟠 4. lint holds the exclusive lock across a multi-minute LLM run

🟡 5. atomic_write_* never fsyncs the parent directory after os.replace

🟡 6. mkstemp silently tightens file permissions to 0600

🟡 7. Shared read lock is serialized within a process

🟡 8. fcntl.flock is silently ineffective on network filesystems

⚪ 9. Dead code + divergent KB-detection predicate

⚪ 10. hashes.json format flip on first write

Uh oh!

cnndabbler commented Jun 9, 2026

Uh oh!

gwokhou commented Jun 14, 2026

Uh oh!

gwokhou commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KylinMountain commented Jun 14, 2026

Uh oh!

@KylinMountain KylinMountain left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

@gwokhou gwokhou commented Jun 3, 2026 •

edited

Loading

🔴 1. `watch` holds the exclusive lock for its entire lifetime

🔴 2. `chat` bypasses the lock protocol entirely

🟠 3. `register_kb` global-config read-modify-write is unguarded (wrong lock altitude)

🟠 4. `lint` holds the exclusive lock across a multi-minute LLM run

🟡 5. `atomic_write_*` never fsyncs the parent directory after `os.replace`

🟡 6. `mkstemp` silently tightens file permissions to 0600

🟡 8. `fcntl.flock` is silently ineffective on network filesystems

⚪ 10. `hashes.json` format flip on first write

gwokhou commented Jun 14, 2026 •

edited

Loading