Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

feat: add KB mutation locks and atomic state writes#86

Merged
KylinMountain merged 4 commits into
VectifyAI:main from
gwokhou:pr/kb-mutation-locking
Jun 14, 2026
Merged

feat: add KB mutation locks and atomic state writes #86
KylinMountain merged 4 commits into
VectifyAI:main from
gwokhou:pr/kb-mutation-locking

Conversation

@gwokhou

@gwokhou gwokhou commented Jun 3, 2026
edited
Loading

Copy link
Copy Markdown
Contributor

Summary

This PR adds cooperative KB-level locking and atomic state/config writes.

It is a reliability hardening change for the current CLI/chat flows and a foundation for future concurrent ingestion / compilation work. The scope is intentionally limited to synchronization and safer persistence; concurrent ingestion and transactional recovery are left for follow-up PRs.

Why

Several OpenKB commands mutate shared KB files such as .openkb/hashes.json, .openkb/config.yaml, wiki/, raw/, and generated reports. If two OpenKB processes modify the same KB at the same time, or if a process is interrupted while writing state, the KB can be left in a partially written or inconsistent state.

This matters today for commands such as openkb add, openkb remove, openkb recompile, openkb watch, openkb lint --fix, and matching chat slash commands.

Changes

  • Add openkb.locks with KB-level advisory locks:
    • exclusive locks for mutation sections
    • shared locks for read-only sections
    • reentrant same-thread lock handling
    • in-process concurrent shared readers
    • explicit rejection of read-to-write lock upgrades
  • Document the local-filesystem/advisory nature of the fcntl.flock protocol.
  • Add atomic write helpers for text and JSON writes.
  • Preserve target file permissions on atomic replace and fsync the parent directory after os.replace.
  • Use atomic writes for:
    • KB config writes
    • global config writes
    • hash registry persistence
    • initial hashes.json seeding during openkb init
  • Guard global config read-modify-write with a global config lock.
  • Apply KB locks at the operation boundary:
    • add_single_file takes the mutation lock, so CLI add, watch, and chat /add are covered without locking the full watch lifetime.
    • lint --fix takes the mutation lock only while applying fixes.
    • run_lint takes a shared read lock while reading/running lint and a mutation lock only when writing the report/log.
    • chat /save, /status, /list, and query logging use the appropriate locks.
  • Keep unused speculative helpers out of this PR.

Notes

The lock is advisory and cooperative: it protects OpenKB commands that use these helpers. It does not make arbitrary external edits to the KB directory safe, and it does not guarantee cross-host safety on networked or synced filesystems where fcntl.flock may be unavailable or inconsistent.

Testing

  • uv run pytest tests/test_locks.py tests/test_config.py tests/test_cli.py tests/test_lint_cli.py

@KylinMountain KylinMountain left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: KB mutation locks & atomic state writes

The lock + atomic-write primitives are well built (clean reentrancy accounting, correct os.replace pattern, good tests). My concern is mostly where the locks are applied — a couple of high-impact gaps make the protection ineffective or actively disruptive in the most common usage. Findings ranked by severity.

🔴 1. watch holds the exclusive lock for its entire lifetime

watch is decorated @_with_kb_lock(exclusive=True), but watch_directory() blocks indefinitely (while observer.is_alive(): observer.join(...)). So the exclusive lock is held for the whole watch session — every other command, including read-only list/status (shared lock, blocked by the exclusive holder), is locked out until watch is killed.
→ Take the lock per-ingestion inside on_new_files (around each add_single_file), not around the whole command.

🔴 2. chat bypasses the lock protocol entirely

chat has no _with_kb_lock decorator, yet its REPL mutates the KB via /add (add_single_file, chat.py:495/500) and /lint (run_lint, chat.py:752). Those functions hold no internal lock — locking lives only in the CLI decorator. So a /add inside an open chat session races freely against a concurrent openkb add (which did take the exclusive lock): the two ingests interleave and one's concept-page / hashes.json writes silently clobber the other's. The exclusive lock on CLI add is worthless against this path.
→ Either lock the chat mutation operations, or push locking down into add_single_file/run_lint.

🟠 3. register_kb global-config read-modify-write is unguarded (wrong lock altitude)

register_kb does load_global_config() → append → save_global_config() on the shared ~/.config/openkb/global.yaml. The new locks are per-KB (.openkb/ingest.lock), and init/use aren't decorated at all. atomic_write_text makes each write un-torn but does not serialize the load→append→save sequence. Two concurrent init/use in different KBs → both read the same known_kbs, each appends its own path, last writer wins → the other registration (and default_kb) is silently lost. This is exactly the lost-update the PR title implies it prevents, but the lock is scoped to the wrong file. Consider a global-config lock (or pushing the lock inside register_kb).

🟠 4. lint holds the exclusive lock across a multi-minute LLM run

lint is exclusive, but without --fix it's read-only, and run_lint invokes an LLM knowledge-lint agent that can run for minutes — all while holding the write lock. Every list/status from another process blocks for that whole window. Use a shared lock for read-only lint, or take the write lock only around the --fix link rewrite and release before the LLM phase.

🟡 5. atomic_write_* never fsyncs the parent directory after os.replace

The temp file is fsync'd, but the directory entry (the rename) is not. On ext4/xfs a crash right after os.replace can lose the rename while the data survives — a reader after reboot sees the old content even though the call "succeeded". Atomicity holds; durability of the commit doesn't. Add an os.open(parent, O_DIRECTORY) + os.fsync after replace. (atomic_write_binary has the same gap.)

🟡 6. mkstemp silently tightens file permissions to 0600

tempfile.mkstemp creates the temp at 0600 and os.replace carries that onto the destination, ignoring umask. A config.yaml that was 0644 becomes owner-only after the first save_config/save_global_config rewrite, and every new file is 0600 regardless of umask — a behavior change from the old path.open("w"). chmod the temp to match the existing file's mode (or to 0o666 & ~umask) before replace.

🟡 7. Shared read lock is serialized within a process

The per-path RLock is held for the whole lock section, so two threads in the same process can never hold LOCK_SH concurrently — first-acquire from a second thread blocks on the RLock before reaching the OS shared lock. Cross-process readers are fine, so impact is limited today, but the "shared read lock" doesn't deliver shared semantics in-process (relevant if chat/agent ever reads concurrently across threads).

🟡 8. fcntl.flock is silently ineffective on network filesystems

KBs are often kept on NFS / Dropbox / iCloud Drive. flock may be a no-op there; the code records the lock as held and proceeds with no detection, so two hosts can mutate the same KB concurrently and corrupt state. Worth documenting the limitation, or detecting non-local backing stores.

⚪ 9. Dead code + divergent KB-detection predicate

maybe_kb_ingest_lock, maybe_kb_read_lock, and atomic_write_binary are defined but referenced nowhere. The decorator reimplements "lock only if a KB exists" inline using a different predicate (_find_kb_dir(...) is None) than maybe_* ((kb_dir/'.openkb').is_dir()). A future caller reaching for the conveniently-named maybe_* would get different behavior (no cwd-walk / global-default resolution) and could lock the wrong directory. Suggest deleting the unused helpers and keeping one source of truth.

⚪ 10. hashes.json format flip on first write

atomic_write_json appends a trailing newline, so _persist writes {...}\n, but init seeds the file via json.dumps({}) with no newline. The byte format flips on the first add() (spurious diff / checksum mismatch). Route the init seed through atomic_write_json too. (Minor related note: failed writes under SIGKILL/ENOSPC can leave hidden .<name>.*.tmp orphans in .openkb/ that nothing sweeps.)


#1 and #2 are the ones I'd block on — they make the locking ineffective (chat) or actively disruptive (watch) in normal use. The rest are incremental hardening.

gwokhou reacted with heart emoji

Copy link
Copy Markdown
Contributor

Tested this together with #15 (SQLite registry) on current main: they merge cleanly (both touch state.py/cli.py but in non-overlapping regions) and the full suite passes — 705 passed on the combined branch. Also ran it through a real 165-doc ingest. The atomic writes + ingest.lock fixed a hashes.json/index.md read-modify-write race I hit when two ingest processes touched the same KB. 👍

gwokhou reacted with heart emoji

@gwokhou gwokhou force-pushed the pr/kb-mutation-locking branch from 5c4cfbc to 563f3a9 Compare June 14, 2026 08:42

gwokhou commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

@KylinMountain Thanks for the review. I updated the PR as follows:

  1. Fixed: watch no longer holds the lock for the full lifetime.
  2. Fixed: chat /add, /lint, /save, /status, /list, and query logging now use
    the lock protocol.
  3. Fixed: register_kb global config RMW is now locked.
  4. Fixed: lint no longer holds an exclusive lock across the LLM run.
  5. Fixed: atomic replace now fsyncs the parent directory.
  6. Fixed: atomic writes now preserve/derive file mode correctly.
  7. Fixed: in-process shared reads can now run concurrently.
  8. Documented: fcntl.flock is advisory/local-FS only; network/synced FS is not
    guaranteed.
  9. Addressed by scope: removed the unused speculative helpers from this PR (will use in future PRs).
  10. Fixed: hashes.json init now uses the same atomic JSON writer.

Verified with:
uv run pytest tests/test_locks.py tests/test_config.py tests/test_cli.py tests/ test_lint_cli.py

gwokhou commented Jun 14, 2026
edited
Loading

Copy link
Copy Markdown
Contributor Author

Tested this together with #15 (SQLite registry) on current main: they merge cleanly (both touch state.py/cli.py but in non-overlapping regions) and the full suite passes — 705 passed on the combined branch. Also ran it through a real 165-doc ingest. The atomic writes + ingest.lock fixed a hashes.json/index.md read-modify-write race I hit when two ingest processes touched the same KB. 👍

@cnndabbler Thanks for testing this with #15 and a real 165-doc ingest. Very helpful validation. I’ve rebased and updated this PR while keeping the scope limited to the locking/atomic-write boundary, so #15 should still be able to layer on top cleanly.

Copy link
Copy Markdown
Collaborator

Thanks for the quick turnaround, @gwokhou — all 10 review points are addressed and the lock / atomic-write implementation looks solid. Since #96 (the doc_name collision fix) just landed on main, your branch picked up a small conflict in openkb/cli.py, so I merged latest main in and resolved it for you (_registry_path now sits alongside your locks imports, and #96's add_single_file changes live in your renamed _add_single_file_locked). Full suite green —728 passed. 🙏

I'll go ahead and merge this PR — really appreciate your contribution!

@KylinMountain KylinMountain left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@KylinMountain KylinMountain merged commit 261fab4 into VectifyAI:main Jun 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@KylinMountain KylinMountain KylinMountain approved these changes

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /