-
Notifications
You must be signed in to change notification settings - Fork 230
Conversation
a56ee15 to
9436ad6
Compare
71f9b4f to
9124f51
Compare
KylinMountain
commented
May 12, 2026
Code review
Found 2 issues:
run_lintwas not migrated toget_registry()and still readshashes.jsondirectly. With SQLite as the new default backend,hashes.jsonis never created, sohashesis always{}andopenkb lintexits with "Nothing to lint — no documents indexed yet" regardless of how many documents are actually indexed.print_list(line 602) andprint_status(line 689) were migrated correctly — only this call site was missed.
Lines 541 to 551 in 9124f51
- JSON-to-SQLite migration is not atomic.
_init_db()creates and commits the empty schema file before_migrate_from_json()runs. If the process is killed (SIGKILL, OOM, power loss, disk full) between those two steps, the DB file exists on disk but is empty. On the next startup,should_migrate = migrate_from is not None and not path.exists()evaluates toFalse, so migration is never retried — every entry inhashes.jsonis silently abandoned andopenkbthinks the KB has never been ingested. The JSON file is preserved on disk but never read again. Fix: do_init_db+ insert in a single transaction, or use a sentinel (schema_versionrow, completion marker) instead of file existence to gate migration.
Lines 82 to 94 in 9124f51
Coordination note (not an issue, but worth flagging): PR #30 (still open) adds get_by_path and remove_by_doc_name to HashRegistry and calls them from cli.py / converter.py. DbRegistry here does not implement either method, so whichever PR lands second will need to add them to the other backend or users on storage_backend: sqlite will hit AttributeError on re-add.
🤖 Generated with Claude Code
- If this code review was useful, please react with 👍. Otherwise, react with 👎.
review 反馈:当 storage_backend 为 sqlite(默认)时,hashes.json 不存在, 导致 lint 总是误判为无文档。对齐 print_list 和 print_status 的已有实现。
- 用 schema_meta 完成标记替代 DB 文件存在性判断,避免中断后永不重试 - get_registry() 只要 hashes.json 存在即传递 migrate_from,让 DbRegistry 自主判断 - 新增 get_by_path / remove_by_doc_name 到 HashRegistry 与 DbRegistry, 防止后续 PR VectifyAI#30 合并时 SQLite 后端出现 AttributeError
9124f51 to
8cb4522
Compare
KylinMountain
commented
May 14, 2026
Code review
Found 1 issue:
-
The two prior issues (
run_lintregistry usage, migration atomicity) are addressed. However, the newget_by_path/remove_by_doc_namemethods that commit8cb4522claims as "预兼容 PR fix: hashing to avoid doc with duplicate name to collide #30 接口" implement the wrong semantics — they share method names with PR fix: hashing to avoid doc with duplicate name to collide #30 but match on different fields, so once PR fix: hashing to avoid doc with duplicate name to collide #30 lands, neither backend will work as PR fix: hashing to avoid doc with duplicate name to collide #30 's callers expect.get_by_pathmatchesmetadata["path"]ormetadata["name"]. PR fix: hashing to avoid doc with duplicate name to collide #30 stores raw-file lookups inmetadata["raw_path"]and source-file lookups inmetadata["source_path"], and callsregistry.get_by_path(path_key)with values that should match those. PR Add SQLite-backed registry with JSON migration support #15 's implementation never checks those keys, so the lookup returnsNoneand PR fix: hashing to avoid doc with duplicate name to collide #30 's caller silently regeneratesdoc_name, creating duplicate wiki output.remove_by_doc_namematchesmetadata["name"]. PR fix: hashing to avoid doc with duplicate name to collide #30 stores the hash-suffixed slug (e.g.report-abc12345) inmetadata["doc_name"]and callsremove_by_doc_name(slug).nameholds the original filename (report.pdf), so the match never fires and stale entries are never removed.- Both backends have the same bug (identical implementations in
HashRegistryandDbRegistry), and PR fix: hashing to avoid doc with duplicate name to collide #30 's added tests checkname-based removal, which the wrong implementation happens to satisfy — so test green doesn't catch the divergence.
HashRegistry implementations:
Lines 55 to 74 in 8cb4522
def get_by_path(self, path: str) -> dict | None:"""Return metadata for the first entry whose 'path' or 'name' matches."""for metadata in self._data.values():if metadata.get("path") == path or metadata.get("name") == path:return metadatareturn Nonedef remove_by_doc_name(self, doc_name: str) -> bool:"""Remove the first entry whose metadata 'name' matches doc_name.Returns True if an entry was removed, False otherwise."""for file_hash, metadata in list(self._data.items()):if metadata.get("name") == doc_name:del self._data[file_hash]self._persist()return Truereturn FalseDbRegistry implementations:
Lines 236 to 266 in 8cb4522
def get_by_path(self, path: str) -> dict | None:"""Return metadata for the first entry whose 'path' or 'name' matches."""with self._connect() as conn:rows = conn.execute("SELECT metadata_json FROM registry").fetchall()for (metadata_json,) in rows:metadata = json.loads(metadata_json)if metadata.get("path") == path or metadata.get("name") == path:return metadatareturn Nonedef remove_by_doc_name(self, doc_name: str) -> bool:"""Remove the first entry whose metadata 'name' matches doc_name.Returns True if an entry was removed, False otherwise."""with self._connect() as conn:rows = conn.execute("SELECT file_hash, metadata_json FROM registry").fetchall()for file_hash, metadata_json in rows:metadata = json.loads(metadata_json)if metadata.get("name") == doc_name:conn.execute("DELETE FROM registry WHERE file_hash = ?",(file_hash,),)return Truereturn False@staticmethodFix: align the matching keys with PR fix: hashing to avoid doc with duplicate name to collide #30 —
get_by_pathshould checkpath,raw_path, andsource_path;remove_by_doc_nameshould matchmetadata["doc_name"]and remove all matching entries (PR fix: hashing to avoid doc with duplicate name to collide #30 expects multi-delete withNonereturn).
🤖 Generated with Claude Code
- If this code review was useful, please react with 👍. Otherwise, react with 👎.
KylinMountain
commented
May 14, 2026
A few smaller follow-ups from the same pass that didn't make it into the main review — none are blocking, just worth tracking:
Migration sentinel write window. _migrate_from_json and _mark_migration_complete are two separate _connect() transactions. If the process dies after the migration rows commit but before the sentinel row commits, data is safely migrated, but _is_migration_complete() returns False forever and _is_empty() returns False, so the guard not _is_migration_complete() and _is_empty() is permanently False — the sentinel is never written and every subsequent startup pays the cost of reading schema_meta + counting registry rows just to skip migration. Combining both writes into a single _connect() block closes the window.
Lines 100 to 116 in 8cb4522
Backend toggle silently drops JSON-only entries. If a user is on sqlite (sentinel set), edits config to json and adds documents (entries land in hashes.json only), then switches back to sqlite, _is_migration_complete() returns True from step 1's sentinel and the new JSON entries are silently ignored — never imported, no warning. Either re-import when JSON mtime exceeds the sentinel's timestamp, or warn the user when JSON has entries not present in SQLite.
Missing DbRegistry coverage for the new methods. tests/test_state.py and tests/test_db_registry.py exercise get_by_path / remove_by_doc_name against HashRegistry only (or via the wrong matching field, as in the main review). Worth adding direct DbRegistry tests that pass realistic PR #30 inputs (raw_path for the path lookup; a hash slug like report-abc12345 for removal) — those would have caught the semantic mismatch flagged in the main review.
修复 JSON 到 SQLite 迁移完成标记的原子写入 对齐 get_by_path 和 remove_by_doc_name 与 PR VectifyAI#30 的字段契约 补充 HashRegistry 与 DbRegistry 的回归测试
kdush
commented
May 23, 2026
Addressed the review feedback in 3451b79.
Changes:
- Made JSON-to-SQLite migration completion marker write atomic with the migrated rows, with rollback on failure.
- Aligned
HashRegistryandDbRegistrywith PR fix: hashing to avoid doc with duplicate name to collide #30 semantics:get_by_path()now checkspath,raw_path, andsource_path.remove_by_doc_name()now matchesdoc_name, deletes all matching entries, and returnsNone.
- Added regression coverage for both JSON and SQLite registry backends.
Validation:
pytest tests/test_state.py tests/test_db_registry.py tests/test_migration.py tests/test_cli.py tests/test_config_storage_backend.py tests/test_lint_cli.py→ 52 passedpytest→ 254 passed, 1 warningruff check openkb/state.py tests/test_state.py tests/test_db_registry.py→ passed
同步 origin/main 并解决 state、CLI remove 和相关测试冲突 保持 SQLite registry 默认 backend 与上游 remove 流程兼容
kdush
commented
May 23, 2026
Resolved the merge conflicts with origin/main in e2260a0.
Notes:
- Kept the SQLite registry fixes from this PR while preserving upstream
remove_by_hashsupport. - Updated
openkb removeto useget_registry()with the configured storage backend, matching the SQLite default backend. - Adjusted remove tests to read through the registry abstraction instead of assuming
hashes.jsonwhen SQLite is active.
Validation:
pytest tests/test_state.py tests/test_db_registry.py tests/test_cli.py tests/test_remove.py→ 105 passedpytest→ 555 passed, 5 warningsruff check openkb/state.py tests/test_state.py tests/test_db_registry.py tests/test_cli.py tests/test_remove.py→ passed
PR is now mergeable on GitHub.
kdush
commented
May 30, 2026
Synced to the latest main (ad7d146) — the only overlap was openkb/cli.py, which merged cleanly: this PR's run_lint → get_registry() fix is preserved and now sits correctly on top of the new openkb/lint.py module from #68.
Status recap — all earlier review feedback is addressed and there are no open items on my side:
run_lintreads throughget_registry()(no more directhashes.jsonaccess under the SQLite default).- JSON→SQLite migration completion marker is written atomically with the migrated rows, with rollback on failure.
get_by_pathcheckspath/raw_path/source_path;remove_by_doc_namematchesdoc_name, deletes all matches, returnsNone— aligned with fix: hashing to avoid doc with duplicate name to collide #30 's semantics on both backends, with regression coverage.
Validation on the merged branch:
uv run pytest -q→ 646 passed
One coordination question: how would you like to sequence this with #30, given the shared get_by_path / remove_by_doc_name contract? Happy to rebase on top of whichever lands first. Could I get a re-review when you have a moment? 🙏
KylinMountain
commented
Jun 1, 2026
@kdush 建议先用cc先review 没问题了,我在帮你看哈。。。我找cc review了还是有不少问题
...ackend # Conflicts: # README.md
cnndabbler
commented
Jun 9, 2026
Adopted this alongside #86 (mutation locks) on current main — merges cleanly, 705 tests pass on the combined branch. The JSON→SQLite migration worked correctly on a real populated registry (165 docs migrated to hashes.db, dedup verified, old hashes.json preserved). Composes well with #86's locking. 👍
Uh oh!
There was an error while loading. Please reload this page.
Summary
Add SQLite-backed registry as the default storage backend, with automatic JSON migration support.
Changes
DbRegistryclass with SQLite backend, WAL mode, and JSON migrationstorage_backendconfig option (sqlite | json)get_registry()factoryTesting
hashes.db,hashes.db-wal,hashes.db-shmBackward Compatibility
storage_backend: jsonstill works for existing setupshashes.jsontohashes.dbwhen switching to SQLitehashes.jsonpreserved after migration for safety