Add SQLite-backed registry with JSON migration support #15

run_lint was not migrated to get_registry() and still reads hashes.json directly. With SQLite as the new default backend, hashes.json is never created, so hashes is always {} and openkb lint exits with "Nothing to lint — no documents indexed yet" regardless of how many documents are actually indexed. print_list (line 602) and print_status (line 689) were migrated correctly — only this call site was missed.

OpenKB/openkb/cli.py

Lines 541 to 551 in 9124f51

    # Skip lint entirely when the KB has no indexed documents
    hashes_file = openkb_dir / "hashes.json"
    if hashes_file.exists():
        hashes = json.loads(hashes_file.read_text(encoding="utf-8"))
    else:
        hashes = {}
    if not hashes:
        click.echo("Nothing to lint — no documents indexed yet. Run `openkb add` first.")
        return

JSON-to-SQLite migration is not atomic. _init_db() creates and commits the empty schema file before _migrate_from_json() runs. If the process is killed (SIGKILL, OOM, power loss, disk full) between those two steps, the DB file exists on disk but is empty. On the next startup, should_migrate = migrate_from is not None and not path.exists() evaluates to False, so migration is never retried — every entry in hashes.json is silently abandoned and openkb thinks the KB has never been ingested. The JSON file is preserved on disk but never read again. Fix: do _init_db + insert in a single transaction, or use a sentinel (schema_version row, completion marker) instead of file existence to gate migration.

OpenKB/openkb/state.py

Lines 82 to 94 in 9124f51

    def __init__(self, path: Path, migrate_from: Path | None = None) -> None:
        """Initialize DbRegistry.

        Args:
            path: Path to SQLite database file.
            migrate_from: Optional path to JSON file to migrate from.
                Migration only happens if DB doesn't exist yet.
        """
        self._path = path
        should_migrate = migrate_from is not None and not path.exists()
        self._init_db()
        if should_migrate:
            self._migrate_from_json(migrate_from)
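
To make the suggested fix concrete, here is a minimal sketch of sentinel-gated migration, not the PR's actual code. It assumes the `registry(file_hash, metadata_json)` table seen elsewhere in this PR; the `schema_meta` layout, the `json_migrated` key, and the `open_registry` helper are hypothetical names chosen for illustration.

```python
from __future__ import annotations

import json
import sqlite3
from pathlib import Path


def open_registry(db_path: Path, migrate_from: Path | None = None) -> sqlite3.Connection:
    """Open the DB and import hashes.json once, gated by a sentinel row."""
    conn = sqlite3.connect(db_path)
    # Schema creation is idempotent, so an interrupted first run is harmless.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS registry "
        "(file_hash TEXT PRIMARY KEY, metadata_json TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    conn.commit()

    # Gate on the sentinel row (hypothetical key), not on whether the file exists.
    done = conn.execute(
        "SELECT 1 FROM schema_meta WHERE key = 'json_migrated'"
    ).fetchone()
    if done is None and migrate_from is not None and migrate_from.exists():
        entries = json.loads(migrate_from.read_text(encoding="utf-8"))
        # `with conn:` commits on success and rolls back on any exception, so
        # the rows and the sentinel become visible together or not at all.
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO registry VALUES (?, ?)",
                [(h, json.dumps(meta)) for h, meta in entries.items()],
            )
            conn.execute(
                "INSERT OR REPLACE INTO schema_meta VALUES ('json_migrated', '1')"
            )
    return conn
```

With this shape, a process killed after schema creation leaves a DB without the sentinel, so the next startup retries the import instead of skipping it.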

Coordination note (not an issue, but worth flagging): PR #30 (still open) adds get_by_path and remove_by_doc_name to HashRegistry and calls them from cli.py / converter.py. DbRegistry here does not implement either method, so whichever PR lands second will need to add them to the other backend or users on storage_backend: sqlite will hit AttributeError on re-add.
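
One way to catch this kind of drift early is a shared structural contract that both backends are checked against. The sketch below is only an illustration: `RegistryContract`, `missing_methods`, and the two stand-in backend classes are hypothetical, and the `None` return type for `remove_by_doc_name` is an assumption about PR #30's contract.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class RegistryContract(Protocol):
    """Methods every registry backend is expected to expose (assumed)."""

    def get_by_path(self, path: str) -> dict | None: ...

    def remove_by_doc_name(self, doc_name: str) -> None: ...


class CompleteBackend:
    """Stand-in for a backend that implements both methods."""

    def get_by_path(self, path: str) -> dict | None:
        return None

    def remove_by_doc_name(self, doc_name: str) -> None:
        return None


class IncompleteBackend:
    """Stand-in for a backend that was not updated."""

    def get_by_path(self, path: str) -> dict | None:
        return None


def missing_methods(backend: object) -> list[str]:
    """List the contract methods the backend does not provide."""
    required = ("get_by_path", "remove_by_doc_name")
    return [name for name in required if not callable(getattr(backend, name, None))]
```

A runtime `isinstance` check against a `Protocol` only verifies that the method names exist. It says nothing about which metadata fields they match on, so it complements behavioral tests rather than replacing them.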

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

kdush added 5 commits

May 13, 2026 11:14

@kdush


 feat: add SQLite-backed registry

280fce1

@kdush


 feat: add SQLite backend and migration tests

1397f7f

@kdush


 docs: document storage backend and migration

ed49b0a

@kdush


 fix(cli): run_lint now reads the registry via get_registry() instead of accessing hashes.json directly

3be4233

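Review feedback: when storage_backend is sqlite (the default), hashes.json does not exist,
so lint always wrongly concluded there were no documents. Aligned with the existing print_list and print_status implementations.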

@kdush


 fix(state): make JSON→SQLite migration atomic/retryable and pre-align with the PR VectifyAI#30 interface

8cb4522

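- Replace the DB-file-existence check with a schema_meta completion marker, so an interrupted migration is retried instead of being skipped forever
- get_registry() passes migrate_from whenever hashes.json exists and lets DbRegistry decide whether to migrate
- Add get_by_path / remove_by_doc_name to HashRegistry and DbRegistry,
 to prevent an AttributeError on the SQLite backend once PR VectifyAI#30 merges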

@kdush kdush force-pushed the feat/sqlite-storage-backend branch from 9124f51 to 8cb4522

May 13, 2026 03:21

@KylinMountain

KylinMountain commented May 14, 2026

Collaborator

Code review

Found 1 issue:

The two prior issues (run_lint registry usage, migration atomicity) are addressed. However, the new get_by_path / remove_by_doc_name methods that commit 8cb4522 describes as pre-aligned with the PR #30 interface ("预兼容 PR #30 接口") implement the wrong semantics: they share method names with PR #30 but match on different fields, so once PR #30 lands, neither backend will work as PR #30's callers expect.
- get_by_path matches metadata["path"] or metadata["name"]. PR #30 stores raw-file lookups in metadata["raw_path"] and source-file lookups in metadata["source_path"], and calls registry.get_by_path(path_key) with values that should match those. PR #15's implementation never checks those keys, so the lookup returns None and PR #30's caller silently regenerates doc_name, creating duplicate wiki output.
- remove_by_doc_name matches metadata["name"]. PR #30 stores the hash-suffixed slug (e.g. report-abc12345) in metadata["doc_name"] and calls remove_by_doc_name(slug). name holds the original filename (report.pdf), so the match never fires and stale entries are never removed.
- Both backends have the same bug (identical implementations in HashRegistry and DbRegistry), and PR #30's added tests check name-based removal, which the wrong implementation happens to satisfy, so a green test run doesn't catch the divergence.
HashRegistry implementations:

OpenKB/openkb/state.py

Lines 55 to 74 in 8cb4522

    def get_by_path(self, path: str) -> dict | None:
        """Return metadata for the first entry whose 'path' or 'name' matches."""
        for metadata in self._data.values():
            if metadata.get("path") == path or metadata.get("name") == path:
                return metadata
        return None

    def remove_by_doc_name(self, doc_name: str) -> bool:
        """Remove the first entry whose metadata 'name' matches doc_name.

        Returns True if an entry was removed, False otherwise.
        """
        for file_hash, metadata in list(self._data.items()):
            if metadata.get("name") == doc_name:
                del self._data[file_hash]
                self._persist()
                return True
        return False

DbRegistry implementations:

OpenKB/openkb/state.py

Lines 236 to 266 in 8cb4522

    def get_by_path(self, path: str) -> dict | None:
        """Return metadata for the first entry whose 'path' or 'name' matches."""
        with self._connect() as conn:
            rows = conn.execute("SELECT metadata_json FROM registry").fetchall()
            for (metadata_json,) in rows:
                metadata = json.loads(metadata_json)
                if metadata.get("path") == path or metadata.get("name") == path:
                    return metadata
        return None

    def remove_by_doc_name(self, doc_name: str) -> bool:
        """Remove the first entry whose metadata 'name' matches doc_name.

        Returns True if an entry was removed, False otherwise.
        """
        with self._connect() as conn:
            rows = conn.execute(
                "SELECT file_hash, metadata_json FROM registry"
            ).fetchall()
            for file_hash, metadata_json in rows:
                metadata = json.loads(metadata_json)
                if metadata.get("name") == doc_name:
                    conn.execute(
                        "DELETE FROM registry WHERE file_hash = ?",
                        (file_hash,),
                    )
                    return True
        return False

    @staticmethod

Fix: align the matching keys with PR #30. get_by_path should check path, raw_path, and source_path; remove_by_doc_name should match metadata["doc_name"] and remove all matching entries (PR #30 expects multi-delete with a None return).
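
As a rough sketch of what the aligned semantics could look like for the dict-backed registry: the functions below operate on a plain `dict` of `file_hash -> metadata` rather than on `HashRegistry` itself, and `PATH_KEYS` is a hypothetical constant. Persistence is left out.

```python
from __future__ import annotations

# Metadata keys that may hold a path, per the field contract described above.
PATH_KEYS = ("path", "raw_path", "source_path")


def get_by_path(data: dict[str, dict], path: str) -> dict | None:
    """Return the first entry whose path, raw_path, or source_path equals `path`."""
    for metadata in data.values():
        if any(metadata.get(key) == path for key in PATH_KEYS):
            return metadata
    return None


def remove_by_doc_name(data: dict[str, dict], doc_name: str) -> None:
    """Delete every entry whose metadata['doc_name'] equals `doc_name`."""
    # Collect keys first so the dict is not mutated while iterating over it.
    stale = [h for h, meta in data.items() if meta.get("doc_name") == doc_name]
    for file_hash in stale:
        del data[file_hash]
```

Note that the original filename in `name` no longer participates in either lookup, which is the point of the fix: a slug such as report-abc12345 and a filename such as report.pdf are different identifiers.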

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

@KylinMountain

KylinMountain commented May 14, 2026

Collaborator

A few smaller follow-ups from the same pass that didn't make it into the main review — none are blocking, just worth tracking:

Migration sentinel write window. _migrate_from_json and _mark_migration_complete are two separate _connect() transactions. If the process dies after the migration rows commit but before the sentinel row commits, data is safely migrated, but _is_migration_complete() returns False forever and _is_empty() returns False, so the guard not _is_migration_complete() and _is_empty() is permanently False — the sentinel is never written and every subsequent startup pays the cost of reading schema_meta + counting registry rows just to skip migration. Combining both writes into a single _connect() block closes the window.

OpenKB/openkb/state.py

Lines 100 to 116 in 8cb4522

    def __init__(self, path: Path, migrate_from: Path | None = None) -> None:
        """Initialize DbRegistry.

        Args:
            path: Path to SQLite database file.
            migrate_from: Optional path to JSON file to migrate from.
                Migration is retried if not previously completed
                and the registry table is empty.
        """
        self._path = path
        self._init_db()
        if migrate_from is not None and migrate_from.exists():
            if not self._is_migration_complete() and self._is_empty():
                self._migrate_from_json(migrate_from)
                self._mark_migration_complete()

Backend toggle silently drops JSON-only entries. If a user is on sqlite (sentinel set), edits config to json and adds documents (entries land in hashes.json only), then switches back to sqlite, _is_migration_complete() returns True from step 1's sentinel and the new JSON entries are silently ignored — never imported, no warning. Either re-import when JSON mtime exceeds the sentinel's timestamp, or warn the user when JSON has entries not present in SQLite.
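
The "warn the user" option could be sketched roughly as follows. This assumes hashes.json is a JSON object keyed by file hash and that the SQLite table is `registry(file_hash, ...)`; the helper name `json_only_hashes` is hypothetical.

```python
from __future__ import annotations

import json
import sqlite3
from pathlib import Path


def json_only_hashes(json_path: Path, conn: sqlite3.Connection) -> set[str]:
    """Return file hashes present in hashes.json but missing from SQLite."""
    if not json_path.exists():
        return set()
    # Iterating a dict yields its keys, i.e. the file hashes.
    json_hashes = set(json.loads(json_path.read_text(encoding="utf-8")))
    db_hashes = {row[0] for row in conn.execute("SELECT file_hash FROM registry")}
    return json_hashes - db_hashes
```

A caller could run this at startup when the sentinel is already set and print a warning (or offer a re-import) whenever the returned set is non-empty. Comparing keys avoids relying on file mtimes, which can change without the content changing.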

Missing DbRegistry coverage for the new methods. tests/test_state.py and tests/test_db_registry.py exercise get_by_path / remove_by_doc_name against HashRegistry only (or via the wrong matching field, as in the main review). Worth adding direct DbRegistry tests that pass realistic PR #30 inputs (raw_path for the path lookup; a hash slug like report-abc12345 for removal) — those would have caught the semantic mismatch flagged in the main review.
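
A backend-agnostic contract check along these lines might look like the sketch below. `MiniDbRegistry` is a hypothetical in-memory stand-in for `DbRegistry`, with just enough behavior to exercise the contract, and the path values are made-up examples.

```python
from __future__ import annotations

import json
import sqlite3

PATH_KEYS = ("path", "raw_path", "source_path")


class MiniDbRegistry:
    """Hypothetical SQLite-backed stand-in, not the real DbRegistry."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE registry "
            "(file_hash TEXT PRIMARY KEY, metadata_json TEXT NOT NULL)"
        )

    def add(self, file_hash: str, metadata: dict) -> None:
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO registry VALUES (?, ?)",
                (file_hash, json.dumps(metadata)),
            )

    def get_by_path(self, path: str) -> dict | None:
        rows = self._conn.execute("SELECT metadata_json FROM registry").fetchall()
        for (metadata_json,) in rows:
            metadata = json.loads(metadata_json)
            if any(metadata.get(key) == path for key in PATH_KEYS):
                return metadata
        return None

    def remove_by_doc_name(self, doc_name: str) -> None:
        rows = self._conn.execute(
            "SELECT file_hash, metadata_json FROM registry"
        ).fetchall()
        stale = [(h,) for h, mj in rows if json.loads(mj).get("doc_name") == doc_name]
        with self._conn:  # delete all matches in one transaction
            self._conn.executemany("DELETE FROM registry WHERE file_hash = ?", stale)


def check_contract(registry) -> None:
    """Exercise any backend with slug and raw_path inputs of the kind PR #30 uses."""
    registry.add("h1", {"name": "report.pdf", "doc_name": "report-abc12345",
                        "raw_path": "raw/report.pdf"})
    registry.add("h2", {"name": "report.pdf", "doc_name": "report-abc12345",
                        "source_path": "docs/report.pdf"})
    assert registry.get_by_path("raw/report.pdf")["doc_name"] == "report-abc12345"
    assert registry.get_by_path("report.pdf") is None  # filename is not a path key
    assert registry.remove_by_doc_name("report.pdf") is None  # must not delete
    assert registry.get_by_path("raw/report.pdf") is not None
    assert registry.remove_by_doc_name("report-abc12345") is None
    assert registry.get_by_path("raw/report.pdf") is None
    assert registry.get_by_path("docs/report.pdf") is None
```

Running the same `check_contract` against each backend keeps the two implementations from drifting apart, and the filename-versus-slug assertions fail against the name-based matching described in the main review.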

@kdush


 fix(state): 🐛 align the SQLite registry with the semantics requested in review

3451b79

Make the write of the JSON-to-SQLite migration completion marker atomic
Align get_by_path and remove_by_doc_name with PR VectifyAI#30's field contract
Add regression tests for HashRegistry and DbRegistry

@kdush

kdush commented May 23, 2026

Author

Addressed the review feedback in 3451b79.

Changes:

Made JSON-to-SQLite migration completion marker write atomic with the migrated rows, with rollback on failure.
Aligned HashRegistry and DbRegistry with PR #30 semantics:
- get_by_path() now checks path, raw_path, and source_path.
- remove_by_doc_name() now matches doc_name, deletes all matching entries, and returns None.
Added regression coverage for both JSON and SQLite registry backends.

Validation:

pytest tests/test_state.py tests/test_db_registry.py tests/test_migration.py tests/test_cli.py tests/test_config_storage_backend.py tests/test_lint_cli.py → 52 passed
pytest → 254 passed, 1 warning
ruff check openkb/state.py tests/test_state.py tests/test_db_registry.py → passed

@kdush


 chore(merge): 🔧 merge upstream main to resolve PR conflicts

e2260a0

Sync with origin/main and resolve conflicts in state, CLI remove, and related tests
Keep the SQLite registry default backend compatible with the upstream remove flow

@kdush

kdush commented May 23, 2026

Author

Resolved the merge conflicts with origin/main in e2260a0.

Notes:

Kept the SQLite registry fixes from this PR while preserving upstream remove_by_hash support.
Updated openkb remove to use get_registry() with the configured storage backend, matching the SQLite default backend.
Adjusted remove tests to read through the registry abstraction instead of assuming hashes.json when SQLite is active.

Validation:

pytest tests/test_state.py tests/test_db_registry.py tests/test_cli.py tests/test_remove.py → 105 passed
pytest → 555 passed, 5 warnings
ruff check openkb/state.py tests/test_state.py tests/test_db_registry.py tests/test_cli.py tests/test_remove.py → passed

PR is now mergeable on GitHub.

@kdush


 Merge remote-tracking branch 'origin/main' into feat/sqlite-storage-b...

ad7d146

...ackend

@kdush

kdush commented May 30, 2026

Author

Synced to the latest main (ad7d146) — the only overlap was openkb/cli.py, which merged cleanly: this PR's run_lint → get_registry() fix is preserved and now sits correctly on top of the new openkb/lint.py module from #68.

Status recap — all earlier review feedback is addressed and there are no open items on my side:

run_lint reads through get_registry() (no more direct hashes.json access under the SQLite default).
JSON→SQLite migration completion marker is written atomically with the migrated rows, with rollback on failure.
get_by_path checks path / raw_path / source_path; remove_by_doc_name matches doc_name, deletes all matches, and returns None. Both backends are aligned with #30's semantics, with regression coverage.

Validation on the merged branch:

uv run pytest -q → 646 passed

One coordination question: how would you like to sequence this with #30, given the shared get_by_path / remove_by_doc_name contract? Happy to rebase on top of whichever lands first. Could I get a re-review when you have a moment? 🙏

@KylinMountain

KylinMountain commented Jun 1, 2026

Collaborator

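@kdush I'd suggest running a cc review first and getting it clean, then I'll take a look for you. I ran a cc review and it still turned up quite a few issues.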

@kdush


 Merge remote-tracking branch 'origin/main' into feat/sqlite-storage-b...

9e4a56f

...ackend
# Conflicts:
#	README.md

@cnndabbler cnndabbler mentioned this pull request

Jun 9, 2026

feat: add KB mutation locks and atomic state writes #86

Merged

@cnndabbler

cnndabbler commented Jun 9, 2026

Contributor

Adopted this alongside #86 (mutation locks) on current main — merges cleanly, 705 tests pass on the combined branch. The JSON→SQLite migration worked correctly on a real populated registry (165 docs migrated to hashes.db, dedup verified, old hashes.json preserved). Composes well with #86's locking. 👍

@kdush kdush commented Apr 11, 2026 • edited

Summary

Changes

Testing

Backward Compatibility

KylinMountain commented May 12, 2026

Code review

3 participants
