feat(cli): import existing PageIndex Cloud indices via add --from-pageindex-cloud (closes #88) #97

Open
KylinMountain wants to merge 7 commits into main from feat/import-pageindex-cloud

KylinMountain (Collaborator) commented Jun 13, 2026 (edited)
Summary

Implements #88: imports a document that is already indexed in PageIndex Cloud into a local OpenKB knowledge base, with no local PDF and no re-processing:

openkb add --from-pageindex-cloud <DOC_ID>
  • Fetches structure + OCR'd page content from PageIndex Cloud by doc_id (get_document / get_page_content), bypassing the local convert → raw → page-count → col.add pipeline entirely.
  • Writes the same wiki artifacts as the local long-doc path (shared _write_long_doc_artifacts) and compiles concepts via compile_long_doc.
  • Registers a raw-less registry entry typed pageindex_cloud with a synthetic identity key pageindex-cloud:<doc_id> (sha256-keyed; re-import of the same doc-id is idempotent/skipped).
  • openkb add runs under the exclusive KB lock, so the cloud-import path is serialized like every other ingest.
  • openkb remove on an imported doc cleans up local artifacts only; the user's cloud corpus is never touched (the existing long_pdf gate already excludes the new type, as a regression test confirms).
  • Dependency pageindex switched to track git+https://github.com/VectifyAI/PageIndex.git@dev for the cloud API surface (lock pins a concrete commit). Follow-up: re-pin to an exact tag once a new PageIndex release is published.
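
The identity and idempotency rules in the bullets above can be sketched roughly as follows. This is an illustration, not the code in this PR: the function names, the dict-based registry, and the entry fields are all hypothetical, and the exact way sha256 is applied to the identity key is an assumption. Only the pageindex-cloud:<doc_id> key format and the pageindex_cloud / long_pdf type names come from the description.

```python
import hashlib

CLOUD_TYPE = "pageindex_cloud"  # type name from the PR description


def cloud_identity_key(doc_id: str) -> str:
    """Synthetic identity for a document that has no local raw file."""
    return f"pageindex-cloud:{doc_id}"


def registry_hash(identity: str) -> str:
    """Assumption: the registry is keyed by the sha256 hex digest of the identity."""
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


def register_cloud_doc(registry: dict, doc_id: str) -> bool:
    """Add a raw-less entry; return False if this doc_id was already imported."""
    key = registry_hash(cloud_identity_key(doc_id))
    if key in registry:
        return False  # idempotent: a re-import is skipped
    # Hypothetical entry shape: no raw file is recorded for cloud imports.
    registry[key] = {"type": CLOUD_TYPE, "doc_id": doc_id, "raw": None}
    return True


def needs_pageindex_cleanup(entry: dict) -> bool:
    """Assumed shape of the remove gate: only long_pdf entries qualify."""
    return entry.get("type") == "long_pdf"
```

Under these assumptions, calling register_cloud_doc twice with the same doc_id leaves a single entry, and needs_pageindex_cleanup returns False for it, so a remove would only delete local artifacts.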

Notes

Test Plan

Automated (full suite green — 742 passed on the rebased branch):

  • uv run --extra dev pytest -q
  • Targeted: uv run --extra dev pytest tests/test_converter.py tests/test_indexer.py tests/test_add_command.py tests/test_remove.py -v

Manual (requires PAGEINDEX_API_KEY and a real cloud doc-id):

  • export PAGEINDEX_API_KEY=... then openkb add --from-pageindex-cloud <DOC_ID> → wiki/summaries + wiki/sources written, concepts compiled, doc appears in openkb list.
  • Re-run the same command → [SKIP] Already imported (idempotent).
  • openkb add foo.pdf --from-pageindex-cloud X → errors "not both"; openkb add with neither → "Provide a PATH...".
  • Unset PAGEINDEX_API_KEY then import → clear error, nothing written.
  • openkb remove <doc> on the imported doc → local artifacts removed, cloud doc still present in PageIndex.
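
The argument and API-key checks exercised by the manual steps above could look something like this minimal sketch. It is not the CLI code from this PR: resolve_add_source, AddUsageError, and the error strings are hypothetical placeholders (the real messages are only partially quoted above and are not reproduced here). The point it illustrates is validating everything before any file is written.

```python
import os
from typing import Mapping, Optional, Tuple


class AddUsageError(Exception):
    """Hypothetical error type for invalid `openkb add` invocations."""


def resolve_add_source(
    path: Optional[str],
    cloud_doc_id: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Decide which ingest path to take, failing before anything is written."""
    env = os.environ if env is None else env
    if path and cloud_doc_id:
        # Placeholder wording; the real CLI reports a "not both" error.
        raise AddUsageError("give a PATH or --from-pageindex-cloud, not both")
    if not path and not cloud_doc_id:
        # Placeholder wording for the "neither given" case.
        raise AddUsageError("no source given")
    if cloud_doc_id:
        if not env.get("PAGEINDEX_API_KEY"):
            raise AddUsageError("PAGEINDEX_API_KEY is not set")
        return ("cloud", cloud_doc_id)
    return ("local", path)
```

Passing env explicitly keeps the check testable without touching the real environment; for example, resolve_add_source(None, "X", env={}) raises AddUsageError, while the same call with a key set returns ("cloud", "X").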

@KylinMountain KylinMountain changed the base branch from fix/doc-name-collision to main June 14, 2026 11:23