Mitigate PageIndex file-descriptor buildup during long PDF indexing #44

Open. plasma16 wants to merge 8 commits into VectifyAI:main from plasma16:fix/pageindex-fd-cleanup

Conversation

plasma16 commented May 7, 2026 (edited)

Purpose

Reduce risk of [Errno 24] Too many open files during long-PDF indexing with retries.

Changes

  • Add best-effort PageIndex client cleanup helper in openkb/indexer.py.
  • Create a fresh PageIndexClient per retry attempt.
  • Explicitly close backend storage after failed attempts and at function exit.
  • Trigger gc.collect() between failed attempts to accelerate descriptor release in long runs.
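
The four changes above can be sketched as a single retry loop. This is a minimal illustration of the pattern, not the code in openkb/indexer.py: `close_quietly`, `index_with_retries`, `make_client`, and `index` are hypothetical names, and the sketch assumes the client exposes a `close()` method, which the description does not confirm for PageIndexClient.

```python
import gc
from typing import Callable, Optional


def close_quietly(client: object) -> None:
    """Best-effort cleanup: call close() if the client has one; never raise."""
    close = getattr(client, "close", None)  # assumption: cleanup hook is close()
    if callable(close):
        try:
            close()
        except Exception:
            pass  # a cleanup failure must not mask the original indexing error


def index_with_retries(
    make_client: Callable[[], object],
    index: Callable[[object], str],
    max_attempts: int = 3,
) -> str:
    """Run index() with a fresh client per attempt, releasing it every time."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        client = make_client()  # fresh client for each attempt
        try:
            return index(client)
        except Exception as exc:
            last_error = exc
        finally:
            close_quietly(client)  # runs on success and on failure
        gc.collect()  # only reached after a failed attempt
    raise RuntimeError(f"indexing failed after {max_attempts} attempts") from last_error
```

Closing in `finally` covers both the failed-attempt and the function-exit cases with one call site; `gc.collect()` only helps with objects that are kept alive by reference cycles, so it is a supplement to explicit closing rather than a replacement.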

Why this helps

In local mode, repeated failed or retried indexing attempts can keep resources, and the file descriptors they hold, alive longer than expected. This patch releases those resources promptly after each attempt.

Test evidence

  • python3 -m py_compile openkb/indexer.py (pass)
  • Full pytest suite not run in this environment because pytest is not installed in system Python.
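
Because the compile check above does not exercise runtime behaviour, a rough way to observe descriptor buildup is to sample the process's open-descriptor count before and after each document. The helper names below are hypothetical and not part of this PR; the sketch assumes a Unix-like system that exposes `/proc/self/fd` (Linux) or `/dev/fd` (macOS, BSD).

```python
import os
import resource


def open_fd_count() -> int:
    """Count descriptors currently open in this process (Unix-like systems only)."""
    for path in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(path):
            # The listing briefly holds one descriptor of its own, so treat the
            # value as a relative measure between samples, not an exact count.
            return len(os.listdir(path))
    raise OSError("no per-process fd directory on this platform")


def soft_fd_limit() -> int:
    """Soft RLIMIT_NOFILE: the ceiling that triggers [Errno 24] when reached."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

A count that climbs steadily across documents toward `soft_fd_limit()` points to a leak; a count that returns to its baseline after each document suggests the cleanup is working.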

Notes

This is an OpenKB-side mitigation; resource handling in upstream pageindex can still be hardened further.

Collaborator

@plasma16 Thank you for your contribution. However, this PR also changes other functionality. Could you please scope it to address only the current feature? Thank you.

KylinMountain pushed a commit that referenced this pull request Jun 12, 2026
litellm caches aiohttp clients per event loop. add_single_file runs each doc
via a fresh asyncio.run() loop, so the previous loop's clients are abandoned
and their HTTP connections linger in CLOSE-WAIT, accumulating sockets/FDs over
a long ingest (observed 200+ against a remote API on a 165-doc run).

Add _close_litellm_async_clients() (best-effort, never raises) and call it in
try/finally around index_long_document and both compile_short_doc /
compile_long_doc calls. Verified: CLOSE-WAIT returns to ~0 after each doc.

Supersedes the now-stale #44 (which carried the same intent on an old base).