feat(parsers): pluggable document parsers — MinerU / Mistral / VLM (closes #77) #81
Open · +1,383 −34
26 commits
| Commit | Message | Author |
|---|---|---|
| 7a5f781 | feat(parsers): add ParseResult and Parser ABC (#77) | KylinMountain |
| 592d11d | feat(images): add localize_images helper for parser output (#77) | KylinMountain |
| cec3618 | fix(images): use replacement function in localize_images to handle ar... | KylinMountain |
| 455c746 | feat(parsers): add LocalParser wrapping legacy extraction (#77) | KylinMountain |
| 0978cbf | feat(parsers): add registry + get_parser factory (#77) | KylinMountain |
| ed3368d | refactor(converter): route file→markdown through parser abstraction (... | KylinMountain |
| 27d314c | docs(converter): refresh convert_document docstring for parser flow; ... | KylinMountain |
| 36e8b4f | feat(parsers): add reusable litellm vision client (#77) | KylinMountain |
| 8b1d4eb | fix(parsers): use litellm file content part for PDFs in vlm_client (#77) | KylinMountain |
| 2a93eec | feat(parsers): add VLMParser (vision LLM via litellm) (#77) | KylinMountain |
| 2c7e693 | feat(parsers): add MistralParser via mistralai SDK (#77) | KylinMountain |
| 50b83bb | feat(parsers): log skipped undecodable Mistral images (#77) | KylinMountain |
| e452a3a | feat(parsers): add MineruParser (cloud + self-hosted HTTP) (#77) | KylinMountain |
| a6074f6 | test(parsers): cover MinerU cloud poll+download flow (#77) | KylinMountain |
| 82958e4 | feat(cli): add --parser override and default parser config (#77) | KylinMountain |
| 995b90c | feat(cli): validate --parser against valid set via click.Choice (#77) | KylinMountain |
| 2959a8d | build: add optional parser extras (mistral, mineru, parsers) (#77) | KylinMountain |
| 33cee68 | docs(readme): document pluggable document parsers (#77) | KylinMountain |
| 526db30 | fix(parsers): harden MinerU poll loop and anchor image-link rewrite (... | KylinMountain |
| 8af174f | fix(parsers): sanitize image filenames against path traversal; skip r... | KylinMountain |
| e424bb4 | fix(parsers): warn on VLM global-model fallback; unify parser dispatc... | KylinMountain |
| a981b91 | fix(cli): only propagate LLM_API_KEY to the active provider key (#77) | KylinMountain |
| 6287dea | fix(images): match image links by basename (dir-prefixed, titled) in ... | KylinMountain |
| 6e111fb | fix(parsers): harden MinerU cloud response handling, timeout, md sele... | KylinMountain |
| 02daf52 | fix(parsers): warn that VLM is text-only and on silent parser downgra... | KylinMountain |
| b243505 | fix(parsers): delete uploaded Mistral OCR files; fix patch.stopall te... | KylinMountain |
5 changes: 5 additions & 0 deletions
openkb/parsers/__init__.py
@@ -0,0 +1,5 @@

```python
"""Pluggable document parsers for the file → Markdown step."""
from openkb.parsers.base import ParseResult, Parser
from openkb.parsers.registry import get_parser

__all__ = ["ParseResult", "Parser", "get_parser"]
```
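The diff for `openkb/parsers/registry.py` is not shown in this view, so the following is only a minimal sketch of how a name-based registry with a `get_parser` factory could be structured. The `register` decorator, the stub classes, and the `get_parser(name)` signature are all assumptions made for illustration, not the PR's actual API.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical registry: parser name -> zero-argument factory.
# The real openkb registry may be keyed or populated differently.
_REGISTRY: dict[str, Callable[[], object]] = {}


def register(name: str):
    """Class decorator that records a parser factory under ``name``."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


@register("local")
class LocalStub:
    """Stand-in for a real parser class; it only carries a name."""
    name = "local"


@register("vlm")
class VLMStub:
    """Second stand-in, so the lookup has more than one choice."""
    name = "vlm"


def get_parser(name: str = "local"):
    """Return a new parser instance, or raise ValueError for unknown names."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        valid = ", ".join(sorted(_REGISTRY))
        raise ValueError(
            f"unknown parser {name!r}; expected one of: {valid}"
        ) from None
    return factory()
```

Failing loudly on an unknown name keeps a typo in configuration from silently selecting a different parser, which fits the commit that validates `--parser` against a fixed set of choices.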
33 changes: 33 additions & 0 deletions
openkb/parsers/base.py
@@ -0,0 +1,33 @@

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParseResult:
    """Normalized output of a parser.

    ``markdown`` references images either as bare filenames present in
    ``images`` or as inline base64 data URIs. ``images`` maps a filename to
    its raw bytes; the caller persists them and rewrites links via
    :func:`openkb.images.localize_images`.
    """

    markdown: str
    images: dict[str, bytes] = field(default_factory=dict)


class Parser(ABC):
    """Converts a source document to Markdown."""

    name: str

    @abstractmethod
    def supports(self, suffix: str) -> bool:
        """Return True if this parser handles files with ``suffix`` (e.g. ``.pdf``)."""

    @abstractmethod
    def parse(self, src: Path) -> ParseResult:
        """Parse ``src`` and return a :class:`ParseResult`."""
```
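To make the contract above concrete, here is a minimal sketch of a concrete parser. `ParseResult` and `Parser` are re-declared locally, mirroring the diff, so that the snippet runs without `openkb` installed. `PlainTextParser` is a hypothetical class invented for this example and is not part of the PR.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParseResult:
    # Local copy of the dataclass from openkb/parsers/base.py.
    markdown: str
    images: dict[str, bytes] = field(default_factory=dict)


class Parser(ABC):
    # Local copy of the ABC from openkb/parsers/base.py.
    name: str

    @abstractmethod
    def supports(self, suffix: str) -> bool: ...

    @abstractmethod
    def parse(self, src: Path) -> ParseResult: ...


class PlainTextParser(Parser):
    """Hypothetical parser: treats a .txt file's content as Markdown."""

    name = "plaintext"

    def supports(self, suffix: str) -> bool:
        # Case-insensitive match, as LocalParser.supports does in this PR.
        return suffix.lower() == ".txt"

    def parse(self, src: Path) -> ParseResult:
        # No images are produced, so ``images`` keeps its empty default.
        return ParseResult(markdown=src.read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "note.txt"
        src.write_text("# Title\n\nBody text.\n", encoding="utf-8")
        parser = PlainTextParser()
        if parser.supports(src.suffix):
            print(parser.parse(src).markdown)
```

Because both methods are abstract, instantiating `Parser` directly raises `TypeError`, so a subclass that forgets to implement `supports` or `parse` fails at construction time rather than at first use.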
47 changes: 47 additions & 0 deletions
openkb/parsers/local.py
@@ -0,0 +1,47 @@

```python
from __future__ import annotations

from pathlib import Path

from markitdown import MarkItDown

from openkb.images import (
    convert_pdf_with_images,
    copy_relative_images,
    extract_base64_images,
)
from openkb.parsers.base import ParseResult, Parser

_LOCAL_EXTENSIONS = {
    ".pdf", ".md", ".markdown", ".docx", ".pptx", ".xlsx", ".xls",
    ".html", ".htm", ".txt", ".csv",
}


class LocalParser(Parser):
    """Default parser: pymupdf for PDF, markitdown for office/html, direct read for md."""

    name = "local"

    def __init__(self, doc_name: str = "", images_dir: Path | None = None,
                 source_dir: Path | None = None):
        self.doc_name = doc_name
        self.images_dir = images_dir
        self.source_dir = source_dir

    def supports(self, suffix: str) -> bool:
        return suffix.lower() in _LOCAL_EXTENSIONS

    def parse(self, src: Path) -> ParseResult:
        suffix = src.suffix.lower()
        if suffix in {".md", ".markdown"}:
            markdown = src.read_text(encoding="utf-8")
            markdown = copy_relative_images(
                markdown, src.parent, self.doc_name, self.images_dir
            )
        elif suffix == ".pdf":
            markdown = convert_pdf_with_images(src, self.doc_name, self.images_dir)
        else:
            mid = MarkItDown()
            markdown = mid.convert(str(src)).text_content
            markdown = extract_base64_images(markdown, self.doc_name, self.images_dir)
        return ParseResult(markdown=markdown)
```
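The markitdown branch above hands its output to `extract_base64_images`, whose implementation is not part of this diff. As a rough illustration of what such a step might involve, the sketch below pulls inline `data:image/...;base64,` links out of Markdown, writes the decoded bytes to disk, and rewrites each link to a bare filename. The function name `extract_inline_images`, the filename scheme, and the regular expression are all assumptions. It uses a replacement function with `re.sub`, the approach one of the commits mentions for `localize_images`, so that characters in the replacement text are never read as regex escapes.

```python
from __future__ import annotations

import base64
import re
import tempfile
from pathlib import Path

# Matches ![alt](data:image/<ext>;base64,<payload>). Illustrative pattern only.
_DATA_URI = re.compile(
    r"!\[([^\]]*)\]\(data:image/([A-Za-z0-9]+);base64,([A-Za-z0-9+/=\s]+)\)"
)


def extract_inline_images(markdown: str, doc_name: str, images_dir: Path) -> str:
    """Write inline base64 images to ``images_dir`` and link them by filename."""
    images_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def _replace(match: re.Match) -> str:
        nonlocal counter
        alt, ext, payload = match.groups()
        try:
            data = base64.b64decode(payload)
        except ValueError:
            # Undecodable payload: leave the original link untouched.
            return match.group(0)
        counter += 1
        filename = f"{doc_name}-{counter}.{ext.lower()}"
        (images_dir / filename).write_bytes(data)
        return f"![{alt}]({filename})"

    # A function replacement is inserted literally, with no backslash processing.
    return _DATA_URI.sub(_replace, markdown)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        text = "before ![x](data:image/png;base64,aGVsbG8=) after"
        print(extract_inline_images(text, "doc", Path(tmp)))
```

Filenames here are generated from `doc_name` and a counter rather than taken from the document, which sidesteps path-traversal concerns of the kind a later commit in this PR addresses for parser-supplied names.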