Optional save-to-markdown for getting full extractions out
OCR is opt-in (default off, since tesseract.js downloads language data on first use)
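As a rough illustration of what "opt-in, default off" means for these two options, here is a minimal sketch. The `ExtractOptions` shape, the `ocr` and `saveToMarkdown` key names, and `resolveOptions` are all hypothetical and are not taken from the plugin's actual config.

```typescript
// Hypothetical option shape; the plugin's real config keys may differ.
interface ExtractOptions {
  ocr?: boolean;            // run OCR on images (assumed key name)
  saveToMarkdown?: boolean; // write the full extraction to a .md file (assumed key name)
}

// Both features stay off unless the user explicitly turns them on.
const DEFAULTS: Required<ExtractOptions> = {
  ocr: false,
  saveToMarkdown: false,
};

// Merge user-supplied options over the defaults.
// `??` keeps the default when a key is missing or explicitly undefined.
function resolveOptions(user: ExtractOptions = {}): Required<ExtractOptions> {
  return {
    ocr: user.ocr ?? DEFAULTS.ocr,
    saveToMarkdown: user.saveToMarkdown ?? DEFAULTS.saveToMarkdown,
  };
}
```

With this shape, `resolveOptions()` leaves OCR off, so the language data download is never triggered unless someone passes `{ ocr: true }`.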
Stack
TypeScript, runs inside opencode's plugin system. Uses pdf-parse, mammoth, xlsx, tesseract.js, cheerio, and jszip under the hood: all in-process JS libraries, no binaries.
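To make the division of labor concrete, here is a sketch of extension-based dispatch across those libraries. The mapping reflects what each library is generally used for, not the plugin's real routing table. `LIBRARY_BY_EXTENSION` and `pickLibrary` are hypothetical names, and the `zip` entry for jszip is a guess, since the post doesn't say which formats it handles.

```typescript
// Assumed mapping from file extension to the library that would handle it.
// A Map avoids accidental hits on inherited object keys like "constructor".
const LIBRARY_BY_EXTENSION = new Map<string, string>([
  ["pdf", "pdf-parse"],    // PDF text extraction
  ["docx", "mammoth"],     // Word documents
  ["xlsx", "xlsx"],        // spreadsheets
  ["html", "cheerio"],     // HTML parsing
  ["png", "tesseract.js"], // image OCR (opt-in)
  ["zip", "jszip"],        // zip archives (assumed use)
]);

// Return the library name for a file, or undefined if nothing matches.
function pickLibrary(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined; // no extension at all
  const ext = fileName.slice(dot + 1).toLowerCase();
  return LIBRARY_BY_EXTENSION.get(ext);
}
```

Since every entry is an in-process library, a dispatcher like this never has to shell out to an external tool.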
What I'd do differently
The image OCR path is the weakest link. tesseract.js works but it's slow on large images and the initial language data download is clunky. If I rewrote it today I'd probably reach for a smaller WASM OCR engine, but for now it's opt-in so it doesn't get in the way.
Repo: https://github.com/TejasS1233/opencode-parser
MIT, contributions welcome.