Digitizes an entire book into Markdown from page photos, PDFs, or EPUBs, using PaddleOCR-VL-1.5 via llama-server (local inference).
- miniforge or Anaconda
- llama-server (Vulkan recommended on Windows)
- GGUF model: PaddleOCR-VL-1.5-GGUF
python setup.py
conda activate ocr-livre
Then configure the paths to llama-server and the models. The easiest way is to copy .env.example to .env and edit it, but you can also use environment variables or CLI arguments — see docs/SETUP.md for all options.
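As a quick sanity check on the `.env` route, the sketch below parses simple `KEY=value` lines and reports which of the three variables are still unset. It is an illustrative helper, not part of the project: `parse_env_text` and `missing_keys` are hypothetical names, and the project's own loader may handle quoting or precedence differently.

```python
from pathlib import Path

# The three variables this README asks you to set.
REQUIRED_KEYS = ("LLAMA_SERVER_PATH", "MODEL_PATH", "MMPROJ_PATH")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text(encoding="utf-8") if env_file.exists() else ""
    print("Missing:", missing_keys(parse_env_text(text)) or "none")
```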
cp .env.example .env
# Edit .env and set LLAMA_SERVER_PATH, MODEL_PATH and MMPROJ_PATH
ocr-livre/
├── src/
│ ├── main.py # CLI entry point
│ ├── config.py # Central configuration (dataclass)
│ ├── ocr_client.py # OCR of an image via PaddleOCRVL
│ ├── postprocess.py # OCR text cleanup
│ ├── obsidian.py # Obsidian export (wikilinks, migration)
│ ├── images.py # Image collection and renaming
│ ├── pipeline.py # Full orchestration
│ ├── progress.py # Logging and statistics
│ ├── pdf.py # PDF processing (text extraction or render → OCR)
│ └── epub.py # EPUB extraction (Pandoc-based)
├── docs/
│ ├── architecture/ # Architecture documentation
│ ├── dev/ # Patches and development notes
│ ├── SETUP.md # Installation instructions
│ ├── tested.md # Experiment results
│ └── issues.md # Work in progress
├── photos/ # Source images (one per page)
├── output/ # Generated Markdown + logs + figures
├── environment.yml # Conda dependencies
└── setup.py # Automated installation script
Run from the project root:
# Default pipeline (photos in ./photos, output output/book.md)
python main.py
# Specify folders
python main.py --images ./my_photos --out output/my_book.md
# PDF input
python main.py --images ./book.pdf --out output/book.md
# EPUB input
python main.py --images ./book.epub --out output/book.md
# Without layout detection
python main.py --no-layout
# Restart from the beginning
python main.py --no-resume
# Detailed logs
python main.py --verbose
# Dense tables — increase context if tables are truncated
python main.py --n-ctx 12288 --n-parallel 3
A phone photo of a textbook page — charts, tables, and dense text — converted to clean Markdown in one command.
Left: original page photo. Right: extracted Markdown rendered.
PDFs are automatically classified as text-based (native text layer) or image-based (scanned).
- Text-based: extracts text natively with `pymupdf`, detects figures with the layout model, no VLM OCR.
- Image-based: renders pages to images, then runs the normal OCR pipeline.
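The README does not say how the classification is decided. One plausible heuristic, sketched below, counts how many pages carry a meaningful amount of extractable text. The function name `classify_pdf` and both thresholds are assumptions chosen for illustration, not the project's actual values.

```python
def classify_pdf(
    page_texts: list[str],
    min_chars_per_page: int = 50,   # assumed threshold, not from the project
    min_text_ratio: float = 0.8,    # assumed threshold, not from the project
) -> str:
    """Return "text" if most pages carry a usable text layer, else "image"."""
    if not page_texts:
        return "image"
    # A page counts as text-bearing once its stripped text is long enough.
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    ratio = pages_with_text / len(page_texts)
    return "text" if ratio >= min_text_ratio else "image"


if __name__ == "__main__":
    scanned = ["", "  ", "12"]              # page numbers only: no real text layer
    native = ["Lorem ipsum " * 20] * 10     # every page has plenty of text
    print(classify_pdf(scanned), classify_pdf(native))
```

The per-page strings could come from pymupdf's `page.get_text()`; they are passed in as plain strings here so the sketch runs without any PDF library installed.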
Choose the extraction method explicitly:
python main.py --images ./book.pdf --method text         # fast, native text only
python main.py --images ./book.pdf --method docling      # structured extraction
python main.py --images ./book.pdf --method paddleocrvl  # best quality, slowest
EPUBs are converted to Markdown via Pandoc, with embedded figures extracted automatically.
python main.py --images ./book.epub --out output/book.md
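For orientation, a direct Pandoc call that does roughly the same thing might look like the sketch below. The exact options the project passes are not documented here, so the output format (`gfm`) and the helper names `build_pandoc_command` and `convert_epub` are assumptions. `--extract-media` is the Pandoc option that writes embedded images to a folder.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(epub: Path, out_md: Path, media_dir: Path) -> list[str]:
    """Build a Pandoc command line that converts an EPUB to Markdown."""
    return [
        "pandoc",
        str(epub),
        "--to=gfm",                        # assumed target flavour
        f"--extract-media={media_dir}",    # write embedded figures here
        f"--output={out_md}",
    ]


def convert_epub(epub: Path, out_md: Path, media_dir: Path) -> None:
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(epub, out_md, media_dir), check=True)


if __name__ == "__main__":
    cmd = build_pandoc_command(
        Path("book.epub"), Path("output/book.md"), Path("output/figures")
    )
    print(" ".join(cmd))
```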
In obsidian mode, the pipeline:
- converts figures to wikilinks `![[Files/image.jpg]]`
- saves the `.md` directly into the vault
- copies figures to `vault_path/vault_figures_dir/`
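The wikilink step can be pictured as a regex rewrite of standard Markdown image links. This is a minimal sketch, assuming figures appear as `![alt](path)` in the base output and that only the file name is kept; `to_wikilinks` is a hypothetical name, and the real logic lives in `obsidian.py`.

```python
import re
from pathlib import PurePosixPath

# Matches standard Markdown images: ![alt text](path/to/file.jpg)
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def to_wikilinks(markdown: str, figures_dir: str = "Files") -> str:
    """Rewrite Markdown image links as Obsidian embeds under figures_dir."""

    def replace(match: re.Match) -> str:
        # Keep only the file name; the vault folder replaces the old path.
        filename = PurePosixPath(match.group(1)).name
        return f"![[{figures_dir}/{filename}]]"

    return MD_IMAGE.sub(replace, markdown)


if __name__ == "__main__":
    print(to_wikilinks("See ![Fig 1](figures/image.jpg) for details."))
```

Links that are already wikilinks do not match the pattern, so running the rewrite twice leaves the text unchanged.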
Configure vault_path and vault_figures_dir in config.py, then:
# Full OCR + obsidian export
python main.py --mode obsidian
# Re-apply obsidian postprocess without re-running OCR
python main.py --mode obsidian --postprocess-only
# Migrate figures to the vault only
python main.py --migrate
# Preview without modifying
python main.py --rename --dry-run
# Rename for real (→ page_001.jpg, page_002.jpg, ...)
python main.py --rename
# Rename without running OCR
python main.py --rename-only
# Process subfolders by chapter
python main.py --rename-only --chapters "Chapter 1" "Chapter 2"
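To illustrate what a rename plan with a dry-run mode could look like, here is a small sketch. The natural sort order, the accepted extensions, and the names `plan_renames` and `apply_renames` are assumptions. Only the `page_001.jpg` naming pattern and the default prefix come from this README.

```python
import re
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted extensions


def natural_key(name: str) -> list:
    """Sort key that orders IMG_2 before IMG_10 (digit runs compared as ints)."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


def plan_renames(
    names: list[str], prefix: str = "page", start: int = 1
) -> list[tuple[str, str]]:
    """Return (old_name, new_name) pairs without touching the filesystem."""
    images = sorted(
        (n for n in names if Path(n).suffix.lower() in IMAGE_SUFFIXES),
        key=natural_key,
    )
    return [
        (old, f"{prefix}_{start + i:03d}{Path(old).suffix.lower()}")
        for i, old in enumerate(images)
    ]


def apply_renames(folder: Path, plan: list[tuple[str, str]], dry_run: bool = True) -> None:
    """Print the plan; rename files only when dry_run is False."""
    for old, new in plan:
        print(f"{old} -> {new}")
        if dry_run or old == new:
            continue
        target = folder / new
        if target.exists():
            # Refuse to overwrite an existing file.
            raise FileExistsError(target)
        (folder / old).rename(target)


if __name__ == "__main__":
    apply_renames(Path("photos"), plan_renames(["IMG_10.jpg", "IMG_2.jpg"]))
```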
If the pipeline is interrupted, simply re-run:
python main.py
Already processed pages are automatically skipped.
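How the pipeline records finished pages is not described here. A common pattern is a small state file listing completed pages, as in the sketch below. The JSON format and the names `load_done`, `mark_done`, and `pending_pages` are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_done(state_file: Path) -> set[str]:
    """Read the set of already processed pages (empty on first run)."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text(encoding="utf-8")))


def mark_done(state_file: Path, done: set[str], page: str) -> None:
    """Record one finished page and persist the state immediately."""
    done.add(page)
    state_file.write_text(json.dumps(sorted(done)), encoding="utf-8")


def pending_pages(all_pages: list[str], done: set[str]) -> list[str]:
    """Pages still to process, in their original order."""
    return [page for page in all_pages if page not in done]


if __name__ == "__main__":
    state = Path("resume_state_example.json")  # hypothetical file name
    done = load_done(state)
    for page in pending_pages(["page_001.jpg", "page_002.jpg"], done):
        # ... OCR would run here ...
        mark_done(state, done, page)
    print(sorted(done))
```

Writing the state after every page means an interruption loses at most the page in progress.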
--images PATH Photo folder, PDF, or EPUB (default: ./photos)
--out FILE Output Markdown file (default: output/book.md)
--llama-server PATH Path to llama-server executable (env: LLAMA_SERVER_PATH)
--model PATH Path to model .gguf (env: MODEL_PATH)
--mmproj PATH Path to mmproj .gguf (env: MMPROJ_PATH)
--mode {base,obsidian} Output mode (default: base)
--method {text,docling,paddleocrvl} PDF extraction method (default: paddleocrvl)
--no-layout Disable layout detection
--no-resume Restart from the beginning
--no-postprocess Raw output without cleanup
--postprocess-only Obsidian postprocess without OCR (requires --mode obsidian)
--migrate Copy figures to the vault (requires vault_path configured)
--dry-run Simulate without modifying
--verbose DEBUG logs
--rename Rename images before OCR
--rename-only [N] Rename without running OCR (N = starting number)
--rename-prefix P Rename prefix (default: page)
--chapters NAME... Subfolders to process (in order)
--dir-level Folder-level order for --rename
--max-tokens N Max tokens generated per page (default: 4096)
--n-ctx N KV cache size (context window) (default: 6144)
--n-parallel N Intra-page parallel slots (default: 3)
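The interaction between `--n-ctx` and `--n-parallel` is easier to reason about with a little arithmetic. The sketch below assumes the context window is shared evenly between the parallel slots; this README does not confirm that, and the behaviour may depend on the llama-server version. `tokens_per_slot` is a hypothetical helper.

```python
def tokens_per_slot(n_ctx: int, n_parallel: int) -> int:
    """Context budget per slot, ASSUMING an even split across slots."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return n_ctx // n_parallel


if __name__ == "__main__":
    # Defaults from the option list, then the "dense tables" example.
    for n_ctx, n_parallel in [(6144, 3), (12288, 3)]:
        print(n_ctx, n_parallel, tokens_per_slot(n_ctx, n_parallel))
```

Under that assumption, the defaults leave 2048 tokens per slot for prompt and output combined, and `--n-ctx 12288 --n-parallel 3` raises that to 4096, which would fit with the advice to increase the context when tables are truncated.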
| Code | Meaning |
|---|---|
| 0 | Full success |
| 1 | Fatal error |
| 2 | Finished with errors on some pages |
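When calling the pipeline from another script, the exit code can be mapped back to the meanings in the table. This is a minimal wrapper sketch: `describe_exit` and `run_pipeline` are hypothetical names, and it assumes `main.py` is reachable from the current working directory.

```python
import subprocess
import sys

# Meanings copied from the exit-code table.
EXIT_MEANINGS = {
    0: "Full success",
    1: "Fatal error",
    2: "Finished with errors on some pages",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into the meaning listed in the table."""
    return EXIT_MEANINGS.get(code, f"Unexpected exit code {code}")


def run_pipeline(extra_args: list[str]) -> int:
    """Run main.py with the current interpreter and return its exit code."""
    result = subprocess.run([sys.executable, "main.py", *extra_args])
    return result.returncode


if __name__ == "__main__":
    print(describe_exit(2))
```

A caller might treat code 2 as a partial result worth keeping, and only abort on code 1.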