Releases: neoncapy/doc2md

v3.5.0 — Marker as Default Extractor

01 Mar 00:06
@neoncapy neoncapy

What's New in v3.5.0

Marker is now the default PDF extractor

The pipeline now uses marker-pdf as the default extractor for digital PDFs. Marker produces significantly better output for academic papers — especially multi-column layouts, complex tables, and mathematical notation.

New fallback chain: marker → docling → pymupdf4llm → mineru → tesseract

Extractor             Used for
marker (new default)  Digital PDFs
docling               Scanned PDFs
pymupdf4llm           Fallback
MinerU                Complex layouts
tesseract             Last resort OCR
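
As a rough illustration of how an ordered fallback chain like the one above can behave, here is a minimal sketch. It is not the pipeline's actual code: `extract_with_fallback`, the `Extractor` signature, and the `min_chars` emptiness check are all hypothetical names chosen for this example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical extractor signature: takes a PDF path, returns Markdown text.
Extractor = Callable[[str], str]


def extract_with_fallback(
    pdf_path: str,
    extractors: Sequence[Tuple[str, Extractor]],
    min_chars: int = 1,
) -> Tuple[str, str]:
    """Try each (name, extractor) pair in order and return the first usable result.

    An extractor is skipped when it raises or when its stripped output is
    shorter than min_chars (an assumed stand-in for a real quality check).
    """
    errors: List[str] = []
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # any failure moves on to the next extractor
            errors.append(f"{name}: {exc}")
            continue
        if len(text.strip()) >= min_chars:
            return name, text
        errors.append(f"{name}: output too short")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

The caller would pass the extractors in chain order, for example `[("marker", run_marker), ("docling", run_docling), ...]`, where the `run_*` callables are placeholders for whatever wraps each tool.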

New: Step 3b — Image analysis after deferred extraction

Some extractors (like marker) defer image extraction to a later pipeline step. Previously, this meant prepare-image-analysis.py would skip because no image manifest existed yet. Now Step 3b automatically re-runs image analysis preparation after Step 6c creates the manifest. This unblocks AI expert persona descriptions for all extractor paths.
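
The re-run described above amounts to calling the same preparation step twice and letting it no-op until the manifest exists. A minimal sketch under that assumption follows; `prepare_if_manifest` and the `prepare` callback are hypothetical and do not reflect the real interface of prepare-image-analysis.py.

```python
from pathlib import Path
from typing import Callable


def prepare_if_manifest(manifest: Path, prepare: Callable[[Path], None]) -> bool:
    """Run prepare(manifest) only when the manifest file exists.

    Returns True if preparation ran, False if it was skipped.
    """
    if not manifest.is_file():
        return False  # deferred extractors have not written the manifest yet
    prepare(manifest)
    return True


# Intended call pattern (illustrative):
#   1. early step: prepare_if_manifest(...) returns False for deferred extractors
#   2. a later step writes the manifest
#   3. Step 3b: prepare_if_manifest(...) is called again and now returns True
```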

New file: convert-paper-marker.py

Standalone marker wrapper (~630 lines) with:

  • Page-count-based timeout (scales with document size)
  • CPU retry with configurable timeout
  • YAML title/author enrichment via fitz
  • Journal name title filtering
  • Hyphen-compound preservation
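
For readers curious what a page-count-based timeout might look like, here is a minimal sketch. The function name and every constant in it (base, per-page allowance, cap) are assumptions for illustration, not the values used by convert-paper-marker.py.

```python
def timeout_for_pages(
    page_count: int,
    base_seconds: float = 120.0,     # assumed fixed startup allowance
    per_page_seconds: float = 10.0,  # assumed per-page allowance
    max_seconds: float = 3600.0,     # assumed hard cap
) -> float:
    """Return a timeout that grows linearly with page count, up to a cap."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return min(base_seconds + per_page_seconds * page_count, max_seconds)
```

A linear scale with a cap keeps short papers from waiting on a long timeout while still bounding very large documents.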

Quality & reliability fixes

  • --no-images flag now fully respected across all extractor paths (was leaking through for marker)
  • run_command() timeout parameter — outer timeout guard prevents infinite hangs
  • Quality gate fallback — when an extractor exits 0 but produces critically empty output, the pipeline automatically falls back to the next extractor AND re-checks quality
  • Registry locking: fcntl.flock prevents corruption from concurrent pipeline runs
  • Checkpoint recovery — quality gate fallback now correctly updates checkpoint state for crash recovery
  • Type safety: figure_num handling works with mixed int/string values from split-panel images
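
Two of the fixes above, the outer timeout guard and the registry lock, can be sketched with the standard library alone. This is an illustrative approximation, not the project's code: the `run_command` signature shown here, the `update_registry` helper, and the JSON registry format are assumptions. `fcntl` is available on Unix-like systems only.

```python
import fcntl
import json
import os
import subprocess
from typing import Any, Dict, Optional, Sequence


def run_command(
    cmd: Sequence[str], timeout: Optional[float] = None
) -> Optional[subprocess.CompletedProcess]:
    """Run cmd and return the result, or None if it exceeds timeout seconds."""
    try:
        return subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging
        return None


def update_registry(path: str, key: str, value: Any) -> Dict[str, Any]:
    """Read-modify-write a JSON registry file under an exclusive advisory lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            raw = fh.read()
            registry = json.loads(raw) if raw.strip() else {}
            registry[key] = value
            fh.seek(0)
            fh.truncate()
            json.dump(registry, fh)
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return registry
```

Because `flock` locks are advisory, this only protects the file if every writer goes through the same locking path.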

Upgrade

cd your-doc2md-directory
git pull origin main
pip install marker-pdf # new required dependency
pip install symspellpy wordsegment # optional, improves post-processing

Full QC history

All changes went through adversarial QC loops (fix → QC → fix → QC) until zero issues at all severity levels:

  • Deferred audit fixes: 2 QC rounds → 0/0/0
  • Step 3b + manifest fixes: 3 QC rounds → 0/0/0
  • Integration tested on academic papers end-to-end