🚀 Fast, intelligent PDF converter powered by Vision Language Models. Convert PDFs to Markdown, HTML, JSON, and more with Apple Silicon optimization. Powered by Docling.

jaydotsee/pdfx

pdfx - PDF to Markdown Converter

A Python CLI tool for converting PDF documents to Markdown, optimized for Apple Silicon with MLX acceleration.

Note: This is a user-friendly wrapper around Docling, providing a command-line interface, YAML configuration, and batch processing capabilities.

Python 3.10-3.12 License: MIT

✨ Features

  • 🚀 Fast PDF Conversion - Vision Language Model (VLM) based processing with MLX acceleration
  • 🍎 Apple Silicon Optimized - Native MPS (Metal Performance Shaders) support
  • 📦 Batch Processing - Convert entire directories while preserving structure
  • 🎨 Multiple Output Formats - Markdown, JSON, HTML, and DocTags
  • 🔍 OCR Support - Extract text from scanned documents
  • 📊 Table Extraction - Intelligent table structure recognition
  • 🧮 Formula Support - Extract mathematical formulas as LaTeX
  • 🖼️ Image Handling - Embed images as base64 or save separately
  • ⚙️ YAML Configuration - Easy customization with config files
  • 🌐 URL Support - Convert PDFs directly from URLs
  • 🎯 Interactive Mode - Choose output location with a visual menu

📋 Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4) - recommended for best performance
  • Python 3.10 - 3.12 (Python 3.13+ not yet supported by Docling)
  • uv package manager (recommended) or pip

🚀 Quick Start

Installation

# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone or navigate to project directory
cd pdfx
# 3. Create virtual environment
uv venv --python 3.12
source .venv/bin/activate
# 4. Install package in development mode
uv pip install -e .
# 5. Download required models (one-time setup, ~500MB-1GB)
mkdir -p ~/.cache/docling/models
docling-tools models download
# 6. Verify installation
python verify_install.py
pdfx --help

Basic Usage

# Convert a single PDF
pdfx input.pdf
# Interactive mode (choose output location)
pdfx input.pdf -i
# Convert from URL
pdfx https://arxiv.org/pdf/2408.09869
# Convert a directory
pdfx ~/Documents/pdfs/
# List available models
pdfx --list-models
# Verbose output for debugging
pdfx input.pdf --verbose
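
The commands above accept a file, a directory, or a URL as the input argument. A minimal sketch of how such an argument could be told apart, assuming only the standard library; `classify_input` is a hypothetical helper for illustration, not part of pdfx:

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(source: str) -> str:
    """Return 'url', 'directory', or 'file' for a command-line input string."""
    parsed = urlparse(source)
    # Only http(s) links with a host are treated as URLs (an assumption).
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    path = Path(source).expanduser()  # expand "~" as the shell examples do
    if path.is_dir():
        return "directory"
    # Anything else is treated as a single file; existence is not checked here.
    return "file"


print(classify_input("https://arxiv.org/pdf/2408.09869"))
print(classify_input("input.pdf"))
```

Whether pdfx validates the path or the URL before handing it to Docling is not stated in this README, so the sketch leaves that step out.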

📖 Usage

Interactive Mode

Use the -i flag to select the output location from a menu:

pdfx document.pdf -i

You'll see:

============================================================
Select output location:
============================================================
1. Default (~/Downloads/)
2. Current directory (./output)
3. Same directory as source
4. Desktop (~/Desktop/)
5. Custom path...
============================================================
Enter your choice (1-5) or press Enter for default:
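
The menu choices map naturally onto directories. The following is a hedged sketch of that mapping, assuming "Same directory as source" means the parent folder of the input file; `resolve_output_dir` is a hypothetical name and pdfx's own logic may differ:

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(choice: str, source: Path, custom: Optional[str] = None) -> Path:
    """Map a menu choice ("1"-"5", or "" for the default) to an output directory."""
    choice = choice.strip() or "1"  # pressing Enter selects the default
    if choice == "1":
        return Path.home() / "Downloads"
    if choice == "2":
        return Path.cwd() / "output"
    if choice == "3":
        return source.parent  # assumes the source is a file, not a directory
    if choice == "4":
        return Path.home() / "Desktop"
    if choice == "5":
        if not custom:
            raise ValueError("a custom path is required for choice 5")
        return Path(custom).expanduser()
    raise ValueError(f"invalid choice: {choice!r}")


print(resolve_output_dir("3", Path("/tmp/docs/report.pdf")))
```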

Command-Line Options

pdfx [OPTIONS] input
Options:
 -h, --help Show help message
 -c, --config CONFIG Path to config file (default: config.yaml)
 -o, --output OUTPUT Output directory (overrides config)
 -f, --format FORMAT Output format: markdown, json, html, doctags
 -v, --verbose Enable verbose logging
 -i, --interactive Prompt for output location
 --list-models List available VLM models
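
For readers who want to see how an option table like this is typically wired up, here is a sketch using `argparse` from the standard library. It mirrors the flags listed above but is not taken from pdfx's `cli.py`; making `input` optional so that `--list-models` can run on its own is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented pdfx options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdfx", description="Convert PDFs to Markdown and other formats."
    )
    # nargs="?" is an assumption so that `pdfx --list-models` needs no input.
    parser.add_argument("input", nargs="?", help="PDF file, directory, or URL")
    parser.add_argument("-c", "--config", default="config.yaml", help="Path to config file")
    parser.add_argument("-o", "--output", help="Output directory (overrides config)")
    parser.add_argument(
        "-f", "--format",
        choices=["markdown", "json", "html", "doctags"],
        help="Output format",
    )
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose logging")
    parser.add_argument("-i", "--interactive", action="store_true", help="Prompt for output location")
    parser.add_argument("--list-models", action="store_true", help="List available VLM models")
    return parser


args = build_parser().parse_args(["report.pdf", "-f", "json", "-o", "out", "-v"])
print(args.input, args.format, args.output, args.verbose)  # report.pdf json out True
```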

Configuration

Create or edit config.yaml to customize behavior:

# Model Configuration
model:
 # Pipeline type: vlm (fast) or standard (full features)
 pipeline_type: "standard"
 # VLM model (Apple Silicon optimized)
 vlm_model: "SMOLDOCLING_MLX"
# Output Configuration
output:
 # Format: markdown, json, html, doctags
 # Can be single format or list for multiple outputs
 format: ["markdown"] # or ["markdown", "json"]
 # Image handling
 include_images: true
 image_mode: "embedded" # or "referenced"
# Processing Options
processing:
 # OCR for scanned documents
 enable_ocr: false
 ocr_engine: "auto"
 # Performance tuning
 page_batch_size: 8 # Higher = faster but more memory
# Feature Toggles (standard pipeline only)
features:
 # Table extraction
 table_structure: true
 table_mode: "ACCURATE" # or "FAST"
 # Content enrichment
 formula_enrichment: true
 code_enrichment: true
 picture_classification: true
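
A config file usually only needs the keys that differ from the defaults. The sketch below shows one common way to layer user settings over defaults with a recursive merge, using plain dictionaries in place of parsed YAML. The default values are copied from the example above; `deep_merge` and `effective_config` are hypothetical helpers, and how pdfx's `config.py` actually merges settings is not documented here.

```python
import copy
from typing import Any, Dict

# Defaults copied from the example config above (subset).
DEFAULTS: Dict[str, Any] = {
    "model": {"pipeline_type": "standard", "vlm_model": "SMOLDOCLING_MLX"},
    "output": {"format": ["markdown"], "include_images": True, "image_mode": "embedded"},
    "processing": {"enable_ocr": False, "ocr_engine": "auto", "page_batch_size": 8},
}


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with override applied, merging nested dicts key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def effective_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Layer user settings over the defaults and normalize output.format to a list."""
    config = deep_merge(DEFAULTS, user_config)
    fmt = config["output"]["format"]
    if isinstance(fmt, str):  # the config allows a single format or a list
        config["output"]["format"] = [fmt]
    return config


cfg = effective_config({"processing": {"enable_ocr": True, "page_batch_size": 2}})
print(cfg["processing"])  # {'enable_ocr': True, 'ocr_engine': 'auto', 'page_batch_size': 2}
```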

🎯 Common Use Cases

Academic Papers

Extract formulas and tables with high accuracy:

# config.yaml
model:
 pipeline_type: "standard"
features:
 formula_enrichment: true
 table_structure: true
 table_mode: "ACCURATE"

Then run:

pdfx research_paper.pdf

Scanned Documents

Enable OCR for image-based PDFs:

# config.yaml
processing:
 enable_ocr: true
 ocr_engine: "auto"
 page_batch_size: 2

Then run:

pdfx scanned_document.pdf

Batch Processing

Convert entire directories:

# Convert all PDFs in a directory
pdfx ~/Documents/reports/
# With interactive output selection
pdfx ~/Documents/reports/ -i
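
Preserving directory structure in a batch run comes down to reusing each PDF's path relative to the input root. A minimal sketch with `pathlib`, assuming Markdown output and a case-sensitive `*.pdf` match; `plan_batch` is a hypothetical helper, not pdfx's implementation:

```python
from pathlib import Path
from typing import Iterator, Tuple


def plan_batch(input_dir: Path, output_dir: Path, suffix: str = ".md") -> Iterator[Tuple[Path, Path]]:
    """Yield (source_pdf, target_path) pairs that mirror the input tree under output_dir."""
    for pdf in sorted(input_dir.rglob("*.pdf")):
        relative = pdf.relative_to(input_dir)  # e.g. 2024/q1/report.pdf
        yield pdf, output_dir / relative.with_suffix(suffix)


if __name__ == "__main__":
    for src, dst in plan_batch(Path("reports"), Path("output")):
        print(f"{src} -> {dst}")
```

The caller would still need to create each target's parent directory (for example with `dst.parent.mkdir(parents=True, exist_ok=True)`) before writing.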

Multiple Output Formats

Export to both Markdown and JSON:

# config.yaml
output:
 format: ["markdown", "json"]

This creates both .md and .json files for each PDF.

📊 Pipeline Comparison

VLM Pipeline (Fast)

Best for: Simple documents, speed priority

model:
 pipeline_type: "vlm"

  • ⚡ Fastest processing (~1 second/page)
  • 🍎 Apple Silicon optimized
  • ⚠️ Limited features (no OCR, table extraction, or enrichments)

Standard Pipeline (Full Features)

Best for: Complex documents, tables, formulas

model:
 pipeline_type: "standard"

  • ✅ Full feature support
  • 📊 Table structure recognition
  • 🧮 Formula extraction
  • 🔍 OCR support
  • ⏱️ Slower but more accurate

🛠️ Troubleshooting

Models Not Found

Download models manually:

mkdir -p ~/.cache/docling/models
docling-tools models download

Python Version Issues

Ensure you're using Python 3.10-3.12:

python --version
# If wrong version:
uv venv --python 3.12
source .venv/bin/activate
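
The same check can be done from inside Python, which is handy in a setup script. This is a small sketch based only on the supported range stated above (3.10 to 3.12); it is not the contents of `verify_install.py`.

```python
import sys
from typing import Tuple

SUPPORTED_MIN = (3, 10)
SUPPORTED_MAX = (3, 12)


def is_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if the (major, minor) version falls in the supported range."""
    return SUPPORTED_MIN <= version <= SUPPORTED_MAX


if not is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is outside 3.10-3.12")
```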

Out of Memory

Reduce batch size in config:

processing:
 page_batch_size: 2 # or 1 for very large files
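
To see why a smaller batch size lowers peak memory, it helps to picture how a document splits into batches: fewer pages are held at once, at the cost of more passes. A sketch of that split, assuming pages are grouped in consecutive runs (how Docling batches internally is not described in this README):

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive 1-based (first_page, last_page) ranges of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        yield start, min(start + batch_size - 1, total_pages)


print(list(page_batches(10, 8)))  # [(1, 8), (9, 10)]
print(list(page_batches(10, 2)))  # [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
```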

Images Not Embedding

Ensure correct configuration:

output:
 include_images: true
 image_mode: "embedded"

Empty Table Columns

This may occur if:

  • The table contains images or icons instead of text
  • The table has a complex structure

Try JSON export to see the raw extracted data:

output:
 format: ["markdown", "json"]

OCR Not Working

  1. Enable OCR in the config (processing.enable_ocr: true)
  2. Install OCR dependencies:
uv pip install easyocr
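
To confirm that an OCR package is visible to the active environment, a quick import check can help. The sketch below uses `importlib.util.find_spec` from the standard library; the module names are assumed to match the package names listed under Dependencies, which may not hold for every release of those packages.

```python
import importlib.util
from typing import Iterable, List


def available_modules(candidates: Iterable[str] = ("easyocr", "rapidocr", "pytesseract")) -> List[str]:
    """Return the candidate top-level modules that can be imported in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


print("OCR modules found:", available_modules())
```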

For additional help:

🏗️ Project Structure

pdfx/
├── src/
│   └── pdfx/
│       ├── __init__.py
│       ├── cli.py            # CLI entry point
│       ├── config.py         # Configuration management
│       └── converter.py      # Core conversion logic
├── tests/                    # Test suite
├── examples/
│   └── config.yaml           # Example configuration
├── config.yaml               # Default configuration
├── pyproject.toml            # Package configuration
├── requirements.txt          # Dependencies
├── verify_install.py         # Installation verification
└── README.md                 # This file

🔧 Development

Running Tests

pytest
# With coverage
pytest --cov=pdfx --cov-report=html
# Verify installation
python verify_install.py

Package Installation

# Development mode (editable)
uv pip install -e .
# Or install dependencies manually
uv pip install -r requirements.txt

📦 Dependencies

Core:

  • docling>=2.0.0 - PDF processing engine
  • mlx-vlm>=0.1.0 - Apple Silicon acceleration
  • pyyaml>=6.0 - Configuration parsing
  • docling-core - Core document types

Optional:

  • easyocr - OCR engine for scanned documents
  • rapidocr - Alternative lightweight OCR
  • pytesseract - Tesseract OCR wrapper

Model Downloads:

  • First run downloads ~500MB-1GB to ~/.cache/docling/models
  • SmolDocling MLX: ~250MB
  • Supporting models: ~200-400MB

⚡ Performance

Apple Silicon (MLX):

  • Simple PDFs: ~1 second/page
  • Complex PDFs with tables: ~2 seconds/page
  • With OCR: ~3-5 seconds/page

Memory Usage:

  • Base (models loaded): ~500MB
  • Per page batch (4 pages): ~200-400MB
  • Peak (batch_size=8): ~1.5GB
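
These per-page figures make rough planning straightforward. As a back-of-the-envelope sketch using only the approximate numbers listed above (actual timings will vary with hardware and document content):

```python
from typing import Dict, Tuple

# Approximate seconds per page (low, high), taken from the figures above.
SECONDS_PER_PAGE: Dict[str, Tuple[float, float]] = {
    "simple": (1.0, 1.0),
    "tables": (2.0, 2.0),
    "ocr": (3.0, 5.0),
}


def estimate_seconds(pages: int, kind: str) -> Tuple[float, float]:
    """Return a (low, high) estimate of total conversion time in seconds."""
    low, high = SECONDS_PER_PAGE[kind]
    return pages * low, pages * high


print(estimate_seconds(40, "ocr"))  # (120.0, 200.0)
```

So a 40-page scanned document would take roughly two to three and a half minutes with OCR enabled, by these estimates.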

🙏 Credits

This project is built on top of Docling by IBM Research.

Docling provides:

  • Advanced PDF understanding using Vision Language Models
  • Layout analysis and table structure recognition
  • Formula extraction and code block detection
  • Multiple export formats

This wrapper adds:

  • User-friendly CLI
  • YAML-based configuration
  • Batch processing capabilities
  • Interactive output selection
  • Apple Silicon optimization out of the box

Related Projects

  • Docling - Core PDF processing library
  • MLX - Apple's ML framework for Apple Silicon
  • SmolDocling - Lightweight VLM model

📚 Resources

📄 License

MIT License - see LICENSE file for details.

This project uses Docling (MIT License). See individual package licenses for dependencies.

🤝 Contributing

Contributions welcome! Please ensure:

  • Code follows existing style
  • Config changes are documented
  • Tests pass with various PDF types
  • CHANGELOG.md is updated with your changes

📮 Support

For issues:


Made with ❤️ using Docling
