pdfx - PDF to Markdown Converter

A Python CLI tool for converting PDF documents to Markdown, optimized for Apple Silicon with MLX acceleration.

Note: This is a user-friendly wrapper around Docling, providing a command-line interface, YAML configuration, and batch processing capabilities.

Python 3.10-3.12 | License: MIT

✨ Features

🚀 Fast PDF Conversion - Vision Language Model (VLM) based processing with MLX acceleration
🍎 Apple Silicon Optimized - Native MPS (Metal Performance Shaders) support
📦 Batch Processing - Convert entire directories while preserving structure
🎨 Multiple Output Formats - Markdown, JSON, HTML, and DocTags
🔍 OCR Support - Extract text from scanned documents
📊 Table Extraction - Intelligent table structure recognition
🧮 Formula Support - Extract mathematical formulas as LaTeX
🖼️ Image Handling - Embed images as base64 or save separately
⚙️ YAML Configuration - Easy customization with config files
🌐 URL Support - Convert PDFs directly from URLs
🎯 Interactive Mode - Choose output location with a visual menu

📋 Requirements

macOS with Apple Silicon (M1/M2/M3/M4) - recommended for best performance
Python 3.10 - 3.12 (Python 3.13+ not yet supported by Docling)
uv package manager (recommended) or pip

🚀 Quick Start

Installation

# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone or navigate to project directory
cd pdfx
# 3. Create virtual environment
uv venv --python 3.12
source .venv/bin/activate
# 4. Install package in development mode
uv pip install -e .
# 5. Download required models (one-time setup, ~500MB-1GB)
mkdir -p ~/.cache/docling/models
docling-tools models download
# 6. Verify installation
python verify_install.py
pdfx --help

Basic Usage

# Convert a single PDF
pdfx input.pdf
# Interactive mode (choose output location)
pdfx input.pdf -i
# Convert from URL
pdfx https://arxiv.org/pdf/2408.09869
# Convert a directory
pdfx ~/Documents/pdfs/
# List available models
pdfx --list-models
# Verbose output for debugging
pdfx input.pdf --verbose

📖 Usage

Interactive Mode

Use the -i flag to choose the output location from a menu:

pdfx document.pdf -i

You'll see:

============================================================
📁 Select output location:
============================================================
1. Default (~/Downloads/)
2. Current directory (./output)
3. Same directory as source
4. Desktop (~/Desktop/)
5. Custom path...
============================================================
Enter your choice (1-5) or press Enter for default:
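
The menu above amounts to a mapping from a choice to a directory. The following is a minimal sketch of that mapping, not the code in cli.py: the function name `resolve_output_dir`, its parameters, and the handling of a directory source under option 3 are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(choice: str, source: Path, custom: Optional[str] = None) -> Path:
    """Map a menu choice ("1"-"5", or "" for the default) to an output directory.

    Hypothetical helper: the name and signature are not taken from pdfx itself.
    """
    home = Path.home()
    choice = choice.strip() or "1"  # pressing Enter selects the default
    if choice == "1":
        return home / "Downloads"
    if choice == "2":
        return Path.cwd() / "output"
    if choice == "3":
        # Assumption: a directory source is its own output location,
        # while a single file writes next to itself.
        return source if source.is_dir() else source.parent
    if choice == "4":
        return home / "Desktop"
    if choice == "5":
        if not custom:
            raise ValueError("choice 5 requires a custom path")
        return Path(custom).expanduser()
    raise ValueError(f"invalid choice: {choice!r}")
```

Keeping the mapping in a pure function like this makes the prompt logic easy to test without simulating keyboard input.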

Command-Line Options

pdfx [OPTIONS] input
Options:
 -h, --help Show help message
 -c, --config CONFIG Path to config file (default: config.yaml)
 -o, --output OUTPUT Output directory (overrides config)
 -f, --format FORMAT Output format: markdown, json, html, doctags
 -v, --verbose Enable verbose logging
 -i, --interactive Prompt for output location
 --list-models List available VLM models
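
For readers who want to see how these options fit together, here is a sketch of an equivalent `argparse` parser. It mirrors the flags listed above but is an illustration only; the actual parser in cli.py may differ. `build_parser` is a hypothetical name, and making `input` optional is an assumption based on `pdfx --list-models` being runnable without an input.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented pdfx options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdfx", description="Convert PDFs to Markdown")
    # Assumption: input is optional so that --list-models works on its own.
    parser.add_argument("input", nargs="?", help="PDF file, directory, or URL")
    parser.add_argument("-c", "--config", default="config.yaml", help="Path to config file")
    parser.add_argument("-o", "--output", help="Output directory (overrides config)")
    parser.add_argument(
        "-f", "--format",
        choices=["markdown", "json", "html", "doctags"],
        help="Output format",
    )
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose logging")
    parser.add_argument("-i", "--interactive", action="store_true", help="Prompt for output location")
    parser.add_argument("--list-models", action="store_true", help="List available VLM models")
    return parser
```

Calling `build_parser().parse_args(["input.pdf", "-f", "json", "-i"])` yields a namespace with `format` set to `"json"` and `interactive` set to `True`.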

Configuration

Create or edit config.yaml to customize behavior:

# Model Configuration
model:
 # Pipeline type: vlm (fast) or standard (full features)
 pipeline_type: "standard"
 # VLM model (Apple Silicon optimized)
 vlm_model: "SMOLDOCLING_MLX"
# Output Configuration
output:
 # Format: markdown, json, html, doctags
 # Can be single format or list for multiple outputs
 format: ["markdown"] # or ["markdown", "json"]
 # Image handling
 include_images: true
 image_mode: "embedded" # or "referenced"
# Processing Options
processing:
 # OCR for scanned documents
 enable_ocr: false
 ocr_engine: "auto"
 # Performance tuning
 page_batch_size: 8 # Higher = faster but more memory
# Feature Toggles (standard pipeline only)
features:
 # Table extraction
 table_structure: true
 table_mode: "ACCURATE" # or "FAST"
 # Content enrichment
 formula_enrichment: true
 code_enrichment: true
 picture_classification: true
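
The use cases later in this README show partial config files that set only a few keys. One plausible way to handle that is to merge the user's file over a full set of defaults. The sketch below assumes that behavior; it is not taken from config.py. `DEFAULTS` copies the values shown above, and `merge_config` is a hypothetical helper. The `overrides` dict stands in for what a YAML loader would return for the user's file.

```python
from copy import deepcopy

# Defaults copied from the sample config.yaml above.
DEFAULTS = {
    "model": {"pipeline_type": "standard", "vlm_model": "SMOLDOCLING_MLX"},
    "output": {"format": ["markdown"], "include_images": True, "image_mode": "embedded"},
    "processing": {"enable_ocr": False, "ocr_engine": "auto", "page_batch_size": 8},
    "features": {
        "table_structure": True,
        "table_mode": "ACCURATE",
        "formula_enrichment": True,
        "code_enrichment": True,
        "picture_classification": True,
    },
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides into a copy of defaults (hypothetical helper)."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

With this approach, a file containing only `processing: {enable_ocr: true}` changes that one key and leaves `ocr_engine` and `page_batch_size` at their defaults.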

🎯 Common Use Cases

Academic Papers

Extract formulas and tables with high accuracy:

# config.yaml
model:
 pipeline_type: "standard"
features:
 formula_enrichment: true
 table_structure: true
 table_mode: "ACCURATE"

pdfx research_paper.pdf

Scanned Documents

Enable OCR for image-based PDFs:

# config.yaml
processing:
 enable_ocr: true
 ocr_engine: "auto"
 page_batch_size: 2

pdfx scanned_document.pdf

Batch Processing

Convert entire directories:

# Convert all PDFs in a directory
pdfx ~/Documents/reports/
# With interactive output selection
pdfx ~/Documents/reports/ -i
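
Batch mode preserves the directory structure of the input. A rough sketch of how that mapping could be computed is shown here; `plan_batch` is a hypothetical helper, not part of pdfx, and the `.md` suffix assumes Markdown output.

```python
from pathlib import Path


def plan_batch(input_dir: Path, output_dir: Path, suffix: str = ".md") -> list:
    """Pair every PDF under input_dir with an output path that keeps its subfolders."""
    pairs = []
    # Note: the "*.pdf" pattern is case-sensitive on most Linux filesystems.
    for pdf in sorted(input_dir.rglob("*.pdf")):
        relative = pdf.relative_to(input_dir)
        pairs.append((pdf, output_dir / relative.with_suffix(suffix)))
    return pairs
```

For example, `reports/2024/q1/summary.pdf` would map to `<output_dir>/2024/q1/summary.md`, so files with the same name in different subfolders do not overwrite each other.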

Multiple Output Formats

Export to both Markdown and JSON:

# config.yaml
output:
 format: ["markdown", "json"]

This creates both .md and .json files for each PDF.
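
As an illustration of how a format list could translate into output files, consider the sketch below. The `.md` and `.json` extensions come from the text above; the `.html` and `.doctags` extensions, and the `output_paths` helper itself, are assumptions.

```python
from pathlib import Path

# .md and .json are stated in this README; .html and .doctags are assumed.
EXTENSIONS = {"markdown": ".md", "json": ".json", "html": ".html", "doctags": ".doctags"}


def output_paths(pdf: Path, output_dir: Path, formats) -> list:
    """Return one output path per requested format (hypothetical helper)."""
    if isinstance(formats, str):
        # The config accepts either a single format or a list of formats.
        formats = [formats]
    return [output_dir / (pdf.stem + EXTENSIONS[fmt]) for fmt in formats]
```

Under these assumptions, `output_paths(Path("report.pdf"), Path("out"), ["markdown", "json"])` gives `out/report.md` and `out/report.json`.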

📊 Pipeline Comparison

VLM Pipeline (Fast)

Best for: Simple documents, speed priority

model:
 pipeline_type: "vlm"

⚡ Fastest processing (~1 second/page)
🍎 Apple Silicon optimized
⚠️ Limited features (no OCR, table extraction, or enrichments)

Standard Pipeline (Full Features)

Best for: Complex documents, tables, formulas

model:
 pipeline_type: "standard"

✅ Full feature support
📊 Table structure recognition
🧮 Formula extraction
🔍 OCR support
⏱️ Slower but more accurate
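
The trade-off between the two pipelines can be reduced to a simple rule: use the VLM pipeline unless the document needs a feature that only the standard pipeline offers. The helper below is a sketch of that rule; `choose_pipeline` and its parameters are hypothetical and not part of pdfx.

```python
def choose_pipeline(
    needs_ocr: bool = False,
    needs_tables: bool = False,
    needs_enrichment: bool = False,
) -> str:
    """Return the pipeline_type value that fits the document's needs."""
    # OCR, table extraction, and enrichments are listed as standard-only features.
    if needs_ocr or needs_tables or needs_enrichment:
        return "standard"
    return "vlm"
```

The returned string is the value to put under `model.pipeline_type` in config.yaml.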

🛠️ Troubleshooting

Models Not Found

Download models manually:

mkdir -p ~/.cache/docling/models
docling-tools models download

Python Version Issues

Ensure you're using Python 3.10-3.12:

python --version
# If wrong version:
uv venv --python 3.12
source .venv/bin/activate
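
The same check can be done from inside Python, which is handy in a setup script. This is a small sketch and not the contents of verify_install.py; `is_supported` is a hypothetical name, and the bounds restate the 3.10-3.12 range given above.

```python
import sys

MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported(version=None) -> bool:
    """Return True if the (major, minor) version is within 3.10-3.12 inclusive."""
    # Default to the running interpreter when no version tuple is passed in.
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "not supported"
    print(f"Python {current} is {status}")
```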

Out of Memory

Reduce batch size in config:

processing:
 page_batch_size: 2 # or 1 for very large files

Images Not Embedding

Ensure correct configuration:

output:
 include_images: true
 image_mode: "embedded"

Empty Table Columns

This may occur if:

The table contains images or icons instead of text
The table has a complex structure

Try JSON export to inspect the raw extracted data:

output:
 format: ["markdown", "json"]

OCR Not Working

Set enable_ocr: true under processing in config.yaml
Use pipeline_type: "standard" (the VLM pipeline does not support OCR)
Install an OCR engine:

uv pip install easyocr

For additional help:

Run with --verbose flag for detailed logging
Check Docling documentation
Review Docling GitHub issues

🏗️ Project Structure

pdfx/
├── src/
│ └── pdfx/
│ ├── __init__.py
│ ├── cli.py # CLI entry point
│ ├── config.py # Configuration management
│ └── converter.py # Core conversion logic
├── tests/ # Test suite
├── examples/
│ └── config.yaml # Example configuration
├── config.yaml # Default configuration
├── pyproject.toml # Package configuration
├── requirements.txt # Dependencies
├── verify_install.py # Installation verification
└── README.md # This file

🔧 Development

Running Tests

pytest
# With coverage
pytest --cov=pdfx --cov-report=html
# Verify installation
python verify_install.py

Package Installation

# Development mode (editable)
uv pip install -e .
# Or install dependencies manually
uv pip install -r requirements.txt

📦 Dependencies

Core:

docling>=2.0.0 - PDF processing engine
mlx-vlm>=0.1.0 - Apple Silicon acceleration
pyyaml>=6.0 - Configuration parsing
docling-core - Core document types

Optional:

easyocr - OCR engine for scanned documents
rapidocr - Alternative lightweight OCR
pytesseract - Tesseract OCR wrapper

Model Downloads:

First run downloads ~500MB-1GB to ~/.cache/docling/models
SmolDocling MLX: ~250MB
Supporting models: ~200-400MB

⚡ Performance

Apple Silicon (MLX):

Simple PDFs: ~1 second/page
Complex PDFs with tables: ~2 seconds/page
With OCR: ~3-5 seconds/page
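
These per-page figures allow a back-of-the-envelope estimate for a whole document. The sketch below simply multiplies them out; the category names and the `estimate_seconds` helper are made up for illustration, and real timings will vary with hardware and content.

```python
# (low, high) seconds per page, taken from the approximate figures above.
SECONDS_PER_PAGE = {
    "simple": (1.0, 1.0),
    "tables": (2.0, 2.0),
    "ocr": (3.0, 5.0),
}


def estimate_seconds(pages: int, kind: str) -> tuple:
    """Return a rough (low, high) processing-time estimate in seconds."""
    low, high = SECONDS_PER_PAGE[kind]
    return pages * low, pages * high
```

For a 40-page scanned document, `estimate_seconds(40, "ocr")` returns `(120.0, 200.0)`, or roughly two to three and a half minutes.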

Memory Usage:

Base (models loaded): ~500MB
Per page batch (4 pages): ~200-400MB
Peak (page_batch_size=8): ~1.5GB

🙏 Credits

This project is built on top of Docling by IBM Research.

Docling provides:

Advanced PDF understanding using Vision Language Models
Layout analysis and table structure recognition
Formula extraction and code block detection
Multiple export formats

This wrapper adds:

User-friendly CLI
YAML-based configuration
Batch processing capabilities
Interactive output selection
Apple Silicon optimization out of the box

Related Projects

Docling - Core PDF processing library
MLX - Apple's ML framework for Apple Silicon
SmolDocling - Lightweight VLM model

📄 License

MIT License - see LICENSE file for details.

This project uses Docling (MIT License). See individual package licenses for dependencies.

🤝 Contributing

Contributions are welcome! Please ensure that:

Code follows the existing style
Config changes are documented
Tests pass with various PDF types
CHANGELOG.md is updated with your changes

📮 Support

For issues:

This tool: Open an issue on GitHub
Docling: Docling GitHub Issues
uv package manager: uv GitHub Issues

Made with ❤️ using Docling
