rasata/liteparse

LiteParse




Looking for LiteParse V1? Follow this link to the old code

LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.

Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.

Sign up for LlamaParse free

Overview

  • Fast Text Parsing: Spatial text parsing using PDFium
  • Flexible OCR System:
    • Built-in: Tesseract (zero setup, bundled with the library)
    • HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
    • Standard API: Simple, well-defined OCR API specification
  • Screenshot Generation: Generate high-quality page screenshots for LLM agents
  • Multiple Output Formats: JSON and Text
  • Bounding Boxes: Precise text positioning information
  • Multi-language: Use from Rust, Node.js/TypeScript, Python, or the browser (WASM)
  • Multi-platform: Linux, macOS (Intel/ARM), Windows
flowchart LR
 subgraph Input["Input Formats"]
 direction TB
 PDF["PDF"]
 DOCX["DOCX"]
 XLSX["XLSX"]
 PPTX["PPTX"]
 IMG["Images"]
 end
 subgraph Core["Rust Core"]
 direction TB
 CONV["Format Conversion\nLibreOffice / ImageMagick"]
 EXTRACT["Text Extraction\nPDFium C library"]
 OCR["Selective OCR\nTesseract / HTTP / Custom"]
 MERGE["OCR Merge\nNative text + OCR results"]
 PROJ["Grid Projection\nSpatial layout reconstruction"]
 CONV --> EXTRACT
 EXTRACT --> OCR --> MERGE --> PROJ
 EXTRACT --> MERGE
 end
 subgraph Output[" Output "]
 direction TB
 JSON["Structured JSON\ntext + bounding boxes"]
 TEXT["Plain Text\nlayout-preserved"]
 SCREEN["Screenshots\nPNG rendering"]
 end
 subgraph Bindings["Language Bindings"]
 direction TB
 NAPI["Node.js / TypeScript\nnapi-rs"]
 PYO3["Python\nPyO3"]
 WASM["Browser / WASM\nwasm-bindgen"]
 CLI["CLI\ncargo / npm / pip"]
 NAPI ~~~ PYO3 ~~~ WASM ~~~ CLI
 end
 PDF --> EXTRACT
 DOCX & XLSX & PPTX & IMG --> CONV
 PROJ --> JSON & TEXT & SCREEN
 JSON & TEXT & SCREEN --> Bindings
 style Input fill:#F5F5F5,color:#000000,stroke:#37D7FA,stroke-width:2px
 style Core fill:#F5F5F5,color:#000000,stroke:#3E18F9,stroke-width:2px
 style Output fill:#F5F5F5,color:#000000,stroke:#FF8705,stroke-width:2px
 style Bindings fill:#F5F5F5,color:#000000,stroke:#FF8DF2,stroke-width:2px
 style PDF fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
 style DOCX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
 style XLSX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
 style PPTX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
 style IMG fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px
 style CONV fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
 style EXTRACT fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
 style OCR fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
 style MERGE fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px
 style PROJ fill:#4B72FE,color:#FFFFFF,stroke:#3E18F9,stroke-width:2px
 style JSON fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
 style TEXT fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
 style SCREEN fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px
 style NAPI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
 style PYO3 fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
 style WASM fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px
 style CLI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px

Installation

Install via your preferred package manager. All versions (except WASM) ship with the same lit CLI.

| Language | Install | Library Docs |
| --- | --- | --- |
| Node.js / TypeScript | npm i @llamaindex/liteparse | Node.js README |
| Python | pip install liteparse | Python README |
| Rust | cargo install liteparse (CLI) / cargo add liteparse (lib) | Rust README (crates.io) |
| Browser (WASM) | npm i @llamaindex/liteparse-wasm | WASM README |

Agent Skill

You can use liteparse as an agent skill, downloading it with the skills CLI tool:

npx skills add run-llama/llamaparse-agent-skills --skill liteparse

Alternatively, copy the SKILL.md file into your own skills setup.

CLI Usage

The CLI is the same across all installations (npm, pip, cargo install).

Parse Files

# Basic parsing
lit parse document.pdf
# Parse with specific format
lit parse document.pdf --format json -o output.json
# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"
# Parse without OCR
lit parse document.pdf --no-ocr
# Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -
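
The `--target-pages` value mixes single pages and ranges separated by commas. To illustrate how such a spec expands into page numbers, here is a minimal Python sketch. `parse_target_pages` is a hypothetical helper, not part of LiteParse, and it assumes pages are 1-based and ranges are inclusive; the CLI's own parser may differ in edge cases.

```python
def parse_target_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-5,10,15-20" into sorted page numbers.

    Assumption (not confirmed by the LiteParse docs): pages are 1-based
    and ranges include both endpoints.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers must be >= 1")
    return sorted(pages)
```

Duplicates collapse and the result is sorted, so "3,1,3" yields pages 1 and 3.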

Batch Parsing

Parse an entire directory of documents:

lit batch-parse ./input-directory ./output-directory

Generate Screenshots

Screenshots let LLM agents extract visual information, such as charts, figures, and layout, that text alone cannot capture.

# Screenshot all pages
lit screenshot document.pdf -o ./screenshots
# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots
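
When choosing a DPI, it helps to estimate the resulting image size. PDF page dimensions are expressed in points (1 point = 1/72 inch by default), so pixel dimensions scale with `dpi / 72`. The sketch below is plain arithmetic with a hypothetical helper name; the exact rounding LiteParse applies is an assumption.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def pixel_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Estimate rendered pixel dimensions for a page measured in points.

    Assumption: the renderer scales by dpi / 72 and rounds to the nearest pixel.
    """
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
letter_default = pixel_size(612, 792)          # at the default 150 DPI
letter_high = pixel_size(612, 792, dpi=300)    # at 300 DPI
```

Under these assumptions a US Letter page comes out at 1275 x 1650 pixels at 150 DPI and 2550 x 3300 pixels at 300 DPI, so doubling the DPI quadruples the pixel count.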

CLI Reference

Parse Command

lit parse [OPTIONS] <file>
Options:
 -o, --output <file> Output file path
 --format <format> Output format: json|text [default: text]
 --no-ocr Disable OCR
 --ocr-language <lang> OCR language, Tesseract format [default: eng]
 --ocr-server-url <url> HTTP OCR server URL (uses Tesseract if not provided)
 --tessdata-path <path> Path to tessdata directory
 --max-pages <n> Max pages to parse [default: 1000]
 --target-pages <pages> Pages to parse (e.g., "1-5,10,15-20")
 --dpi <dpi> Rendering DPI [default: 150]
 --preserve-small-text Keep very small text
 --password <password> Password for encrypted documents
 --num-workers <n> Concurrent OCR workers [default: CPU cores - 1]
 -q, --quiet Suppress progress output
 -h, --help Print help
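
If you drive the CLI from a script, assembling the argument list programmatically avoids shell-quoting problems. The following Python sketch uses only flags listed above; `build_parse_command` and `run_parse` are hypothetical helper names, and the sketch assumes `lit` is on your PATH.

```python
import subprocess
from typing import Optional


def build_parse_command(
    file: str,
    output: Optional[str] = None,
    fmt: str = "text",
    ocr: bool = True,
    ocr_language: Optional[str] = None,
    target_pages: Optional[str] = None,
    dpi: Optional[int] = None,
) -> list[str]:
    """Build an argv list for `lit parse` from the options documented above."""
    if fmt not in ("json", "text"):
        raise ValueError("fmt must be 'json' or 'text'")
    cmd = ["lit", "parse", file, "--format", fmt]
    if output:
        cmd += ["-o", output]
    if not ocr:
        cmd.append("--no-ocr")
    elif ocr_language:
        cmd += ["--ocr-language", ocr_language]
    if target_pages:
        cmd += ["--target-pages", target_pages]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    return cmd


def run_parse(**options) -> str:
    """Run `lit parse` and return its stdout (requires the CLI to be installed)."""
    result = subprocess.run(
        build_parse_command(**options), check=True, capture_output=True, text=True
    )
    return result.stdout
```

For example, `build_parse_command("document.pdf", output="output.json", fmt="json")` produces the same arguments as the second command in the Parse Files section.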

Batch Parse Command

lit batch-parse [OPTIONS] <input-dir> <output-dir>
Options:
 --format <format> Output format: json|text [default: text]
 --no-ocr Disable OCR
 --ocr-language <lang> OCR language [default: eng]
 --ocr-server-url <url> HTTP OCR server URL
 --tessdata-path <path> Path to tessdata directory
 --max-pages <n> Max pages per file [default: 1000]
 --dpi <dpi> Rendering DPI [default: 150]
 --recursive Recursively search input directory
 --extension <ext> Only process files with this extension (e.g., ".pdf")
 --password <password> Password for encrypted documents
 --num-workers <n> Concurrent OCR workers
 -q, --quiet Suppress progress output
 -h, --help Print help
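
To preview which files a batch run would pick up, the effect of `--recursive` and `--extension` can be approximated in a few lines of Python. This is a sketch under stated assumptions: `plan_batch` is a hypothetical helper, and the output naming (mirroring the input tree with a `.json` or `.txt` suffix) is a guess for illustration, not documented LiteParse behavior.

```python
from pathlib import Path
from typing import Optional


def plan_batch(
    input_dir,
    output_dir,
    extension: Optional[str] = None,
    recursive: bool = False,
    fmt: str = "text",
) -> list[tuple[Path, Path]]:
    """Return (source, destination) pairs for a hypothetical batch run."""
    in_root, out_root = Path(input_dir), Path(output_dir)
    # --recursive descends into subdirectories; otherwise only the top level is scanned.
    candidates = in_root.rglob("*") if recursive else in_root.glob("*")
    out_suffix = ".json" if fmt == "json" else ".txt"  # assumed naming scheme
    plan = []
    for src in sorted(candidates):
        if not src.is_file():
            continue
        # --extension filter, compared case-insensitively (an assumption).
        if extension and src.suffix.lower() != extension.lower():
            continue
        rel = src.relative_to(in_root)
        plan.append((src, out_root / rel.with_suffix(out_suffix)))
    return plan
```

Calling `plan_batch("./input-directory", "./output-directory", extension=".pdf", recursive=True)` lists every PDF under the input directory alongside its assumed output path, without parsing anything.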

Screenshot Command

lit screenshot [OPTIONS] <file>
Options:
 -o, --output-dir <dir> Output directory [default: ./screenshots]
 --target-pages <pages> Pages to screenshot (e.g., "1,3,5" or "1-5")
 --dpi <dpi> Rendering DPI [default: 150]
 --password <password> Password for encrypted documents
 -q, --quiet Suppress progress output
 -h, --help Print help

OCR Setup

Default: Tesseract

Tesseract is bundled and works out of the box:

lit parse document.pdf # OCR enabled by default
lit parse document.pdf --ocr-language fra # Specify language
lit parse document.pdf --no-ocr # Disable OCR

For offline or air-gapped environments, set TESSDATA_PREFIX to a directory containing pre-downloaded .traineddata files:

export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

Or pass the path directly:

lit parse document.pdf --tessdata-path /path/to/tessdata
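
In an air-gapped setup it is worth confirming that the language data is actually present before starting a long run. The Python sketch below checks a tessdata directory for the expected `.traineddata` files. `missing_traineddata` is a hypothetical helper; splitting on `+` follows Tesseract's own multi-language convention (for example `eng+fra`), and whether LiteParse accepts such combined values is an assumption.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir: str, ocr_language: str = "eng") -> list[str]:
    """Return the language codes that have no <lang>.traineddata file in tessdata_dir."""
    root = Path(tessdata_dir)
    missing = []
    # Tesseract joins multiple languages with '+', e.g. "eng+fra".
    for lang in ocr_language.split("+"):
        if not (root / f"{lang}.traineddata").is_file():
            missing.append(lang)
    return missing
```

An empty list means every requested language file was found; otherwise the returned codes tell you which files still need to be copied into the directory.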

Optional: HTTP OCR Servers

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines.

You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).

The API requires:

  • POST /ocr endpoint
  • Accepts file and language parameters
  • Returns JSON: { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }
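
When implementing a custom server, a quick shape check on the response can catch mistakes early. The Python sketch below validates a payload against the JSON shape listed above. `validate_ocr_response` is a hypothetical helper; the numeric-type checks and the requirement that x1 <= x2 and y1 <= y2 are assumptions, so treat OCR_API_SPEC.md as the authority.

```python
from numbers import Real


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_ocr_response(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    results = payload.get("results")
    if not isinstance(results, list):
        return ["'results' must be a list"]
    problems = []
    for i, item in enumerate(results):
        if not isinstance(item, dict):
            problems.append(f"results[{i}] must be an object")
            continue
        if not isinstance(item.get("text"), str):
            problems.append(f"results[{i}].text must be a string")
        bbox = item.get("bbox")
        if not (isinstance(bbox, list) and len(bbox) == 4 and all(map(_is_number, bbox))):
            problems.append(f"results[{i}].bbox must be [x1, y1, x2, y2]")
        elif bbox[0] > bbox[2] or bbox[1] > bbox[3]:
            # Assumption: (x1, y1) is the top-left corner and (x2, y2) the bottom-right.
            problems.append(f"results[{i}].bbox corners are out of order")
        if not _is_number(item.get("confidence")):
            problems.append(f"results[{i}].confidence must be a number")
    return problems
```

A payload such as `{"results": [{"text": "Hello", "bbox": [10, 20, 110, 40], "confidence": 0.98}]}` passes with no problems reported.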

Multi-Format Input Support

LiteParse supports automatic conversion of various document formats to PDF before parsing.

Supported Input Formats

Office Documents (via LibreOffice)

  • Word: .doc, .docx, .docm, .odt, .rtf, .pages
  • PowerPoint: .ppt, .pptx, .pptm, .odp, .key
  • Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv, .numbers

Install LibreOffice for automatic conversion:

# macOS
brew install --cask libreoffice
# Ubuntu/Debian
apt-get install libreoffice
# Windows
choco install libreoffice-fresh

On Windows, you may need to add LibreOffice's program directory (usually C:\Program Files\LibreOffice\program) to your PATH.

Images (via ImageMagick)

  • Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg

Install ImageMagick for image-to-PDF conversion:

# macOS
brew install imagemagick
# Ubuntu/Debian
apt-get install imagemagick
# Windows
choco install imagemagick.app
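
To see which external tool a given file will need, the extension lists above can be turned into a small lookup. This Python sketch only mirrors the documented format lists; `converter_for` is a hypothetical helper, and LiteParse's internal routing may differ.

```python
from pathlib import Path

# Extensions copied from the Office Documents and Images lists above.
OFFICE_EXTENSIONS = {
    ".doc", ".docx", ".docm", ".odt", ".rtf", ".pages",          # Word
    ".ppt", ".pptx", ".pptm", ".odp", ".key",                    # PowerPoint
    ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv", ".numbers",  # Spreadsheets
}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg"}


def converter_for(path: str) -> str:
    """Name the conversion step a file would need before PDF parsing."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "none"  # PDFs go straight to text extraction
    if suffix in OFFICE_EXTENSIONS:
        return "libreoffice"
    if suffix in IMAGE_EXTENSIONS:
        return "imagemagick"
    raise ValueError(f"unsupported input format: {suffix or path!r}")
```

Running this over a directory before a batch job shows up front whether LibreOffice, ImageMagick, or neither needs to be installed.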

Environment Variables

| Variable | Description |
| --- | --- |
| TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments. |

Development

The project is a Rust workspace with the core library and language-specific binding crates.

crates/
├── liteparse/ # Core library + CLI binary
├── liteparse-napi/ # Node.js bindings (napi-rs)
├── liteparse-python/ # Python bindings (PyO3)
├── liteparse-wasm/ # WASM bindings (wasm-bindgen)
├── pdfium/ # PDFium Rust wrapper
└── pdfium-sys/ # PDFium FFI bindings
packages/
├── node/ # npm package (TS wrapper + native binary)
├── python/ # PyPI package (Python wrapper + native binary)
└── wasm/ # WASM npm package

Building

# Build the CLI
cargo build --release -p liteparse
# Build Node.js bindings
cd packages/node && npm run build
# Build Python bindings
cd packages/python && maturin develop --release
# Build WASM
cd packages/wasm && npm run build

The repository includes a detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.

License

Apache 2.0

Credits

Built on top of:
