Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.
paper-parser is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.
Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:
- Context Overflow: Large papers can exceed an LLM's context window.
- Token Waste: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.
The Solution: paper-parser uses the MinerU V4 API to extract high-quality Markdown and then automatically splits the paper into chapters. This allows AI agents to read the paper section-by-section, enabling:
- Granular Context Management: Only read what matters.
- Significant Token Savings: Pay only for the sections you actually send to the model.
- Higher Accuracy: Focus the model's attention on specific sections.
- Intelligent Search: Tolerates typos by fuzzy-searching arXiv and ranking results by relevance.
- Smart Download: Downloads PDFs into organized, ID-based directories.
- Section Splitting: Automatically splits papers into 01_Introduction.md, 02_Methodology.md, etc.
- Incremental Processing: Remembers what you've already downloaded and parsed, so there are no redundant API calls.
- Image Extraction: Extracts images and maintains correct relative links within the Markdown chapters.
- Note Templates: Automatically generates title.md and summary.md for your research notes.
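The section-splitting idea can be illustrated with a short Python sketch. It is not paper-parser's actual implementation: the function names, the choice to split on top-level `# ` headings, and the slug rule are all assumptions made for illustration.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def slugify(title: str) -> str:
    """Turn a heading into a filename-safe slug, e.g. 'Related Work' -> 'Related_Work'."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return slug or "Untitled"


def split_into_chapters(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown at top-level '# ' headings (hypothetical splitting rule).

    Returns (filename, body) pairs numbered in document order. Text before
    the first heading, if any, becomes its own 'Preamble' chapter.
    """
    chapters: list[tuple[str, list[str]]] = []
    title, lines, in_fence = None, [], False

    def flush() -> None:
        # Store the chapter collected so far, skipping an empty preamble.
        if title is not None or any(l.strip() for l in lines):
            chapters.append((title or "Preamble", lines))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else re.match(r"^# +(.+?)\s*$", line)
        if match:
            flush()
            title, lines = match.group(1), [line]
        else:
            lines.append(line)
    flush()

    return [
        (f"{i:02d}_{slugify(name)}.md", "\n".join(body).strip() + "\n")
        for i, (name, body) in enumerate(chapters, start=1)
    ]
```

Each returned pair can then be written to its own file, so an agent opens only the chapter it needs.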
Install from PyPI:

    pip install paper-parser-skill==v0.1.3

Or install from source:

    # Clone the repository
    git clone https://github.com/KaiHangYang/paper-parser-skill.git
    cd paper-parser-skill
    # Install in editable mode
    pip install -e .

The first time you run pp, it will create a configuration file at ~/.paper-parser/config.yaml.

    MINERU_API_TOKEN: "your_token_from_mineru.net"
    PAPER_WORKSPACE: "~/paper-parser-workspace"
    MINERU_API_TIMEOUT: 600

Important
You need an API token from MinerU to use the parsing features.
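Before running the parsing commands, it can help to confirm that the token placeholder has been replaced. The sketch below is an illustration only: `parse_flat_config` and `check_config` are hypothetical helpers, and the tiny line parser is a stand-in for a real YAML library that handles only flat `KEY: value` lines without inline comments.

```python
def parse_flat_config(text: str) -> dict[str, str]:
    """Parse flat 'KEY: value' lines; skips blank lines and '#' comments.

    This is a simplified stand-in for a YAML parser, not a full one.
    """
    config: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("\"'")
    return config


def check_config(config: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    token = config.get("MINERU_API_TOKEN", "")
    if not token or token == "your_token_from_mineru.net":
        problems.append("MINERU_API_TOKEN is missing or still the placeholder")
    timeout = config.get("MINERU_API_TIMEOUT", "")
    if timeout and not timeout.isdigit():
        problems.append("MINERU_API_TIMEOUT should be a whole number")
    return problems
```

To use it, read the text of ~/.paper-parser/config.yaml, pass it to `parse_flat_config`, and print whatever `check_config` returns.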

    # Search for a paper by keyword or arXiv ID
    pp search "LLaMA 3"
    pp search 2303.17564

    # Download a paper PDF (cached if already downloaded)
    pp download 2303.17564

    # Find where a paper is stored locally
    pp path 2303.17564

Warning
pp parse and pp all block until cloud processing completes, which can take several minutes.
For agent/automation use, prefer the async workflow below.

    # Parse a local PDF or an arXiv paper already downloaded
    pp parse 2303.17564
    pp parse ./my_local_paper.pdf

    # Full workflow in one shot: Search → Download → Parse
    pp all 2303.17564

Async workflow (submit, then check):

    # Step 1: Submit for parsing and return immediately
    # → auto-downloads PDF if needed, uploads to MinerU, returns batch_id
    pp submit 2303.17564

    # Step 2: Check status later; downloads results automatically when done
    pp check 2303.17564
    # → "⏳ Still processing" or "✅ Parsing complete!"

Tip
pp submit is idempotent: calling it again on the same paper won't re-upload.
It checks the existing task status and returns the current state instead.
All commands (parse, all, submit, check) share the same .parse_task.json state file,
so you can freely mix sync and async workflows.
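For automation, the submit-then-check pattern can be wrapped in a small polling loop. The following Python sketch is one possible approach, not part of paper-parser: `run_pp` and `wait_for_parse` are hypothetical helpers, and detecting completion by looking for the text "Parsing complete" in the output of pp check is an assumption based on the message quoted above (exit codes are not documented here).

```python
import subprocess
import time
from typing import Callable


def run_pp(*args: str) -> str:
    """Run the pp CLI and return its combined stdout and stderr text."""
    proc = subprocess.run(["pp", *args], capture_output=True, text=True)
    return proc.stdout + proc.stderr


def wait_for_parse(
    paper_id: str,
    run: Callable[..., str] = run_pp,
    interval: float = 30.0,
    max_checks: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Submit a paper, then poll 'pp check' until it reports completion.

    Returns True once the output contains 'Parsing complete' (assumed
    success marker), or False after max_checks unsuccessful polls.
    """
    run("submit", paper_id)  # idempotent according to the tip above
    for attempt in range(max_checks):
        if "Parsing complete" in run("check", paper_id):
            return True
        if attempt < max_checks - 1:
            sleep(interval)  # wait before polling again
    return False
```

Calling `wait_for_parse("2303.17564")` would submit the paper and poll every 30 seconds. The `run` and `sleep` parameters exist so the loop can be exercised without the CLI installed.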

    PAPER_WORKSPACE/
    └── 2303.17564/                  # ArXiv ID
        ├── paper.pdf                # Original PDF
        ├── title.md                 # Paper metadata
        ├── summary.md               # Note-taking template
        ├── .parse_task.json         # Task state (batch_id, status, timestamps)
        └── markdowns/               # AI-Ready Content
            ├── 01_Introduction.md
            ├── 02_Methods.md
            ├── ...
            └── images/              # Extracted figures & tables

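Given this layout, an agent can load a single chapter instead of the whole paper. The sketch below is a hedged illustration: `list_chapters` and `find_chapter` are hypothetical helpers, and matching a chapter by a keyword in its filename is an assumption about how one might select sections.

```python
from pathlib import Path
from typing import Optional


def list_chapters(workspace: str, paper_id: str) -> list[Path]:
    """Return the chapter files for a paper, sorted by their numeric prefix."""
    chapter_dir = Path(workspace).expanduser() / paper_id / "markdowns"
    return sorted(chapter_dir.glob("*.md"))


def find_chapter(workspace: str, paper_id: str, keyword: str) -> Optional[Path]:
    """Return the first chapter whose filename contains keyword (case-insensitive)."""
    needle = keyword.lower()
    for path in list_chapters(workspace, paper_id):
        if needle in path.stem.lower():
            return path
    return None
```

For example, `find_chapter("~/paper-parser-workspace", "2303.17564", "method")` would return the path to 02_Methods.md if that file exists, and its text can then be read with `Path.read_text`.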
paper-parser builds on:
- arXiv for the academic paper API.
- RapidFuzz for fast fuzzy string matching.
- MinerU (mineru.net) for high-quality PDF-to-Markdown parsing.