
Paper Parser 🛠️

Efficient arXiv Search, Download, and AI-Friendly Markdown Parsing.

paper-parser is a CLI tool designed to streamline the academic research workflow. It handles everything from finding a paper on arXiv to converting it into a clean, structured Markdown format that is optimized for LLMs and AI agents.

🚀 Why Use Paper Parser?

Standard PDF-to-text tools often produce one massive block of text, which leads to two major problems when working with AI:

Context Overflow: Large papers can exceed an LLM's context window.
Token Waste: Paying for the entire paper's context when you only need to analyze the "Methodology" or "Conclusion" is expensive and slow.

The Solution: paper-parser uses the MinerU V4 API to extract high-quality Markdown and then automatically splits the paper into chapters. This allows AI agents to read the paper section-by-section, enabling:

✅ Granular Context Management: Only read what matters.
✅ Significant Token Savings: Pay only for the sections you send, not the whole paper.
✅ Higher Accuracy: Focus the model's attention on specific sections.

✨ Key Features

🔍 Intelligent Search: Typos? No problem. Fuzzy-searches arXiv with relevance ranking.
📥 Smart Download: Downloads PDFs into organized, ID-based directories.
🧩 Section Splitting: Automatically splits papers into 01_Introduction.md, 02_Methodology.md, etc.
📦 Incremental Processing: Remembers what you've already downloaded and parsed—no redundant API calls.
🖼️ Image Extraction: Extracts images and maintains correct relative links within the Markdown chapters.
📝 Note Templates: Automatically generates title.md (paper metadata) and summary.md (a note-taking template) for your research notes.
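
To illustrate the idea behind typo-tolerant search, here is a minimal sketch of ranking candidate titles against a misspelled query. It is not paper-parser's implementation: the project credits RapidFuzz, while this example swaps in the standard library's difflib so it runs without extra dependencies. The function name rank_titles and the sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def rank_titles(query: str, titles: list[str]) -> list[tuple[float, str]]:
    """Return (score, title) pairs sorted from best to worst match.

    Scores are difflib similarity ratios in [0, 1], computed on
    lowercased strings so that capitalization does not affect ranking.
    """
    q = query.lower()
    scored = [(SequenceMatcher(None, q, t.lower()).ratio(), t) for t in titles]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical candidate list; a real search would get these from arXiv.
    candidates = [
        "Deep Residual Learning for Image Recognition",
        "Attention Is All You Need",
        "BloombergGPT: A Large Language Model for Finance",
    ]
    for score, title in rank_titles("atention is all you ned", candidates):
        print(f"{score:.2f}  {title}")
```

A dedicated library such as RapidFuzz offers faster and more varied scorers; the ranking principle (score every candidate, sort descending) stays the same.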

🛠️ Installation

From PyPI (Recommended)

pip install paper-parser-skill==v0.1.3

From Source

# Clone the repository
git clone https://github.com/KaiHangYang/paper-parser-skill.git
cd paper-parser-skill
# Install in editable mode
pip install -e .

⚙️ Configuration

The first time you run pp, it will create a configuration file at ~/.paper-parser/config.yaml.

MINERU_API_TOKEN: "your_token_from_mineru.net"
PAPER_WORKSPACE: "~/paper-parser-workspace"
MINERU_API_TIMEOUT: 600

Important

You need an API token from MinerU to use the parsing features.
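
Before running any parsing command, it can help to sanity-check the configuration. The sketch below is an assumption-laden example, not part of paper-parser: parse_flat_config is a hand-rolled reader that only understands flat `KEY: value` lines like the three shown above (a real YAML library would be the robust choice), and the validation rules in validate_config, including the requirement that MINERU_API_TIMEOUT be a positive integer, are guesses about what a sensible config looks like.

```python
from pathlib import Path


def parse_flat_config(text: str) -> dict[str, str]:
    """Parse flat `KEY: value` lines. This is NOT a general YAML parser."""
    config: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Drop surrounding whitespace and optional quotes around the value.
        config[key.strip()] = value.strip().strip('"').strip("'")
    return config


def validate_config(config: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    token = config.get("MINERU_API_TOKEN", "")
    if not token or token == "your_token_from_mineru.net":
        problems.append("MINERU_API_TOKEN is missing or still the placeholder")
    if not config.get("PAPER_WORKSPACE"):
        problems.append("PAPER_WORKSPACE is not set")
    timeout = config.get("MINERU_API_TIMEOUT", "")
    if not timeout.isdigit() or int(timeout) <= 0:
        problems.append("MINERU_API_TIMEOUT must be a positive integer")
    return problems


def workspace_path(config: dict[str, str]) -> Path:
    """Expand a leading `~` in PAPER_WORKSPACE to the user's home directory."""
    return Path(config["PAPER_WORKSPACE"]).expanduser()


if __name__ == "__main__":
    path = Path("~/.paper-parser/config.yaml").expanduser()
    if path.exists():
        for problem in validate_config(parse_flat_config(path.read_text(encoding="utf-8"))):
            print("config problem:", problem)
```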

📖 Usage Guide

Basic Commands

# Search for a paper by keyword or arXiv ID
pp search "LLaMA 3"
pp search 2303.17564
# Download a paper PDF (cached if already downloaded)
pp download 2303.17564
# Find where a paper is stored locally
pp path 2303.17564

Parsing — Synchronous (Blocking ⚠️)

Warning

pp parse and pp all block until cloud processing completes, which can take several minutes. For agent/automation use, prefer the async workflow below.

# Parse a local PDF or an arXiv paper already downloaded
pp parse 2303.17564
pp parse ./my_local_paper.pdf
# Full workflow in one shot: Search → Download → Parse
pp all 2303.17564

Parsing — Async (Recommended for Agents ✅)

# Step 1: Submit for parsing and return immediately
# → auto-downloads PDF if needed, uploads to MinerU, returns batch_id
pp submit 2303.17564
# Step 2: Check status later — downloads results automatically when done
pp check 2303.17564
# → "⏳ Still processing" or "✅ Parsing complete!"
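
For an agent that wants to submit a paper and come back later, the two steps above can be wrapped in a small polling loop. The sketch below is one possible approach under stated assumptions: it shells out to `pp check` and looks for the "Parsing complete" text quoted above, because the command's exit codes are not documented here. The helper names, the 30-second interval, and the attempt limit are all hypothetical choices.

```python
import subprocess
import time
from typing import Callable


def run_pp_check(arxiv_id: str) -> str:
    """Run `pp check <id>` and return its combined stdout and stderr."""
    result = subprocess.run(
        ["pp", "check", arxiv_id], capture_output=True, text=True
    )
    return result.stdout + result.stderr


def wait_for_parse(
    arxiv_id: str,
    check: Callable[[str], str] = run_pp_check,
    interval: float = 30.0,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the check output reports completion.

    Returns True once "Parsing complete" appears in the output, or False
    if `max_attempts` checks pass without seeing it.
    """
    for attempt in range(max_attempts):
        if "Parsing complete" in check(arxiv_id):
            return True
        if attempt < max_attempts - 1:
            sleep(interval)  # wait before the next check
    return False


if __name__ == "__main__":
    subprocess.run(["pp", "submit", "2303.17564"], check=True)
    print("done" if wait_for_parse("2303.17564") else "timed out")
```

Passing `check` and `sleep` as parameters keeps the loop testable without the CLI installed or any real waiting.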

Tip

pp submit is idempotent: calling it again on the same paper won't re-upload. It checks the existing task status and returns the current state instead. All commands (parse, all, submit, check) share the same .parse_task.json state file, so you can freely mix sync and async workflows.
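
The idempotency described above can be pictured as "consult the state file before uploading". The following sketch shows that pattern in isolation and should not be read as paper-parser's actual code: the `upload` callable and the "submitted" status value are hypothetical, the `batch_id` and `status` keys are assumed from the state file description in this README, and the real command also refreshes the task status, which is omitted here.

```python
import json
from pathlib import Path
from typing import Callable

STATE_FILE = ".parse_task.json"


def submit_once(paper_dir: Path, upload: Callable[[], str]) -> dict:
    """Upload only if no task state exists yet; otherwise return the saved state.

    `upload` is a hypothetical callable that performs the upload and
    returns a batch ID. It is invoked at most once per paper directory.
    """
    state_path = paper_dir / STATE_FILE
    if state_path.exists():
        # A task was already submitted: reuse it instead of re-uploading.
        return json.loads(state_path.read_text(encoding="utf-8"))

    state = {"batch_id": upload(), "status": "submitted"}
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```

Because every command would read and write the same file, a second call (sync or async) sees the first call's state rather than starting a new task.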

📂 Output Structure

PAPER_WORKSPACE/
└── 2303.17564/                  # arXiv ID
    ├── paper.pdf                # Original PDF
    ├── title.md                 # Paper metadata
    ├── summary.md               # Note-taking template
    ├── .parse_task.json         # Task state (batch_id, status, timestamps)
    └── markdowns/               # AI-Ready Content
        ├── 01_Introduction.md
        ├── 02_Methods.md
        ├── ...
        └── images/              # Extracted figures & tables
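
Given the layout above, an agent can load just the chapters it needs. This is a minimal sketch, assuming chapter files carry a two-digit numeric prefix as in the example names; list_sections and read_matching_sections are hypothetical helpers, not functions exported by paper-parser.

```python
from pathlib import Path


def list_sections(paper_dir: Path) -> list[Path]:
    """Return chapter files under markdowns/ in numeric-prefix order."""
    return sorted((paper_dir / "markdowns").glob("[0-9][0-9]_*.md"))


def read_matching_sections(paper_dir: Path, keyword: str) -> dict[str, str]:
    """Read only the chapters whose file name contains `keyword`.

    Matching is case-insensitive, so "method" finds 02_Methods.md.
    """
    keyword = keyword.lower()
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in list_sections(paper_dir)
        if keyword in path.stem.lower()
    }


if __name__ == "__main__":
    paper = Path("~/paper-parser-workspace/2303.17564").expanduser()
    for name in read_matching_sections(paper, "method"):
        print("loaded", name)
```

Listing file names first and reading contents only on a match means unneeded sections are never loaded into the model's context.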

🤝 Acknowledgments

arXiv for the academic paper API.
RapidFuzz for fast fuzzy string matching.
MinerU (mineru.net) for high-quality PDF-to-Markdown parsing.

📜 License

MIT
