raintree-technology/docpull

docpull

Pull documentation from any website and convert it to clean, AI-ready Markdown.

Python 3.10+ | License: MIT

Install

pip install docpull

Usage

# Basic fetch
docpull https://docs.example.com

# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs

# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"

# Enable caching for incremental updates
docpull https://docs.example.com --cache

# JavaScript-heavy sites (quote the extra so shells such as zsh do not expand the brackets)
pip install "docpull[js]"
docpull https://spa-site.com --js
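
The include and exclude patterns above look like shell-style globs, but this README does not spell out the matching rules. The sketch below shows one plausible interpretation using Python's `fnmatch` against the URL path. `should_crawl` is a hypothetical helper for illustration, not part of docpull, and the real matching behavior may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url, include=(), exclude=()):
    """Return True if the URL path passes the include/exclude globs.

    Assumption: patterns are shell-style globs matched against the URL
    path only. docpull's real matching rules may differ.
    """
    path = urlparse(url).path or "/"
    # An exclude match always wins.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # With no include patterns, everything not excluded is allowed.
    if not include:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include)


if __name__ == "__main__":
    rules = {"include": ["/api/*"], "exclude": ["/changelog/*"]}
    for url in ("https://docs.example.com/api/users",
                "https://docs.example.com/changelog/v2"):
        print(url, should_crawl(url, **rules))
```

Letting exclusions take precedence is a design choice in this sketch; it keeps a broad include pattern from pulling in paths that were explicitly ruled out.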

Profiles

docpull https://site.com --profile rag     # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror  # Full site archive with caching
docpull https://site.com --profile quick   # Fast sampling (50 pages, depth 2)

Options

Crawl:
  --max-pages N          Maximum pages to fetch
  --max-depth N          Maximum crawl depth
  --include-paths P      Only crawl matching URL patterns
  --exclude-paths P      Skip matching URL patterns
  --js                   Enable JavaScript rendering

Cache:
  --cache                Enable caching for incremental updates
  --cache-dir DIR        Cache directory (default: .docpull-cache)
  --cache-ttl DAYS       Days before cache expires (default: 30)

Content:
  --streaming-dedup      Real-time duplicate detection
  --language CODE        Filter by language (e.g., en)

Output:
  --output-dir, -o DIR   Output directory (default: ./docs)
  --dry-run              Show what would be fetched
  --verbose, -v          Verbose output

See docpull --help for all options.
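
For scripted runs, the flags listed above can be assembled into an argument list and handed to `subprocess.run`. The following is a minimal sketch: `build_docpull_command` is a hypothetical helper, it covers only a few of the documented flags, and it does not invoke docpull itself.

```python
import shlex


def build_docpull_command(url, max_pages=None, max_depth=None,
                          output_dir=None, cache=False, dry_run=False):
    """Assemble a docpull argv list from flags documented in this README."""
    cmd = ["docpull", url]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if cache:
        cmd.append("--cache")
    if dry_run:
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    cmd = build_docpull_command("https://docs.example.com",
                                max_pages=100, dry_run=True)
    # Print the shell-quoted command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell quoting problems when URLs or directories contain spaces or special characters.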

Python API

import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main():
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )

    async with Fetcher(config) as fetcher:
        # Stream events as the crawl progresses
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

        # Summary once the crawl has finished
        print(f"Done: {fetcher.stats.pages_fetched} pages")


asyncio.run(main())

Output

Each page becomes a Markdown file with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
...
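
Downstream tools, such as an indexing step in a RAG pipeline, usually need the metadata separated from the body. The sketch below does this with the standard library only. `split_frontmatter` is a hypothetical helper, and it handles only flat `key: value` lines like those in the example above; a full YAML parser would be the safer choice for anything more complex.

```python
def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body).

    Only handles flat `key: value` lines like the example above; it is
    not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    # No closing delimiter: treat the whole text as body.
    return {}, text


if __name__ == "__main__":
    page = (
        "---\n"
        'title: "Getting Started"\n'
        "source: https://docs.example.com/guide\n"
        "---\n"
        "# Getting Started\n"
    )
    meta, body = split_frontmatter(page)
    print(meta["title"], meta["source"])
```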

Security

HTTPS-only, mandatory robots.txt compliance
Blocks private/internal network IPs
Path traversal and XXE protection
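
To make the first two points concrete, here is a rough sketch of an HTTPS-only check combined with a private-address block, written with the standard `ipaddress` and `urllib.parse` modules. It is illustrative only and is not docpull's implementation: `is_url_allowed` is a hypothetical name, and the check covers only literal IP hosts, whereas a real crawler would also need to resolve hostnames and validate the resulting addresses.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url):
    """Reject non-HTTPS URLs and literal private/internal IP hosts.

    Illustrative only: hostnames are not resolved here, so a DNS name
    pointing at an internal address would still pass this check.
    """
    parts = urlparse(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        addr = ipaddress.ip_address(parts.hostname)
    except ValueError:
        # Not an IP literal; treat it as a regular hostname.
        return True
    return not (addr.is_private or addr.is_loopback
                or addr.is_link_local or addr.is_reserved)


if __name__ == "__main__":
    for url in ("https://docs.example.com/guide",
                "http://docs.example.com/guide",
                "https://10.0.0.5/admin"):
        print(url, is_url_allowed(url))
```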

Troubleshooting

docpull --doctor       # Check installation
docpull URL --verbose  # Verbose output
docpull URL --dry-run  # Test without downloading

License

MIT

About

A PyPI package that pulls documentation from any website and converts it into clean, AI-ready Markdown. Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.

docpull.raintree.technology/

Releases 7

v2.2.0: Resume, Auth, JSON/SQLite output Latest

Dec 15, 2025

+ 6 releases
