A PyPI package that pulls documentation from any website and converts it into clean, AI-ready Markdown. Fast, type-safe, secure, and optimized for building knowledge bases or training datasets.


raintree-technology/docpull


docpull

Pull documentation from any website and convert it to clean, AI-ready Markdown.

Requires Python 3.10+. License: MIT.


Install

pip install docpull

Usage

# Basic fetch
docpull https://docs.example.com
# With options
docpull https://aptos.dev --max-pages 100 --output-dir ./docs
# Filter paths
docpull https://docs.example.com --include-paths "/api/*" --exclude-paths "/changelog/*"
# Enable caching for incremental updates
docpull https://docs.example.com --cache
# JavaScript-heavy sites
pip install "docpull[js]"  # quotes keep shells such as zsh from expanding the brackets
docpull https://spa-site.com --js
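The `--include-paths` and `--exclude-paths` flags take glob-style patterns. As a rough illustration of how such filtering can behave, here is a minimal sketch built on Python's standard `fnmatch` module. It is not docpull's implementation: `should_crawl` is a hypothetical helper, and matching against the URL path only, with exclusions taking precedence, is an assumption.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url, include=(), exclude=()):
    """Hypothetical helper: decide whether a URL passes glob-style path filters.

    Assumptions: patterns are matched against the URL path only, and
    exclude patterns win over include patterns.
    """
    path = urlparse(url).path or "/"
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # With no include patterns, everything that is not excluded is allowed.
    if not include:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include)


print(should_crawl("https://docs.example.com/api/users", include=["/api/*"]))
print(should_crawl("https://docs.example.com/changelog/v2", exclude=["/changelog/*"]))
```

Running `docpull URL --dry-run` with the same patterns is the reliable way to see which pages the real tool would select.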

Profiles

docpull https://site.com --profile rag # Optimized for RAG/LLM (default)
docpull https://site.com --profile mirror # Full site archive with caching
docpull https://site.com --profile quick # Fast sampling (50 pages, depth 2)

Options

Crawl:
  --max-pages N         Maximum pages to fetch
  --max-depth N         Maximum crawl depth
  --include-paths P     Only crawl matching URL patterns
  --exclude-paths P     Skip matching URL patterns
  --js                  Enable JavaScript rendering
Cache:
  --cache               Enable caching for incremental updates
  --cache-dir DIR       Cache directory (default: .docpull-cache)
  --cache-ttl DAYS      Days before cache expires (default: 30)
Content:
  --streaming-dedup     Real-time duplicate detection
  --language CODE       Filter by language (e.g., en)
Output:
  --output-dir, -o DIR  Output directory (default: ./docs)
  --dry-run             Show what would be fetched
  --verbose, -v         Verbose output

See docpull --help for all options.
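To make the `--cache-ttl` option concrete, the sketch below shows one way a time-to-live check can be expressed. The helper name `is_cache_fresh` and the rule that an entry expires once its age reaches the TTL are assumptions for illustration; docpull's cache may work differently internally.

```python
from datetime import datetime, timedelta, timezone


def is_cache_fresh(fetched_at, ttl_days=30, now=None):
    """Hypothetical helper: True if a cached page is still within its TTL.

    Assumption: an entry expires once its age reaches ttl_days.
    The default of 30 mirrors the documented --cache-ttl default.
    """
    now = now or datetime.now(timezone.utc)
    return now - fetched_at < timedelta(days=ttl_days)


fetched = datetime.now(timezone.utc) - timedelta(days=45)
print(is_cache_fresh(fetched))               # 45 days old, past the 30-day TTL
print(is_cache_fresh(fetched, ttl_days=90))  # still fresh with a longer TTL
```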

Python API

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    config = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.RAG,
        crawl={"max_pages": 100},
        cache={"enabled": True},
    )
    async with Fetcher(config) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Output

Each page becomes a Markdown file with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
...
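When loading these files into a knowledge base, the frontmatter usually needs to be separated from the body. The following is a minimal standard-library sketch, assuming the frontmatter contains only flat `key: value` lines as in the example above. `split_frontmatter` is a hypothetical helper and not part of docpull; a real YAML parser is the safer choice if the metadata is nested.

```python
def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body).

    Assumption: frontmatter is a block of flat `key: value` lines
    delimited by `---`. Nested YAML is not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0] != "---":
        return {}, text
    try:
        end = lines.index("---", 1)
    except ValueError:
        return {}, text  # no closing delimiter: treat as plain Markdown
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:])
    return meta, body


page = '''---
title: "Getting Started"
source: https://docs.example.com/guide
---
# Getting Started
'''
meta, body = split_frontmatter(page)
print(meta["title"], meta["source"])
```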

Security

  • HTTPS-only, mandatory robots.txt compliance
  • Blocks private/internal network IPs
  • Path traversal and XXE protection
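The first two protections can be illustrated with Python's standard `ipaddress` module. This is a sketch of the general technique, not docpull's code: `is_url_allowed` is a hypothetical function, and it assumes the caller has already resolved the hostname and passes in the IP address it will actually connect to.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url, resolved_ip):
    """Hypothetical check: HTTPS only, and the resolved address must be public.

    Assumption: the hostname was resolved by the caller, so no DNS
    lookup happens here.
    """
    if urlparse(url).scheme != "https":
        return False
    ip = ipaddress.ip_address(resolved_ip)
    # Reject private ranges, loopback, link-local (e.g. cloud metadata
    # endpoints) and reserved addresses.
    return not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved)


print(is_url_allowed("https://internal.example", "10.0.0.5"))
print(is_url_allowed("http://docs.example.com", "93.184.216.34"))
```

Checking the resolved address rather than the hostname matters because a public-looking hostname can resolve to an internal address.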

Troubleshooting

docpull --doctor # Check installation
docpull URL --verbose # Verbose output
docpull URL --dry-run # Test without downloading

License

MIT
