utility scripts full and recursive site downloads relinking urls

aeonbridge/ABDoc

🕸️ ABDoc - Website Documentation Archiver


ABDoc is a website documentation downloader and archiver developed by AeonBridge Co. It recursively downloads entire websites or documentation sites for offline access, knowledge preservation, compliance auditing, and content archival.


🚀 Key Features

  • 🔄 Recursive Website Downloading - Complete site mirroring with all assets
  • 📝 Dual-Format Output - HTML preservation + Markdown conversion
  • 🔗 Smart Link Conversion - Automatic offline navigation setup
  • 🖥️ Local Server Generation - One-click local hosting with Python HTTP server
  • 🌐 Cross-Platform Support - Linux, macOS, and Windows (WSL)
  • ⚡ High-Quality Conversion - Powered by IBM's docling library
  • 🛡️ Enterprise-Ready - Built for compliance, auditing, and knowledge management
  • 🤖 Intelligent Crawling - Respectful delays, retry logic, and user-agent simulation
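To illustrate what "Smart Link Conversion" involves, here is a minimal Python sketch of rewriting a same-site link into a relative path that still resolves in an offline mirror. It is not taken from the ABDoc scripts (wget itself offers `--convert-links` for this job). The function name `to_offline_link`, the host `docs.example.com`, and the rule that directory-style URLs are stored as `index.html` are all assumptions made for the example.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def to_offline_link(page_url: str, href: str, site_host: str) -> str:
    """Return href rewritten as a relative path usable in an offline mirror."""
    target = urlsplit(urljoin(page_url, href))
    if target.netloc != site_host:
        return href  # external link: leave it untouched

    page_path = urlsplit(page_url).path or "/"
    target_path = target.path or "/"
    # Assumption: directory-style URLs are saved as index.html in the mirror.
    if page_path.endswith("/"):
        page_path += "index.html"
    if target_path.endswith("/"):
        target_path += "index.html"

    relative = posixpath.relpath(target_path, posixpath.dirname(page_path))
    if target.fragment:
        relative += "#" + target.fragment
    return relative
```

Under these assumptions, a link to `/api/index.html` found on `https://docs.example.com/guide/intro.html` becomes `../api/index.html`, while links to other hosts are returned unchanged.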

📦 What's Included

Core Tools

| Tool | Purpose | Best For |
|------|---------|----------|
| ab_downloader_html2md.sh | Full site download + Markdown conversion | Documentation archival |
| ab_download_only.sh | HTML-only site mirroring | Quick offline access |
| simple_downloader.py | Python fallback downloader | Cross-platform compatibility |
| website_downloader.sh | Streamlined wget wrapper | Simple site downloads |

Sample Downloads Included

  • 📖 Evolution API v2 Documentation (scripts/wget_/samples/evolution_v2/)
  • 🔧 n8n Workflow Documentation (scripts/wget_/samples/n8n/)

🛠️ Installation

Prerequisites

# System requirements
- Bash shell (Linux/macOS/WSL)
- wget (primary download engine)
- Python 3.12+ (for server and fallback functionality)

Quick Setup

# Clone the repository
git clone https://github.com/aeonbridge/ABDoc.git
cd ABDoc
# Install Python dependencies (for HTML→Markdown conversion)
uv sync
# OR
pip install docling requests beautifulsoup4
# Make scripts executable
chmod +x scripts/*.sh
chmod +x scripts/wget_/*.sh

🎯 Usage

Method 1: Complete Site Archival (HTML + Markdown)

# Download and convert to Markdown
./scripts/wget_/ab_downloader_html2md.sh <URL> <output_folder>
# Example: Archive Django documentation
./scripts/wget_/ab_downloader_html2md.sh https://docs.djangoproject.com/ django_docs

Output Structure:

django_docs/
├── html/                       # Complete HTML mirror
│   └── docs.djangoproject.com/
├── md/                         # Markdown conversions
│   ├── index.md
│   ├── tutorial.md
│   └── ...
└── launch_server.py

Method 2: HTML-Only Download

# Quick HTML mirroring
./scripts/wget_/ab_download_only.sh <URL>
# Example: Mirror React documentation
./scripts/wget_/ab_download_only.sh https://react.dev/

Method 3: Python Fallback (Cross-Platform)

# When wget is unavailable
python src/simple_downloader.py
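The core step of any Python-based recursive downloader is finding which links on a page belong to the same site. The sketch below shows that step using only the standard library. It is not the code in `simple_downloader.py` (the install instructions suggest that script relies on requests and beautifulsoup4), and the names `LinkCollector` and `same_site_links` are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def same_site_links(base_url: str, html: str) -> list:
    """Return unique http(s) links that stay on the host of base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlsplit(base_url).netloc
    unique = []
    for link in collector.links:
        parts = urlsplit(link)
        if parts.scheme in ("http", "https") and parts.netloc == host and link not in unique:
            unique.append(link)
    return unique
```

A crawler would call `same_site_links` on each downloaded page and queue the results it has not yet visited.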

Method 4: Interactive Downloader

# Guided download process
./scripts/sd.sh
# OR specify URL directly
./scripts/sd.sh https://your-target-site.com

🌐 Local Server Access

After downloading, start the local server:

cd <output_directory>
python launch_server.py
# OR
./launch.sh
# Then visit: http://localhost:8000
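For readers curious what a launcher of this kind typically does, here is a minimal sketch built on Python's standard `http.server` module. The generated `launch_server.py` is not shown in this README, so this is an assumed equivalent, not its actual contents. `make_server` is a hypothetical name, and port 8000 is taken from the URL above.

```python
import functools
import http.server


def make_server(directory: str, port: int = 8000) -> http.server.ThreadingHTTPServer:
    """Create an HTTP server that serves files from directory on localhost."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to the loopback interface only, so the mirror is not exposed on the network.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


if __name__ == "__main__":
    with make_server(".") as httpd:
        print("Serving on http://localhost:8000 (Ctrl+C to stop)")
        httpd.serve_forever()
```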

🎨 Advanced Configuration

Custom wget Parameters

Edit the scripts to modify download behavior:

# Common modifications in ab_download_only.sh
--wait=2 # Increase delay between requests
--limit-rate=200k # Limit download speed
--accept="html,css,js,png" # Only download specific file types
--exclude-directories=admin # Skip certain directories
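To see how these options fit together, the following Python sketch assembles a full wget command line. The four tunable options come from the list above. The base flags (`--mirror`, `--convert-links`, `--page-requisites`, `--no-parent`) are standard wget options assumed here for illustration, and may differ from what `ab_download_only.sh` actually passes. `build_wget_command` is a hypothetical helper.

```python
import shlex


def build_wget_command(
    url: str,
    wait_seconds: int = 2,
    rate_limit: str = "200k",
    accept: tuple = ("html", "css", "js", "png"),
    exclude_dirs: tuple = ("admin",),
) -> list:
    """Build a wget argument list for a polite recursive mirror."""
    command = [
        "wget",
        "--mirror",            # assumed base flags, not confirmed by the script
        "--convert-links",
        "--page-requisites",
        "--no-parent",
        f"--wait={wait_seconds}",
        f"--limit-rate={rate_limit}",
    ]
    if accept:
        command.append("--accept=" + ",".join(accept))
    if exclude_dirs:
        command.append("--exclude-directories=" + ",".join(exclude_dirs))
    command.append(url)
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_wget_command("https://docs.example.com/")))
```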

HTML→Markdown Conversion

The ab_downloader_html2md.sh script uses IBM's docling library for high-quality conversion:

# Customize docling behavior in the script (--output names a directory)
docling <html_file> --to md --output <md_output_dir>
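The same conversion can also be driven from Python through docling's `DocumentConverter` class. The sketch below is an illustration, not part of ABDoc: `html_to_markdown` is a hypothetical helper, and the optional `converter` argument exists only so the function can be tried with a stand-in object when docling is not installed.

```python
from pathlib import Path


def html_to_markdown(html_file: Path, md_file: Path, converter=None) -> Path:
    """Convert one HTML file to Markdown and write it to md_file."""
    if converter is None:
        # Imported lazily so the module loads even without docling installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    result = converter.convert(str(html_file))
    md_file.parent.mkdir(parents=True, exist_ok=True)
    md_file.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return md_file
```

A call such as `html_to_markdown(Path("html/index.html"), Path("md/index.md"))` would then produce one Markdown file per page.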

📋 Supported Websites

✅ Excellent Support:

  • Technical documentation sites (GitBook, MkDocs, Sphinx)
  • API documentation (OpenAPI, REST docs)
  • Knowledge bases and wikis
  • Static content sites

⚠️ Limited Support:

  • Single-page applications (SPA)
  • JavaScript-heavy dynamic sites
  • Authentication-required content
  • Streaming or real-time content

🔧 Troubleshooting

Common Issues

wget not found:

# Install wget
# Ubuntu/Debian: sudo apt-get install wget
# macOS: brew install wget
# Or use Python fallback: python src/simple_downloader.py
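The choice between wget and the Python fallback can be automated. Below is a small sketch that uses `shutil.which` to detect wget and returns the matching command from this README. `choose_downloader` is a hypothetical helper, and the `which` parameter exists only to make the function easy to test.

```python
import shutil


def choose_downloader(url: str, which=shutil.which) -> list:
    """Pick the wget-based script when wget is on PATH, else the Python fallback."""
    if which("wget"):
        return ["./scripts/wget_/ab_download_only.sh", url]
    # The fallback is invoked without arguments, as in the usage section above.
    return ["python", "src/simple_downloader.py"]


if __name__ == "__main__":
    print(choose_downloader("https://docs.example.com/"))
```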

Permission denied:

chmod +x scripts/*.sh
chmod +x scripts/wget_/*.sh

Large sites timing out:

# Increase timeout in script
--timeout=60
--tries=5
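For the Python fallback, comparable behavior could be written as a retry loop around a timed request. The sketch below uses exponential backoff between attempts, a common choice that is not necessarily what wget or `simple_downloader.py` does. The names `fetch_url` and `fetch_with_retries` are hypothetical, and the defaults mirror the `--timeout=60` and `--tries=5` values above.

```python
import time
import urllib.request


def fetch_url(url: str, timeout: float = 60.0) -> bytes:
    """Download one URL, giving up after timeout seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retries(url, fetch=fetch_url, tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to tries times, doubling the delay after each failure."""
    for attempt in range(1, tries + 1):
        try:
            return fetch(url)
        except OSError:  # covers URLError, timeouts, and connection errors
            if attempt == tries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```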

💡 Use Cases

  • 📚 Documentation Archival - Preserve technical knowledge for offline access
  • 🔍 Compliance & Auditing - Archive content for regulatory requirements
  • 🎓 Training & Education - Create offline training materials
  • 🏢 Enterprise Knowledge Management - Centralize critical documentation
  • 🔬 Research & Analysis - Systematic content collection and analysis
  • 🌐 Digital Preservation - Long-term content preservation initiatives

🀝 Contributing

We welcome contributions! Here's how you can help:

  1. 🐛 Report Bugs - Open an issue with detailed reproduction steps
  2. 💡 Suggest Features - Share your ideas for new functionality
  3. 🔧 Submit Pull Requests - Contribute code improvements
  4. 📖 Improve Documentation - Help us make ABDoc more accessible

Development Setup

git clone https://github.com/aeonbridge/ABDoc.git
cd ABDoc
uv sync # Development dependencies are included by default
pre-commit install # If using pre-commit hooks

📄 License & Disclaimers

License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Important Disclaimers

PROVIDED "AS IS" - NO EXTENDED SUPPORT

  • βœ… Open Source & Free - Use freely for any purpose
  • ❌ No Warranty - Software provided without any guarantees
  • ❌ Limited Support - Community-driven support only
  • ❌ No SLA - No service level commitments
  • βš–οΈ User Responsibility - Ensure compliance with target site terms of service
  • πŸ”’ Respect Robots.txt - Consider ethical crawling practices

Responsible Usage

  • Respect Copyright - Only download content you have permission to access
  • Follow Terms of Service - Comply with target website policies
  • Be Respectful - Use appropriate delays and don't overload servers
  • Legal Compliance - Ensure your use case complies with applicable laws
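One concrete way to act on these guidelines is to consult a site's robots.txt before crawling. The sketch below uses Python's standard `urllib.robotparser`. It is an illustration only and not part of the ABDoc scripts. The helper name `allowed_by_robots` and the user agent `ABDocBot` are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /admin/\n"
    print(allowed_by_robots(rules, "ABDocBot", "https://docs.example.com/admin/users"))
    print(allowed_by_robots(rules, "ABDocBot", "https://docs.example.com/guide/"))
```

In practice the robots.txt text would be downloaded from the target site once, before the crawl starts.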

🏒 About AeonBridge Co.

Strategic Innovation in Data, Knowledge & Artificial Intelligence
Transforming information into intelligence and connecting you to the future.

AeonBridge Co. is dedicated to developing innovative solutions for knowledge management, data intelligence, and AI-powered automation. ABDoc represents our commitment to open-source tools that empower organizations to preserve and leverage their critical information assets.

🌐 Learn More: www.aeonbridge.co


📞 Community & Support

  • 📧 General Questions - Open a GitHub Discussion
  • 🐛 Bug Reports - Create a GitHub Issue
  • 💡 Feature Requests - Submit via GitHub Issues
  • 🤝 Enterprise Inquiries - Contact AeonBridge Co. directly

🙏 Acknowledgments

Special thanks to:

  • IBM Research - For the excellent docling library
  • GNU wget team - For the powerful downloading engine
  • Python community - For the robust ecosystem
  • Open Source Contributors - For inspiration and best practices

🌟 Star this repository if ABDoc helps you preserve knowledge! 🌟



Last updated: 2024 | Version: 1.0.0 | Maintained by AeonBridge Co.
