DiscovAI/DiscovAI-crawl

DiscovAI Crawl API 🕷️🔍

One API to scrape everything you need from URLs for your AI tool and vector database.

🚧 Work in Progress 🚧

🌟 Features

For each URL, the API extracts and processes the following:

  • 🧼 Clean HTML (JavaScript and CSS removed)
  • 📝 LLM-friendly Markdown conversion
  • 🚫 Ad-free, cookie banner-free, and dialog-free content
  • 📸 Website screenshots (auto-saved to AWS S3 or Cloudflare R2)
  • 🤖 LLM-generated SEO-friendly content
  • 🔑 LLM-extracted key information (summary, features, FAQs, etc.)
  • 🧠 Ready-to-use embeddings for vector database integration (auto-saved to db)
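As an illustration of how the returned embeddings might be used once stored, the sketch below compares two vectors with cosine similarity, a common ranking measure in vector databases. It assumes embeddings are plain arrays of numbers; `cosineSimilarity` is a hypothetical helper, not part of this project's API.

```typescript
// Hypothetical helper (not part of DiscovAI Crawl): cosine similarity
// between two embedding vectors, assumed to be plain number arrays.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0; // lengths are equal, so this fallback is never used
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 indicate unrelated content.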

🔧 Installation

pnpm i
cd apps/api && pnpm exec playwright install

🚀 Usage

pnpm dev
open http://localhost:3000

📦 API Response Structure

{
  "clean_html": "...",
  "LLM_friendly_markdown": "...",
  "clean_text": "...",
  "screenshot_url": "...",
  "llm_extracts_key_info": {
    "what": "...",
    "summary": "...",
    "features": ["...", "..."],
    "faqs": [{"q": "...", "a": "..."}]
  },
  "llm_summarized_detail": "...",
  "embeddings": [...]
}
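For TypeScript consumers, the response shape above could be typed roughly as follows. This is a sketch, not an official schema: the field names are taken from the sample, while the value types (in particular `embeddings` as a flat array of numbers) and the `isCrawlResponse` helper are assumptions.

```typescript
// Field names mirror the sample response; value types are assumptions.
interface Faq {
  q: string;
  a: string;
}

interface KeyInfo {
  what: string;
  summary: string;
  features: string[];
  faqs: Faq[];
}

interface CrawlResponse {
  clean_html: string;
  LLM_friendly_markdown: string;
  clean_text: string;
  screenshot_url: string;
  llm_extracts_key_info: KeyInfo;
  llm_summarized_detail: string;
  embeddings: number[]; // assumption: a flat numeric vector
}

// Hypothetical runtime check for a parsed JSON value of unknown shape.
// It verifies top-level fields and the key-info container, not every FAQ entry.
function isCrawlResponse(value: unknown): value is CrawlResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = [
    "clean_html",
    "LLM_friendly_markdown",
    "clean_text",
    "screenshot_url",
    "llm_summarized_detail",
  ];
  if (!stringFields.every((key) => typeof v[key] === "string")) return false;

  const info = v["llm_extracts_key_info"];
  if (typeof info !== "object" || info === null) return false;
  const i = info as Record<string, unknown>;
  if (typeof i["what"] !== "string" || typeof i["summary"] !== "string") return false;
  if (!Array.isArray(i["features"]) || !Array.isArray(i["faqs"])) return false;

  const emb = v["embeddings"];
  return Array.isArray(emb) && emb.every((n: unknown) => typeof n === "number");
}
```

A guard like this lets callers narrow `JSON.parse` output to `CrawlResponse` before reading fields, which is useful while the API is still a work in progress and the shape may change.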

📚 Documentation

TODO

🤝 Contributing

TODO

About

🕷️ DiscovAI Crawl API (🚧 Work in Progress 🚧): A powerful web scraping solution for AI tools and vector databases. Extract clean HTML, generate LLM-friendly content, and create embeddings from any URL.
