A free, lightweight tool to streamline the discovery of API documentation, policies, and community resources, and to enhance LLMs with accurate, relevant context.
Like the project? Please give it a Star on GitHub so it can reach more people!
⚠️ Under Construction
This project is in the early stages of development and may not function as intended yet. Contributions, feedback, and ideas are highly welcome!
api-docs-urls.csv contains a centralized collection of popular APIs with links to their official documentation and associated policies. It includes tools to scrape, preprocess, and update the dataset for better usability and retrieval.
api-docs-urls.csv:
| API Name | Official Documentation URL | Privacy Policy URL | Terms of Service URL | Rate Limiting Policy URL | Changelog/Release Notes URL | Security Policy URL | Developer Community/Forum URL |
|---|---|---|---|---|---|---|---|
| OpenAI API | Documentation | Privacy | Terms | Rate Limits | Changelog | Security | Community |
| ... |
⚠️ The URLs are auto-generated and require manual verification.
We aim to keep these URLs pointing at the current documents (TODO: set up cron jobs/GitHub Actions to periodically re-run the scrapers and keep the dataset up to date). A sketch of such a workflow appears below.
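A minimal sketch of what that automation could look like, assuming the scrapers are run with `npm`/`node` as described later in this README; the schedule, file names, and commit step are illustrative, not an existing workflow in this repository:

```yaml
# .github/workflows/update-dataset.yml (hypothetical)
name: Refresh scraped API docs
on:
  schedule:
    - cron: '0 0 * * 0'  # weekly, Sundays at 00:00 UTC
  workflow_dispatch:     # allow manual runs too

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - run: node fast-scraper.js
      - name: Commit refreshed dataset
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add scraped_data_fast.json
          git commit -m "chore: refresh scraped data" || echo "No changes to commit"
          git push
```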
You can manually add new entries to api-docs-urls.csv with the following format:
```csv
API_Name,Official_Documentation_URL,Privacy_Policy_URL,Terms_of_Service_URL,Rate_Limiting_Policy_URL,Changelog_Release_Notes_URL,Security_Policy_URL,Developer_Community_Forum_URL
Example API,https://example.com/docs,https://example.com/privacy,https://example.com/tos,https://example.com/rate-limits,https://example.com/changelog,https://example.com/security,https://example.com/community
```
If you have additional entries in separate CSV files, use the provided Python utility script to merge them into the main dataset.
- Ensure you have Python installed.
- Run the script:
  ```bash
  python utils/combine_csv.py new_entries.csv api-docs-urls.csv combined_dataset.csv
  ```
- Replace the existing `api-docs-urls.csv` with the new `combined_dataset.csv`.
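For reference, here is a minimal sketch of what a merge utility like `utils/combine_csv.py` might do, using pandas; the implementation below is an assumption, not the actual script:

```python
import sys
import pandas as pd

# Usage: python combine_csv.py <new_entries.csv> <main_dataset.csv> <output.csv>
new_path, main_path, out_path = sys.argv[1:4]

# Concatenate both datasets; keep the first occurrence of each API_Name
# so existing entries take precedence over re-submitted ones
combined = pd.concat([pd.read_csv(main_path), pd.read_csv(new_path)])
combined = combined.drop_duplicates(subset=['API_Name'], keep='first')
combined.to_csv(out_path, index=False)
```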
Use Case 1: You can use the scrapers (`fast-scraper.js` or `accurate-scraper.js`) to extract content from API docs and enhance your LLM to provide specific and accurate answers about APIs.
Workflow Example:
1. **Retrieve relevant snippets**: Query the vector database for the user's question with a custom script.
2. **Generate answers with an LLM**: Pass the retrieved snippets as context to the LLM (e.g., GPT-4 or LLaMA-2):
```python
import numpy as np
from faiss import read_index
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the vector index and the embedding model used to build it
index = read_index('vector_index.faiss')
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the user question and retrieve the top-5 matching chunks;
# `documents` holds the scraped text chunks in index order
user_query = "What are the rate limits for the OpenAI API?"
query_embedding = embedder.encode(user_query)
_, indices = index.search(np.array([query_embedding], dtype='float32'), k=5)
context = " ".join([documents[i] for i in indices[0]])

# Generate an answer with a local LLM (GPT-4 is not a Hugging Face
# checkpoint, so a LLaMA-2 model is used here instead)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
prompt = f"Context: {context}\nQuestion: {user_query}\nAnswer:"
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
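The snippet above assumes a FAISS index already exists. A minimal sketch of how one could be built from the scraped text, assuming `documents` is a list of text chunks taken from the scraper output (the embedding model and file name are illustrative):

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# `documents` is a list of text chunks extracted from the scraped docs
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(documents)

# Build a flat L2 index over the embeddings and persist it to disk
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings, dtype='float32'))
faiss.write_index(index, 'vector_index.faiss')
```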
Use Case 2: Maintain offline copies of API documentation for scenarios where internet access is unavailable or restricted. Offline access ensures reliability and speed when querying API documentation.
How?
- Use the scrapers to generate offline copies of the documentation in JSON, HTML, or Markdown formats.
- Serve these copies locally or integrate them into a lightweight desktop or web application (see the sketch below).
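For example, a directory of scraped copies can be served locally with Python's built-in HTTP server; the port and directory layout are illustrative:

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the current directory (e.g., the scraped JSON/HTML/Markdown copies)
server = HTTPServer(('localhost', 8000), SimpleHTTPRequestHandler)
print("Serving offline docs at http://localhost:8000")
server.serve_forever()
```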
Use Case 3: API documentation changes frequently, and outdated information can lead to bugs or misconfigurations. Automating change detection ensures your knowledge base remains up-to-date.
How?
- Compare the current version of a page with its previously saved version.
- Use hashing (e.g., MD5) or diff-checking tools to detect changes in content (see the sketch below).
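A minimal hash-based sketch, assuming `requests` is available and using a local `hashes.json` file (both are illustrative choices) to remember each page's last-seen digest:

```python
import hashlib
import json
import requests  # assumed dependency; any HTTP client would do

def content_hash(url: str) -> str:
    """Fetch a page and return the MD5 digest of its body."""
    body = requests.get(url, timeout=30).content
    return hashlib.md5(body).hexdigest()

# 'hashes.json' maps each documentation URL to its last-seen digest
with open('hashes.json') as f:
    previous = json.load(f)

url = 'https://example.com/docs'
current = content_hash(url)
if previous.get(url) != current:
    print(f"Change detected: {url}")
    previous[url] = current
    with open('hashes.json', 'w') as f:
        json.dump(previous, f, indent=2)
```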
Recommended Python Versions: Python >=3.7 and <3.10
- Check your Python version:
  ```bash
  python --version
  ```
- If your Python version is incompatible, you can:
- Install a compatible version (e.g., Python 3.9).
- Use a virtual environment:
    ```bash
    python3.9 -m venv venv
    source venv/bin/activate  # Or venv\Scripts\activate on Windows
    pip install -r requirements.txt
    ```
- Alternatively, use Conda to install PyTorch and its dependencies:
  ```bash
  conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
  ```
We provide two scraping tools to suit different needs:
- `fast-scraper.js`: A lightweight Cheerio-based scraper for fast retrieval of static content.
- `accurate-scraper.js`: A Playwright-based scraper for handling JavaScript-loaded pages and more dynamic content.
- Purpose: For quickly scraping static API documentation pages.
- Strengths:
- Lightweight and fast.
- Suitable for pages without JavaScript content.
- Limitations:
- Does not handle JavaScript-loaded content.
- Install dependencies:
  ```bash
  npm install
  ```
- Run the script:
  ```bash
  node fast-scraper.js
  ```
- Results will be saved in `scraped_data_fast.json`.
- Purpose: For scraping API documentation pages that rely on JavaScript for rendering.
- Strengths:
- Handles dynamic content and JavaScript-loaded pages.
- More accurate for modern, interactive documentation sites.
- Limitations:
  - Slower compared to `fast-scraper.js`.
- Install Playwright:
  ```bash
  npm install playwright
  ```
- Run the script:
  ```bash
  node accurate-scraper.js
  ```
- Results will be saved in `scraped_data_accurate.json`.
For first-time contributors, I recommend checking out https://github.com/firstcontributions/first-contributions and https://www.youtube.com/watch?v=YaToH3s_-nQ.
Contributions are welcome! Here's how you can contribute:
1. **Add API Entries**:
   - Add new API entries directly to `api-docs-urls.csv` or via pull request.
   - Ensure URLs point to the current version of the documentation and policies.
2. **Verify API Entries**:
   - Is the URL up-to-date?
   - Is the URL root-level for the relevant page? (`api.com/docs/`, not `api.com/docs/nested`)
   - Is the API doc public, and does it comply with `robots.txt`?
   - Does the URL provide all the expected information (changelogs, rate limits, etc.)?
   - Can the scrapers extract any dynamically loaded page content?
3. **Improve Scrapers**:
   - Enhance `fast-scraper.js` or `accurate-scraper.js` for better performance and compatibility.
   - Add features like advanced error handling or field-specific scraping.
4. **Submit Pull Requests**:
   - Fork the repository.
   - Create a new branch for your changes.
   - Submit a pull request for review.
If you're using the scripts, first install dependencies:
```bash
npm install
pip install -r requirements.txt
```
This installs everything listed in `package.json` and `requirements.txt`.
- 🔍 **Search & Browse**: Easily find APIs by keyword or category (e.g., "Machine Learning APIs," "Finance APIs").
- 🔄 **Latest API Metadata Retrieval**: Retrieve up-to-date API endpoints and parameters, directly from official documentation.
- 💻 **VS Code Integration**: Use the lightweight UpdAPI extension to search and retrieve APIs directly from your terminal.
This repository is licensed under the MIT License.
- Under Construction: We're building the core MVP features and testing functionality.
- Limited API support.
- Some features may not work as expected.
Check the Open Issues for more details.
- Basic search and browse functionality.
- JSON exports for select APIs.
- Direct links to official API documentation.
- IDE integrations (e.g., VS Code plugin).
- API update notifications via email/webhooks.
- Support for more APIs.
We thank all API providers for publishing robust documentation and fostering developer-friendly ecosystems. Your contributions make projects like this possible! Special thanks to:
- Crawlee: A powerful web scraping and crawling library that simplifies the extraction of structured data from websites.
- OpenAPI: For setting the standard in API specifications and enabling better interoperability and accessibility.
For questions or support, email support@updapi.com.