A free, lightweight tool to streamline the discovery of API documentation, policies, and community resources, and to enhance LLMs with accurate, relevant context.
Like the project? Please give it a Star on GitHub so it can reach more people!
⚠️ Under Construction
This project is in the early stages of development and may not function as intended yet. Contributions, feedback, and ideas are highly welcome!
api-docs-urls.csv contains a centralized collection of popular APIs with links to their official documentation and associated policies. It includes tools to scrape, preprocess, and update the dataset for better usability and retrieval.
api-docs-urls.csv:
| API Name | Official Documentation URL | Privacy Policy URL | Terms of Service URL | Rate Limiting Policy URL | Changelog/Release Notes URL | Security Policy URL | Developer Community/Forum URL |
|---|---|---|---|---|---|---|---|
| OpenAI API | Documentation | Privacy | Terms | Rate Limits | Changelog | Security | Community |
| ... |
⚠️ The URLs are auto-generated and require manual verification.
We aim to keep these URLs pointing at the current documents (TODO: set up cron jobs/GitHub Actions to periodically re-run the scrapers and keep the dataset up to date). A sketch of such a workflow appears below.
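A minimal sketch of what that automation could look like, assuming the scrapers are run with `npm`/`node` as described later in this README; the schedule, file names, and commit step are illustrative, not an existing workflow in this repository:

```yaml
# .github/workflows/update-dataset.yml (hypothetical)
name: Refresh scraped API docs
on:
  schedule:
    - cron: '0 0 * * 0'  # weekly, Sundays at 00:00 UTC
  workflow_dispatch:     # allow manual runs too

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - run: node fast-scraper.js
      - name: Commit refreshed dataset
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add scraped_data_fast.json
          git commit -m "chore: refresh scraped data" || echo "No changes to commit"
          git push
```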
You can manually add new entries to api-docs-urls.csv with the following format:
```csv
API_Name,Official_Documentation_URL,Privacy_Policy_URL,Terms_of_Service_URL,Rate_Limiting_Policy_URL,Changelog_Release_Notes_URL,Security_Policy_URL,Developer_Community_Forum_URL
Example API,https://example.com/docs,https://example.com/privacy,https://example.com/tos,https://example.com/rate-limits,https://example.com/changelog,https://example.com/security,https://example.com/community
```
If you have additional entries in separate CSV files, use the provided Python utility script to merge them into the main dataset.
- Ensure you have Python installed.
- Run the script:
  ```bash
  python utils/combine_csv.py new_entries.csv api-docs-urls.csv combined_dataset.csv
  ```
- Replace the existing `api-docs-urls.csv` with the new `combined_dataset.csv`.
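For reference, here is a minimal sketch of what a merge utility like `utils/combine_csv.py` might do, using pandas; the implementation below is an assumption, not the actual script:

```python
import sys
import pandas as pd

# Usage: python combine_csv.py <new_entries.csv> <main_dataset.csv> <output.csv>
new_path, main_path, out_path = sys.argv[1:4]

# Concatenate both datasets; keep the first occurrence of each API_Name
# so existing entries take precedence over re-submitted ones
combined = pd.concat([pd.read_csv(main_path), pd.read_csv(new_path)])
combined = combined.drop_duplicates(subset=['API_Name'], keep='first')
combined.to_csv(out_path, index=False)
```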
Use Case 1: You can use the scrapers (`fast-scraper.js` or `accurate-scraper.js`) to extract content from API docs and enhance your LLM to provide specific and accurate answers about APIs.
Workflow Example:
1. **Retrieve relevant snippets**: Query the vector database for the user's question with a custom script.
2. **Generate answers with an LLM**: Pass the retrieved snippets as context to the LLM (e.g., GPT-4 or LLaMA-2):
```python
import numpy as np
from faiss import read_index
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the vector index and the embedding model used to build it
index = read_index('vector_index.faiss')
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the user question and retrieve the top-5 matching chunks;
# `documents` holds the scraped text chunks in index order
user_query = "What are the rate limits for the OpenAI API?"
query_embedding = embedder.encode(user_query)
_, indices = index.search(np.array([query_embedding], dtype='float32'), k=5)
context = " ".join([documents[i] for i in indices[0]])

# Generate an answer with a local LLM (GPT-4 is not a Hugging Face
# checkpoint, so a LLaMA-2 model is used here instead)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
prompt = f"Context: {context}\nQuestion: {user_query}\nAnswer:"
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
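The snippet above assumes a FAISS index already exists. A minimal sketch of how one could be built from the scraped text, assuming `documents` is a list of text chunks taken from the scraper output (the embedding model and file name are illustrative):

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# `documents` is a list of text chunks extracted from the scraped docs
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(documents)

# Build a flat L2 index over the embeddings and persist it to disk
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings, dtype='float32'))
faiss.write_index(index, 'vector_index.faiss')
```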
Use Case 2: Maintain offline copies of API documentation for scenarios where internet access is unavailable or restricted. Offline access ensures reliability and speed when querying API documentation.
How?
- Use the scrapers to generate offline copies of the documentation in JSON, HTML, or Markdown formats.
- Serve these copies locally or integrate them into a lightweight desktop or web application (see the sketch below).
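For example, a directory of scraped copies can be served locally with Python's built-in HTTP server; the port and directory layout are illustrative:

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the current directory (e.g., the scraped JSON/HTML/Markdown copies)
server = HTTPServer(('localhost', 8000), SimpleHTTPRequestHandler)
print("Serving offline docs at http://localhost:8000")
server.serve_forever()
```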
Use Case 3: API documentation changes frequently, and outdated information can lead to bugs or misconfigurations. Automating change detection ensures your knowledge base remains up-to-date.
How?
- Compare the current version of a page with its previously saved version.
- Use hashing (e.g., MD5) or diff-checking tools to detect changes in content (see the sketch below).
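A minimal hash-based sketch, assuming `requests` is available and using a local `hashes.json` file (both are illustrative choices) to remember each page's last-seen digest:

```python
import hashlib
import json
import requests  # assumed dependency; any HTTP client would do

def content_hash(url: str) -> str:
    """Fetch a page and return the MD5 digest of its body."""
    body = requests.get(url, timeout=30).content
    return hashlib.md5(body).hexdigest()

# 'hashes.json' maps each documentation URL to its last-seen digest
with open('hashes.json') as f:
    previous = json.load(f)

url = 'https://example.com/docs'
current = content_hash(url)
if previous.get(url) != current:
    print(f"Change detected: {url}")
    previous[url] = current
    with open('hashes.json', 'w') as f:
        json.dump(previous, f, indent=2)
```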
Recommended Python Versions: Python >=3.7 and <3.10
- Check your Python version:
  ```bash
  python --version
  ```
- If your Python version is incompatible, you can:
- Install a compatible version (e.g., Python 3.9).
- Use a virtual environment:
    ```bash
    python3.9 -m venv venv
    source venv/bin/activate  # Or venv\Scripts\activate on Windows
    pip install -r requirements.txt
    ```
- Alternatively, use Conda to install PyTorch and its dependencies:
  ```bash
  conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
  ```
We provide two scraping tools to suit different needs:
- `fast-scraper.js`: A lightweight Cheerio-based scraper for fast retrieval of static content.
- `accurate-scraper.js`: A Playwright-based scraper for handling JavaScript-loaded pages and more dynamic content.
- Purpose: For quickly scraping static API documentation pages.
- Strengths:
- Lightweight and fast.
- Suitable for pages without JavaScript content.
- Limitations:
- Does not handle JavaScript-loaded content.
- Install dependencies:
  ```bash
  npm install
  ```
- Run the script:
  ```bash
  node fast-scraper.js
  ```
- Results will be saved in `scraped_data_fast.json`.
- Purpose: For scraping API documentation pages that rely on JavaScript for rendering.
- Strengths:
- Handles dynamic content and JavaScript-loaded pages.
- More accurate for modern, interactive documentation sites.
- Limitations:
  - Slower compared to `fast-scraper.js`.
- Install Playwright:
  ```bash
  npm install playwright
  ```
- Run the script:
  ```bash
  node accurate-scraper.js
  ```
- Results will be saved in `scraped_data_accurate.json`.
For first-time contributors, I recommend checking out https://github.com/firstcontributions/first-contributions and https://www.youtube.com/watch?v=YaToH3s_-nQ.
Contributions are welcome! Here's how you can contribute:
1. **Add API Entries**:
   - Add new API entries directly to `api-docs-urls.csv` or via pull request.
   - Ensure URLs point to the current version of the documentation and policies.
2. **Verify API Entries**:
   - Is the URL up-to-date?
   - Is the URL root-level for the relevant page? (`api.com/docs/`, not `api.com/docs/nested`)
   - Is the API doc public, and does it comply with `robots.txt`?
   - Does the URL provide all the expected information (changelogs, rate limits, etc.)?
   - Can the scrapers extract any dynamically loaded page content?
3. **Improve Scrapers**:
   - Enhance `fast-scraper.js` or `accurate-scraper.js` for better performance and compatibility.
   - Add features like advanced error handling or field-specific scraping.
4. **Submit Pull Requests**:
   - Fork the repository.
   - Create a new branch for your changes.
   - Submit a pull request for review.
If you're using the scripts, first install dependencies:
```bash
npm install
pip install -r requirements.txt
```
This installs everything listed in `package.json` and `requirements.txt`.
- 🔍 **Search & Browse**: Easily find APIs by keyword or category (e.g., "Machine Learning APIs," "Finance APIs").
- 🔄 **Latest API Metadata Retrieval**: Retrieve up-to-date API endpoints and parameters, directly from official documentation.
- 💻 **VS Code Integration**: Use the lightweight UpdAPI extension to search and retrieve APIs directly from your terminal.
This repository is licensed under the MIT License.
- Under Construction: We're building the core MVP features and testing functionality.
- Limited API support.
- Some features may not work as expected.
Check the Open Issues for more details.
- Basic search and browse functionality.
- JSON exports for select APIs.
- Direct links to official API documentation.
- IDE integrations (e.g., VS Code plugin).
- API update notifications via email/webhooks.
- Support for more APIs.
We thank all API providers for publishing robust documentation and fostering developer-friendly ecosystems. Your contributions make projects like this possible! Special thanks to:
- Crawlee: A powerful web scraping and crawling library that simplifies the extraction of structured data from websites.
- OpenAPI: For setting the standard in API specifications and enabling better interoperability and accessibility.
For questions or support, email support@updapi.com.