Content Scraper API

A high-performance FastAPI microservice that extracts content from web articles using newspaper3k, providing structured data through a RESTful API.

Features

- Article Extraction: Extracts comprehensive article data including:
  - Title and main content
  - Authors and publication date
  - Images and videos
  - Meta information (keywords, description, language)
  - Additional metadata
- Clean Architecture: Modular design with clear separation of concerns
- FastAPI Framework: High performance, automatic OpenAPI documentation
- Error Handling: Robust error handling for various failure scenarios
- Type Safety: Full type hints and Pydantic models for request/response validation

Prerequisites

Python 3.8 or higher
pip package manager

Installation

Clone the repository:

git clone https://github.com/mr3od/content-scraper-api.git
cd content-scraper-api

Install dependencies:

pip install -r requirements.txt

Usage

Starting the server

python main.py

Or using uvicorn directly:

uvicorn main:app --reload

The API will be available at http://localhost:8000

API Endpoints

POST /fetch-article

Fetches and parses an article from the provided URL.

Request:

{
 "url": "https://example.com/news/article"
}

Response:

{
 "url": "https://example.com/news/article",
 "title": "Example Article Title",
 "content": "Article content text...",
 "top_image": "https://example.com/images/top.jpg",
 "authors": ["Author Name"],
 "images": [
  "https://example.com/images/1.jpg",
  "https://example.com/images/2.jpg"
 ],
 "movies": ["https://example.com/videos/1.mp4"]
}
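
As a rough illustration of calling this endpoint from Python, the sketch below uses only the standard library. The helper names `build_fetch_request` and `fetch_article`, the 30-second timeout, and the `BASE_URL` constant are assumptions made for this example and are not part of the project.

```python
import json
import urllib.request

# Assumption: the server is running locally on the default address from this README.
BASE_URL = "http://localhost:8000"


def build_fetch_request(article_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /fetch-article request without sending it."""
    body = json.dumps({"url": article_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/fetch-article",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_article(article_url: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body into a dict."""
    request = build_fetch_request(article_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_article("https://example.com/news/article")` should return a dict with the keys shown in the response above, such as `title`, `content`, and `authors`.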

API Documentation

FastAPI automatically generates interactive API documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Architecture

The application follows a clean architecture approach with the following components:

API Layer: Handles HTTP requests and responses
Service Layer: Contains the business logic for article extraction
Core: Configuration and shared utilities
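
The layering described above can be sketched without any framework. This README does not show the project's actual code, so everything below is a hypothetical illustration: `Settings`, `Article`, `ArticleService`, `fetch_article_handler`, and `fake_extractor` are invented names, and the real service layer would call newspaper3k where the stand-in extractor is used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# core: configuration shared by the other layers (field name is hypothetical).
@dataclass(frozen=True)
class Settings:
    request_timeout: float = 10.0


# services: business logic, independent of the web framework.
@dataclass
class Article:
    url: str
    title: str
    content: str
    authors: List[str] = field(default_factory=list)


class ArticleService:
    def __init__(self, settings: Settings, extractor: Callable[[str, float], Article]):
        # The extractor is injected so the service can be exercised without network access.
        self._settings = settings
        self._extractor = extractor

    def fetch(self, url: str) -> Article:
        return self._extractor(url, self._settings.request_timeout)


# api: turns a request payload into a service call and a response payload.
def fetch_article_handler(payload: Dict[str, str], service: ArticleService) -> Dict[str, object]:
    article = service.fetch(payload["url"])
    return {
        "url": article.url,
        "title": article.title,
        "content": article.content,
        "authors": article.authors,
    }


# Stand-in extractor for demonstration only; it performs no download or parsing.
def fake_extractor(url: str, timeout: float) -> Article:
    return Article(
        url=url,
        title="Example Article Title",
        content="Article content text...",
        authors=["Author Name"],
    )
```

Because the API layer only depends on `ArticleService`, the extraction logic can be swapped or tested in isolation, which is the main benefit of keeping the layers separate.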

Error Handling

The API handles various error scenarios:

Invalid URLs
Unreachable sites
Parsing failures
Server errors
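
One possible way to organize these scenarios is to give each its own exception type and map it to an HTTP status code. The README does not state which status codes the API returns, so the exception names and the codes below (400, 502, 422, 500) are assumptions chosen for illustration only.

```python
from urllib.parse import urlparse


class InvalidURLError(ValueError):
    """The submitted URL is not an absolute http(s) URL."""


class SiteUnreachableError(Exception):
    """The target site could not be reached or timed out."""


class ParsingError(Exception):
    """The page was downloaded but no article could be extracted."""


# Assumed mapping; the project's actual status codes are not documented here.
STATUS_BY_ERROR = {
    InvalidURLError: 400,
    SiteUnreachableError: 502,
    ParsingError: 422,
}


def validate_url(url: str) -> str:
    """Return the URL unchanged if it looks like an absolute http(s) URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidURLError(f"not an absolute http(s) URL: {url!r}")
    return url


def status_for(error: Exception) -> int:
    """Pick a status code for an error, falling back to 500 for anything unexpected."""
    for error_type, status in STATUS_BY_ERROR.items():
        if isinstance(error, error_type):
            return status
    return 500
```

An API-layer exception handler could then call `status_for` once, instead of repeating status codes in every endpoint.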

Extending the API

The modular architecture makes it easy to extend the API:

Add new endpoints in api/routes.py
Add new services in the services package
Modify the data models in api/models.py

License

MIT

About

FastAPI RESTful API for extracting clean content and metadata from web articles using newspaper3k.

mr3od/content-scraper-api
