A robust backend system for aggregating and processing articles from various sources (RSS, Zhihu, etc.) with AI-powered summarization and tagging.
- Backend: Python 3.10, FastAPI
- Database: PostgreSQL
- Cache/Queue: Redis, Celery
- ORM: SQLAlchemy
- AI: LangChain, OpenAI
- Crawling: feedparser, httpx, trafilatura
- Containerization: Docker, docker-compose
InfoHub/
├── app/
│ ├── main.py # FastAPI entry point
│ ├── core/ # Configuration & dependencies
│ ├── db/ # Database setup
│ ├── models/ # SQLAlchemy models
│ ├── schemas/ # Pydantic schemas
│ ├── crud/ # Database operations
│ ├── crawlers/ # Source crawlers
│ ├── services/ # AI processing
│ ├── workers/ # Celery tasks
│ └── utils/ # Utilities
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── .env.example
Copy the example environment file and configure your settings:
cp .env.example .env
Edit .env and add your OpenAI API key:
DATABASE_URL=postgresql+psycopg2://user:password@localhost:5432/infohub
REDIS_URL=redis://localhost:6379/0
OPENAI_API_KEY=sk-your-actual-api-key
SECRET_KEY=your-secret-key-here
ENVIRONMENT=dev
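The application has to read these variables at startup. The following is a minimal sketch of one way to do that with only the standard library; it is not the project's actual configuration code. The names `Settings` and `load_settings` are hypothetical, and treating `ENVIRONMENT` as optional with a default of `dev` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names mirror the .env keys."""
    database_url: str
    redis_url: str
    openai_api_key: str
    secret_key: str
    environment: str = "dev"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default).

    Raises RuntimeError listing every required key that is missing or empty,
    so a misconfigured deployment fails at startup rather than mid-request.
    """
    env = os.environ if env is None else env
    required = ["DATABASE_URL", "REDIS_URL", "OPENAI_API_KEY", "SECRET_KEY"]
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        openai_api_key=env["OPENAI_API_KEY"],
        secret_key=env["SECRET_KEY"],
        environment=env.get("ENVIRONMENT", "dev"),  # assumed default
    )
```

Note that this reads process environment variables only; it does not parse the .env file itself, so the values must already be exported to the process.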
Start all services with docker-compose:
docker-compose up -d
This will start:
- Web: FastAPI server on http://localhost:8000
- Worker: Celery background worker
- DB: PostgreSQL on port 5432
- Redis: Redis on port 6379
Install dependencies:
pip install -r requirements.txt
Start PostgreSQL and Redis (or use docker-compose for just these services):
docker-compose up -d db redis
Run the FastAPI server:
uvicorn app.main:app --reload
Run the Celery worker (in a separate terminal):
celery -A app.workers.tasks worker --loglevel=info
GET http://localhost:8000/
- Create Source: POST /api/v1/sources
  { "platform": "rss", "identity": "https://example.com/rss", "is_active": true }
- List Sources: GET /api/v1/sources
- Get Source: GET /api/v1/sources/{source_id}
- Trigger Crawl: POST /api/v1/sources/{source_id}/crawl

- List Articles: GET /api/v1/articles?status=processed
- Get Article: GET /api/v1/articles/{article_id}
- Create Article: POST /api/v1/articles
GET http://localhost:8000/api/v1/crawl?source_url=https://example.com/rss&platform=rss
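The endpoints above can be called from a script. The following sketch builds two of the requests with the standard library's `urllib`, assuming the API is reachable at http://localhost:8000 as in the docker-compose setup. The helper names are hypothetical, and since the response format is not documented here, the sketch only constructs the requests.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumed: the local FastAPI server


def create_source_request(identity: str, platform: str = "rss") -> urllib.request.Request:
    """Build (but do not send) the POST /api/v1/sources request."""
    body = json.dumps({"platform": platform, "identity": identity, "is_active": True})
    return urllib.request.Request(
        BASE_URL + "/api/v1/sources",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_articles_request(status: str = "processed") -> urllib.request.Request:
    """Build (but do not send) the GET /api/v1/articles request with a status filter."""
    query = urlencode({"status": status})
    return urllib.request.Request(f"{BASE_URL}/api/v1/articles?{query}", method="GET")


def send(request: urllib.request.Request) -> str:
    """Send a prepared request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, `send(create_source_request("https://example.com/rss"))` would register the feed, and `send(list_articles_request())` would fetch the processed articles.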
- Multi-Platform Crawling: Supports RSS feeds and is extensible to other platforms
- AI Processing: Automatic summarization, tagging, and quality scoring
- Async Task Queue: Celery-based background processing
- Duplicate Detection: Prevents storing duplicate articles
- Clean Content: HTML to Markdown conversion for better readability
- RESTful API: Well-structured API with FastAPI
- Database Migrations: SQLAlchemy ORM with PostgreSQL
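One common way to implement duplicate detection is a unique constraint on the article's original URL, so that a second insert of the same URL is rejected by the database itself. The sketch below illustrates the idea with the standard library's `sqlite3` as a stand-in for PostgreSQL. The table layout and the `make_store` and `save_article` helpers are assumptions for illustration, not the project's code.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory database with a unique constraint on source_url."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " source_url TEXT NOT NULL UNIQUE)"
    )
    return conn


def save_article(conn: sqlite3.Connection, title: str, source_url: str) -> bool:
    """Insert an article unless its URL is already stored.

    Returns True if a new row was written, False if it was a duplicate.
    """
    cursor = conn.execute(
        "INSERT OR IGNORE INTO articles (title, source_url) VALUES (?, ?)",
        (title, source_url),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 rows changed means the insert was ignored


conn = make_store()
first = save_article(conn, "Hello", "https://example.com/post-1")
second = save_article(conn, "Hello (re-crawled)", "https://example.com/post-1")
print(first, second)
```

In PostgreSQL the equivalent statement is `INSERT ... ON CONFLICT DO NOTHING`. Letting the database enforce uniqueness avoids the race that a separate "check, then insert" step would have when several workers crawl at once.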
Article:
- id: Primary key
- title: Article title
- author: Author name
- source_url: Original URL (unique)
- content: Cleaned Markdown content
- summary: AI-generated summary
- tags: AI-extracted tags
- ai_score: Quality score (0-10)
- status: Processing status (pending, processed, failed)
- created_at: Creation timestamp
- updated_at: Last update timestamp

Source:
- id: Primary key
- platform: Platform type (rss, zhihu, etc.)
- identity: URL or user identifier
- is_active: Active status
- last_crawled_at: Last crawl timestamp
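To make the two field lists concrete, here is a rough sketch of them as plain Python dataclasses. It is not the project's SQLAlchemy models: the Python types (for example `tags` as a list of strings and `ai_score` as a float), the defaults, and the range check are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class ArticleStatus(str, Enum):
    """The three processing states listed for the status field."""
    PENDING = "pending"
    PROCESSED = "processed"
    FAILED = "failed"


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Article:
    title: str
    source_url: str                      # unique per article
    content: str = ""                    # cleaned Markdown
    author: Optional[str] = None
    summary: Optional[str] = None        # filled in by AI processing
    tags: List[str] = field(default_factory=list)
    ai_score: Optional[float] = None     # quality score, 0-10
    status: ArticleStatus = ArticleStatus.PENDING
    id: Optional[int] = None             # assigned by the database
    created_at: datetime = field(default_factory=_utc_now)
    updated_at: datetime = field(default_factory=_utc_now)

    def __post_init__(self) -> None:
        # Enforce the documented 0-10 range for the quality score.
        if self.ai_score is not None and not 0 <= self.ai_score <= 10:
            raise ValueError("ai_score must be between 0 and 10")


@dataclass
class Source:
    platform: str                        # e.g. "rss" or "zhihu"
    identity: str                        # URL or user identifier
    is_active: bool = True
    last_crawled_at: Optional[datetime] = None
    id: Optional[int] = None             # assigned by the database
```

A freshly crawled article would start as `Article(title=..., source_url=...)` with status `pending`, and move to `processed` or `failed` once the worker has handled it.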
pytest
Using Alembic (to be set up):
alembic init alembic
alembic revision --autogenerate -m "Initial migration"
alembic upgrade head
- Check Redis connection: docker-compose logs redis
- Check worker logs: docker-compose logs worker
- Verify tasks are registered: celery -A app.workers.tasks inspect registered
- Ensure PostgreSQL is running: docker-compose logs db
- Check DATABASE_URL in .env matches docker-compose configuration
- Verify OPENAI_API_KEY is set correctly
- Check API quota/billing status
- Review worker logs for detailed error messages
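For the DATABASE_URL check in particular, a small script can make a mismatch obvious. The sketch below parses the URL with the standard library and compares it against the host and port you expect. The `check_database_url` helper and its messages are hypothetical, and the default port of 5432 is taken from the service list in this README.

```python
from typing import List
from urllib.parse import urlsplit


def check_database_url(url: str, expected_host: str, expected_port: int = 5432) -> List[str]:
    """Return a list of human-readable problems found in a database URL.

    An empty list means the URL looks consistent with the expected host and port.
    """
    parts = urlsplit(url)
    problems = []
    if not parts.scheme.startswith("postgresql"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if parts.hostname != expected_host:
        problems.append(f"host is {parts.hostname!r}, expected {expected_host!r}")
    if (parts.port or 5432) != expected_port:
        problems.append(f"port is {parts.port}, expected {expected_port}")
    if not parts.path.strip("/"):
        problems.append("no database name in URL")
    return problems


url = "postgresql+psycopg2://user:password@localhost:5432/infohub"
print(check_database_url(url, expected_host="localhost"))
print(check_database_url(url, expected_host="db"))
```

The second call illustrates a frequent cause of connection errors: containers started by docker-compose usually reach PostgreSQL through its service name (`db` in the commands above) rather than `localhost`, so a URL that works for local development may need a different host inside the compose network.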
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.