News monitoring pipeline — collect RSS feeds, extract full articles, search by meaning, and track page changes. Built entirely from QuartzUnit libraries.
flowchart LR
A["🔗 feedkit\n444 RSS feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → markdown"]
B -->|"markdown files"| C["🔍 embgrep\nsemantic index"]
C -->|"tracked pages"| D["📊 diffgrab\nchange detection"]
pip install newswatch

# Subscribe to tech feeds from the built-in catalog
newswatch setup -c technology

# Run the full pipeline: collect → extract → index
newswatch run

# Search collected articles by meaning
newswatch search "kubernetes scaling strategies"
- Collect — Subscribes to RSS/Atom feeds via feedkit (444 curated feeds built-in)
- Extract — Fetches full article content via markgrab (HTML → clean markdown)
- Index — Builds a local semantic search index via embgrep (embedding-powered, no API keys)
- Track — Monitors pages for changes via diffgrab (structured diffs)
No cloud services, no API keys. Everything runs locally.
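As a rough illustration of the Track step above, the sketch below compares two saved snapshots of a page and reports what changed. It uses only the standard library's `difflib` and is not diffgrab's implementation; `changed_lines` and the sample snapshots are made up for this example.

```python
import difflib


def changed_lines(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two text snapshots of a page."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",  # no trailing newlines, so lines join cleanly
        )
    )


# Illustrative snapshots (not real data).
before = "# Release notes\nVersion 1.0 is available.\nSee the changelog."
after = "# Release notes\nVersion 1.1 is available.\nSee the changelog."

print("\n".join(changed_lines(before, after)))
```

An empty result means the two snapshots are identical, which is the common case for a page that is polled repeatedly.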
Subscribe to feeds.
newswatch setup -c technology               # all 68 tech feeds
newswatch setup -c science -c finance       # multiple categories
newswatch setup -f https://example.com/rss  # individual URL
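To make concrete what a subscribed feed contributes to the pipeline, here is a minimal sketch that pulls article URLs out of an RSS 2.0 document with the standard library. It is an illustration only, not how feedkit parses feeds; `article_urls` and `SAMPLE_RSS` are hypothetical names, and Atom feeds (which use namespaced `<entry>` elements) are not handled.

```python
import xml.etree.ElementTree as ET

# Illustrative RSS 2.0 document (not a real feed).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/a</link></item>
    <item><title>Second post</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""


def article_urls(rss_xml: str) -> list[str]:
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:  # skip items without a link
            urls.append(link.strip())
    return urls


print(article_urls(SAMPLE_RSS))
```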
Run the full pipeline.
newswatch run                          # collect → extract → index
newswatch run -n 100                   # extract up to 100 articles
newswatch run -t https://example.com   # also track this page for changes
Output:
Running newswatch pipeline...
Pipeline Results
┌─────────────────────┬────────┐
│ Step                │ Result │
├─────────────────────┼────────┤
│ Feeds collected     │ 62     │
│ New articles        │ 418    │
│ Articles extracted  │ 50     │
│ Articles indexed    │ 50     │
└─────────────────────┴────────┘
Semantic search across collected articles.
newswatch search "AI regulation in Europe"
newswatch search "supply chain attacks" -n 10
import asyncio
from newswatch import NewsPipeline

async def main():
    pipeline = NewsPipeline()

    # Subscribe to feeds
    await pipeline.setup(categories=["technology", "science"])

    # Run full pipeline
    result = await pipeline.run(extract_limit=100)
    print(f"{result.articles_new} new, {result.articles_indexed} indexed")

    # Semantic search
    results = pipeline.search("quantum computing breakthroughs")
    for r in results:
        print(f" [{r['score']}] {r['text'][:80]}")

    pipeline.close()

asyncio.run(main())
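For intuition about the ranked results the loop above prints, the sketch below scores texts against a query by cosine similarity and returns dicts with `score` and `text` keys, the same shape that loop reads. It is not embgrep's implementation: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, so it matches shared words rather than meaning, and the sample texts are made up.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: lowercase word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, texts: list[str], limit: int = 5) -> list[dict]:
    """Return up to `limit` texts, most similar to the query first."""
    q = toy_embed(query)
    scored = [{"score": round(cosine(q, toy_embed(t)), 3), "text": t} for t in texts]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:limit]


# Illustrative texts (not real articles).
docs = [
    "central bank holds interest rates steady",
    "quantum computing reaches a new qubit record",
]
for r in rank("quantum computing breakthroughs", docs):
    print(f" [{r['score']}] {r['text'][:80]}")
```

A real embedding model replaces `toy_embed` with dense vectors, which is what lets a query match articles that share no words with it.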
flowchart TD
A["🔗 feedkit\n444 curated feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → clean markdown\nhttpx → Playwright fallback"]
B -->|"markdown files"| C["🔍 embgrep\nEmbed chunks → SQLite vector index\nSmart chunking · heading-level"]
C -->|"indexed articles"| D["📊 diffgrab\nTrack pages for changes\nStructured diffs + section analysis"]
style A fill:#1a1a2e,stroke:#e94560,color:#fff
style B fill:#1a1a2e,stroke:#0f3460,color:#fff
style C fill:#1a1a2e,stroke:#533483,color:#fff
style D fill:#1a1a2e,stroke:#e94560,color:#fff
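The diagram mentions heading-level chunking before embedding. A minimal sketch of that idea, assuming chunks simply start at each markdown heading, might look like the following. `chunk_by_heading` is a hypothetical helper, not embgrep's API, and it ignores edge cases such as `#` lines inside code blocks.

```python
import re

# An ATX heading: one to six '#' characters followed by whitespace and text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_heading(markdown: str) -> list[str]:
    """Split markdown into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            # A heading closes the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]  # drop chunks that were only blank lines


article = "# Title\nIntro text.\n\n## Details\nMore text.\n"
for chunk in chunk_by_heading(article):
    print(repr(chunk))
```

Keeping each heading with its body gives every chunk some context of its own, which tends to help when chunks are embedded and retrieved independently.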
Data is stored in ~/.newswatch/ by default:
~/.newswatch/
├── feeds.db # feedkit subscriptions + articles
├── index.db # embgrep semantic index
├── tracker.db # diffgrab snapshots
└── extracted/ # markgrab markdown output
Custom location:
pipeline = NewsPipeline(db_dir="/path/to/data")
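The layout above can be expressed as a small path helper, which is handy for backup or inspection scripts. This is a sketch under the assumption that a custom `db_dir` keeps the same file names as the default layout; `storage_paths` is a hypothetical function and not part of newswatch.

```python
from pathlib import Path


def storage_paths(db_dir: str = "~/.newswatch") -> dict[str, Path]:
    """Map each store in the documented layout to its path under db_dir."""
    base = Path(db_dir).expanduser()
    return {
        "feeds": base / "feeds.db",      # feedkit subscriptions + articles
        "index": base / "index.db",      # embgrep semantic index
        "tracker": base / "tracker.db",  # diffgrab snapshots
        "extracted": base / "extracted", # markgrab markdown output
    }


for name, path in storage_paths("/path/to/data").items():
    print(f"{name:10} {path}")
```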
| Library | Role in newswatch | PyPI |
|---|---|---|
| feedkit | RSS/Atom feed collection (444 curated feeds) | pip install feedkit |
| markgrab | URL → LLM-ready markdown extraction | pip install markgrab |
| embgrep | Local semantic search (fastembed + SQLite) | pip install embgrep |
| diffgrab | Web page change tracking + structured diffs | pip install diffgrab |
Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.