ArkNill/newswatch


newswatch

Korean documentation · llms.txt

News monitoring pipeline — collect RSS feeds, extract full articles, search by meaning, and track page changes. Built entirely from QuartzUnit libraries.

flowchart LR
 A["🔗 feedkit\n444 RSS feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → markdown"]
 B -->|"markdown files"| C["🔍 embgrep\nsemantic index"]
 C -->|"tracked pages"| D["📊 diffgrab\nchange detection"]

Quick Start

pip install newswatch
# Subscribe to tech feeds from the built-in catalog
newswatch setup -c technology
# Run the full pipeline: collect → extract → index
newswatch run
# Search collected articles by meaning
newswatch search "kubernetes scaling strategies"

What It Does

  1. Collect — Subscribes to RSS/Atom feeds via feedkit (444 curated feeds built-in)
  2. Extract — Fetches full article content via markgrab (HTML → clean markdown)
  3. Index — Builds a local semantic search index via embgrep (embedding-powered, no API keys)
  4. Track — Monitors pages for changes via diffgrab (structured diffs)

No cloud services, no API keys. Everything runs locally.
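The Track step reports page changes as structured diffs. As a rough illustration of that idea only, two text snapshots of a page can be compared line by line with Python's standard `difflib`. This is not diffgrab's implementation; `summarize_changes` and the sample snapshots are hypothetical.

```python
import difflib


def summarize_changes(old: str, new: str) -> dict:
    """Compare two text snapshots and group the changed lines by kind.

    Hypothetical helper for illustration; diffgrab's real output format
    is not shown in this README and may differ.
    """
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" contributes to both lists: old lines out, new lines in.
        if tag in ("delete", "replace"):
            removed.extend(old_lines[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])

    return {"added": added, "removed": removed, "changed": bool(added or removed)}


# Made-up snapshots of the same page at two points in time.
before = "# Pricing\nBasic: $5\nPro: $20"
after = "# Pricing\nBasic: $5\nPro: $25\nTeam: $60"
print(summarize_changes(before, after))
```

Grouping changes into added and removed lines is one simple way to make a diff machine-readable; a real tracker would likely also record which section each change belongs to.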

CLI

newswatch setup

Subscribe to feeds.

newswatch setup -c technology # all 68 tech feeds
newswatch setup -c science -c finance # multiple categories
newswatch setup -f https://example.com/rss # individual URL

newswatch run

Run the full pipeline.

newswatch run # collect → extract → index
newswatch run -n 100 # extract up to 100 articles
newswatch run -t https://example.com # also track this page for changes

Output:

Running newswatch pipeline...
        Pipeline Results
┌─────────────────────┬────────┐
│ Step                │ Result │
├─────────────────────┼────────┤
│ Feeds collected     │ 62     │
│ New articles        │ 418    │
│ Articles extracted  │ 50     │
│ Articles indexed    │ 50     │
└─────────────────────┴────────┘

newswatch search

Semantic search across collected articles.

newswatch search "AI regulation in Europe"
newswatch search "supply chain attacks" -n 10
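Semantic search ranks texts by how close their embedding vectors are to the query vector, not by shared keywords. The toy sketch below shows that ranking step with hand-written 3-dimensional vectors standing in for real embeddings. It is not embgrep's code: `cosine`, `top_n`, and the sample data are hypothetical, and the `n` parameter only mirrors what the `-n` flag appears to do (limit the number of results).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(query_vec, docs, n=5):
    """Return the n most similar documents as dicts with 'score' and 'text' keys."""
    scored = [
        {"score": round(cosine(query_vec, vec), 3), "text": text}
        for text, vec in docs
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:n]


# Toy vectors: a real index would store model-generated embeddings instead.
docs = [
    ("EU lawmakers debate AI rules", [0.9, 0.1, 0.0]),
    ("New GPU benchmark results", [0.1, 0.9, 0.1]),
    ("Parliament votes on AI act", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # pretend embedding of "AI regulation in Europe"

for hit in top_n(query, docs, n=2):
    print(hit["score"], hit["text"])
```

The two regulation headlines rank above the GPU one even though none of them contains the word "regulation", which is the point of searching by meaning.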

Python API

import asyncio

from newswatch import NewsPipeline


async def main():
    pipeline = NewsPipeline()

    # Subscribe to feeds
    await pipeline.setup(categories=["technology", "science"])

    # Run full pipeline
    result = await pipeline.run(extract_limit=100)
    print(f"{result.articles_new} new, {result.articles_indexed} indexed")

    # Semantic search
    results = pipeline.search("quantum computing breakthroughs")
    for r in results:
        print(f"  [{r['score']}] {r['text'][:80]}")

    pipeline.close()


asyncio.run(main())
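If a step raises, `pipeline.close()` in the example above would be skipped. One possible way to guard against that is a small wrapper with `try`/`finally`. The helper below is a sketch, not part of newswatch: `collect_and_search` is a hypothetical name, and it relies only on the `setup`, `run`, `search`, and `close` methods shown in this README.

```python
import asyncio


async def collect_and_search(pipeline, query: str, categories: list[str],
                             extract_limit: int = 100):
    """Run setup -> run -> search, closing the pipeline even if a step fails.

    `pipeline` is any object exposing the methods used in the README example;
    the default extract_limit is illustrative, not a documented default.
    """
    try:
        await pipeline.setup(categories=categories)
        result = await pipeline.run(extract_limit=extract_limit)
        return result, pipeline.search(query)
    finally:
        # Runs on success and on error, so resources are always released.
        pipeline.close()


# Usage (requires newswatch to be installed):
# from newswatch import NewsPipeline
# result, hits = asyncio.run(
#     collect_and_search(NewsPipeline(), "quantum computing", ["technology"])
# )
```

Passing the pipeline in as an argument keeps the helper easy to exercise with a stand-in object in tests.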

How It Works

flowchart TD
 A["🔗 feedkit\n444 curated feeds"] -->|"article URLs"| B["📄 markgrab\nHTML → clean markdown\nhttpx → Playwright fallback"]
 B -->|"markdown files"| C["🔍 embgrep\nEmbed chunks → SQLite vector index\nSmart chunking · heading-level"]
 C -->|"indexed articles"| D["📊 diffgrab\nTrack pages for changes\nStructured diffs + section analysis"]
 style A fill:#1a1a2e,stroke:#e94560,color:#fff
 style B fill:#1a1a2e,stroke:#0f3460,color:#fff
 style C fill:#1a1a2e,stroke:#533483,color:#fff
 style D fill:#1a1a2e,stroke:#e94560,color:#fff
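The diagram mentions heading-level chunking before embedding. A minimal sketch of what splitting markdown at headings could look like is shown below. It is an assumption about the general technique, not embgrep's actual chunker; `chunk_by_heading` and the sample article are hypothetical.

```python
import re

# ATX-style markdown headings: one to six '#' characters, then a space.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading.

    Simplification: a '#' line inside a fenced code block would also be
    treated as a heading; a production chunker would need to skip those.
    """
    chunks = []
    current = {"heading": "", "lines": []}

    def flush():
        # Keep a chunk only if it has a heading or some non-blank text.
        if current["heading"] or any(line.strip() for line in current["lines"]):
            chunks.append({
                "heading": current["heading"],
                "text": "\n".join(current["lines"]).strip(),
            })

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return chunks


article = "# Title\nIntro paragraph.\n\n## Details\nMore text.\n"
for chunk in chunk_by_heading(article):
    print(chunk)
```

Chunking at headings keeps each embedded unit topically coherent, which tends to give more precise search hits than fixed-size windows.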

Configuration

Data is stored in ~/.newswatch/ by default:

~/.newswatch/
├── feeds.db # feedkit subscriptions + articles
├── index.db # embgrep semantic index
├── tracker.db # diffgrab snapshots
└── extracted/ # markgrab markdown output

Custom location:

pipeline = NewsPipeline(db_dir="/path/to/data")
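To check what a data directory contains, for example before a backup, the layout above can be mapped to paths with `pathlib`. This is a sketch based solely on the file names listed above; `data_paths` is a hypothetical helper and not part of newswatch.

```python
from pathlib import Path


def data_paths(db_dir: str = "~/.newswatch") -> dict:
    """Map the documented file layout onto concrete paths under db_dir."""
    root = Path(db_dir).expanduser()
    return {
        "feeds": root / "feeds.db",      # feedkit subscriptions + articles
        "index": root / "index.db",      # embgrep semantic index
        "tracker": root / "tracker.db",  # diffgrab snapshots
        "extracted": root / "extracted", # markgrab markdown output
    }


# Report which parts of the layout exist on this machine.
for name, path in data_paths().items():
    print(f"{name:10} {'present' if path.exists() else 'missing'}  {path}")
```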

QuartzUnit Libraries Used

| Library | Role in newswatch | PyPI |
| --- | --- | --- |
| feedkit | RSS/Atom feed collection (444 curated feeds) | pip install feedkit |
| markgrab | URL → LLM-ready markdown extraction | pip install markgrab |
| embgrep | Local semantic search (fastembed + SQLite) | pip install embgrep |
| diffgrab | Web page change tracking + structured diffs | pip install diffgrab |

License

MIT


Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.
