Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

choism4/websource

Repository files navigation

websource

Node.js License: MIT

Turn websites into reusable structured data sources — through conversation, not code.

websource is a local-first CLI tool that analyzes a URL, detects extractable fields (title, price, image, date...), and generates a reusable extraction config that runs on demand or on a schedule. All data stays on your machine in SQLite.

Quick start

git clone https://github.com/2fe2000/websource.git
cd websource
npm install
npx playwright install chromium
# Interactive setup wizard
npx tsx bin/websource.ts init https://books.toscrape.com

Commands

Command Description
init [url] Guided setup for a new data source
scan <url> Analyze a page without saving
sources list List all sources
sources show <id> Show source details
preview <id> Dry-run extraction (no DB write)
extract <id> Run extraction and save
diff <id> Show changes since last run
schedule <id> <expr> Set a cron refresh schedule
serve Start local REST API + scheduler
export <id> Export to JSON/CSV
doctor Run health checks

Claude Code skill (optional)

If you use Claude Code, you can run the interactive wizard from any chat session with /websource or natural language like "scrape this URL".

bash scripts/install-skill.sh

Configuration

All config is optional. Copy .env.example to .env to override defaults:

Variable Default Description
WEBSOURCE_DATA_DIR ~/.local/share/websource Database and log location
WEBSOURCE_CONFIG_DIR ~/.config/websource Config file location
LOG_LEVEL warn trace / debug / info / warn / error

Data storage

All extracted data is stored locally in a single SQLite database:

~/.local/share/websource/
├── websource.db ← all data
└── logs/ ← log files (production mode only)
Table Contents
sources Source list (name, URL, status)
extraction_configs Field selectors, fetchMode, and other settings
runs Extraction run history (time, record counts, status)
snapshots The actual extracted records
diffs Added / changed / removed records between runs
schedules Cron schedule settings

Export extracted data:

# JSON
npx tsx bin/websource.ts export <sourceId> --format json
# CSV
npx tsx bin/websource.ts export <sourceId> --format csv
# REST API
npx tsx bin/websource.ts serve
# GET http://localhost:3847/sources/:id/data

Change the storage location — add to .env:

WEBSOURCE_DATA_DIR=/your/custom/path

Architecture

  • Node.js + TypeScript (ESM, strict)
  • Cheerio for static HTML parsing, Playwright for JS-rendered pages
  • SQLite (better-sqlite3) for all local persistence
  • Fastify for the local REST API
  • node-cron for scheduling

See docs/ARCHITECTURE.md for details.

Documentation

Contributing

See CONTRIBUTING.md.

License

MIT — see LICENSE.

About

Turn websites into reusable structured data sources — through conversation, not code.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

Contributors

Languages

  • TypeScript 99.6%
  • Other 0.4%

AltStyle によって変換されたページ (->オリジナル) /