A minimal REST microservice that automates a local Chrome/Chromium instance with Selenium. It mirrors other Hydra services:
- Creates and uses a private virtualenv on first run.
- Bootstraps dependencies and writes a `.env` file with sane defaults.
- Applies global rate limiting and a concurrency guard to Selenium operations.
- Exposes endpoints for browser lifecycle, navigation, DOM capture, scrolling, history, and screenshots.
- Streams structured Server-Sent Events (SSE) so clients can react in real time.
- Self-bootstrapping: Creates `.venv`, upgrades `pip`, installs Python deps, and writes `.env` on first run.
- Chrome driver discovery: Tries Selenium Manager, snap/system `chromedriver`, `webdriver-manager`, and architecture-specific fallbacks (x86_64/ARM64).
- Single active browser per process with session metadata and event queues.
- Token-bucket rate limiting per client IP.
- Bounded concurrency around Selenium operations.
- SSE event stream (`/events`) for `status`, `dom`, and `frame` updates.
- Screenshots are saved to `./frames/` and available inline (base64) and via static file serving.

- Python 3.9+
- Chrome or Chromium installed
  - Optionally set `CHROME_BIN` to point at the browser binary.
- Chromedriver (usually auto-managed)
  - The service attempts: Selenium Manager → snap/system chromedriver → `webdriver-manager` (x86_64) → Chrome-for-Testing ARM64 download → `snap install chromium` fallback.
  - Some fallbacks may require `sudo`.

    git clone https://github.com/robit-man/web-scrape-service.git && cd web-scrape-service/scrape && python3 web_scrape.py

    # On first run, it will:
    # - Create .venv/
    # - Install dependencies
    # - Write .env with defaults
    # - Re-exec under the virtualenv and start the server

    # 2) In another terminal, set convenience vars:
    cd /path/to/hydra-scrape
    export API="http://127.0.0.1:8130"
    export KEY="$(grep '^SCRAPE_API_KEY=' .env | cut -d= -f2)"

    # 3) Check health:
    curl -s "$API/health" | jq

    # 4) Start a browser session (headless by default via .env):
    curl -s -X POST "$API/session/start" \
      -H "Content-Type: application/json" \
      -H "X-API-Key: $KEY" \
      -d '{"headless": true}' | jq

    # → Save "session_id" from the response:
    export SID="<value from response>"

    # 5) Navigate to a URL:
    curl -s -X POST "$API/navigate" \
      -H "Content-Type: application/json" \
      -H "X-API-Key: $KEY" \
      -d '{"sid":"'"$SID"'", "url":"https://example.com"}' | jq

    # 6) Take a screenshot:
    curl -s "$API/screenshot?sid=$SID" -H "X-API-Key: $KEY" | jq

    # 7) Stream events (SSE):
    curl -N "$API/events?sid=$SID" -H "X-API-Key: $KEY"

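
The same start, navigate, and screenshot flow can be driven from Python. The following is a minimal client sketch using only the standard library; the helper names (`build_request`, `call`, `quick_tour`) are hypothetical and not part of the service, and the base URL assumes the default port shown above.

```python
import json
import urllib.request
from typing import Optional

API = "http://127.0.0.1:8130"  # assumption: default port from the scaffolded .env


def build_request(path: str, key: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET (no payload) or a JSON POST request that carries the API key."""
    headers = {"X-API-Key": key}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(API + path, data=data, headers=headers)


def call(path: str, key: str, payload: Optional[dict] = None, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body (requires a running service)."""
    with urllib.request.urlopen(build_request(path, key, payload), timeout=timeout) as resp:
        return json.load(resp)


def quick_tour(key: str, url: str = "https://example.com") -> dict:
    """Start a session, navigate, and take a screenshot, mirroring steps 4 to 6."""
    sid = call("/session/start", key, {"headless": True})["session_id"]
    call("/navigate", key, {"sid": sid, "url": url})
    return call(f"/screenshot?sid={sid}", key)
```

Calling `quick_tour(key)` with the value of `SCRAPE_API_KEY` should return the screenshot response, provided the service is running locally.
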
The service listens on SCRAPE_BIND:SCRAPE_PORT (defaults: 0.0.0.0:8130).
Configuration is read from .env in the script directory (auto-generated on first run). Defaults:
| Variable | Default | Description |
|---|---|---|
| `SCRAPE_API_KEY` | random UUID | API key used when auth is required. |
| `SCRAPE_BIND` | `0.0.0.0` | Bind address for the Flask server. |
| `SCRAPE_PORT` | `8130` | Server port. |
| `SCRAPE_REQUIRE_AUTH` | `0` | Set `1`/`true` to require API key. |
| `SCRAPE_MAX_CONCURRENCY` | `2` | Max concurrent Selenium ops. |
| `SCRAPE_QUEUE_TIMEOUT_S` | `0` | Max time to wait for a concurrency slot (0 = block). |
| `SCRAPE_RATE_LIMIT_RPS` | `10` | Token refill rate per IP (requests/sec). |
| `SCRAPE_RATE_LIMIT_BURST` | `20` | Max tokens (burst) per IP. |
| `SCRAPE_FILE_TTL_S` | `900` | TTL for files in `./frames` (seconds). |
| `SCRAPE_FRAME_KEEPALIVE_S` | `45` | SSE keepalive heartbeat interval (seconds). |
| `SCRAPE_HEADLESS_DEFAULT` | `1` | Default headless mode for browser sessions. |
| `CHROME_BIN` | (unset) | Optional path to Chrome/Chromium binary. |

Note: Code defaults may differ if `.env` values are removed; the scaffold above is what the script writes initially.
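
To illustrate how these variables could be consumed, here is a small sketch that reads them from an environment mapping and falls back to the scaffold defaults from the table. The `ScrapeSettings` dataclass and `load_settings` function are hypothetical names, and the fallback values are assumed to match the scaffold; as the note says, the script's own code defaults may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def as_bool(value: str) -> bool:
    """Treat "1" and "true" (any case) as enabled; everything else as disabled."""
    return value.strip().lower() in {"1", "true"}


@dataclass(frozen=True)
class ScrapeSettings:
    bind: str
    port: int
    require_auth: bool
    max_concurrency: int
    queue_timeout_s: float
    rate_limit_rps: float
    rate_limit_burst: int
    file_ttl_s: int
    frame_keepalive_s: int
    headless_default: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ScrapeSettings:
    """Read SCRAPE_* values; missing keys use the scaffold defaults (an assumption)."""
    return ScrapeSettings(
        bind=env.get("SCRAPE_BIND", "0.0.0.0"),
        port=int(env.get("SCRAPE_PORT", "8130")),
        require_auth=as_bool(env.get("SCRAPE_REQUIRE_AUTH", "0")),
        max_concurrency=int(env.get("SCRAPE_MAX_CONCURRENCY", "2")),
        queue_timeout_s=float(env.get("SCRAPE_QUEUE_TIMEOUT_S", "0")),
        rate_limit_rps=float(env.get("SCRAPE_RATE_LIMIT_RPS", "10")),
        rate_limit_burst=int(env.get("SCRAPE_RATE_LIMIT_BURST", "20")),
        file_ttl_s=int(env.get("SCRAPE_FILE_TTL_S", "900")),
        frame_keepalive_s=int(env.get("SCRAPE_FRAME_KEEPALIVE_S", "45")),
        headless_default=as_bool(env.get("SCRAPE_HEADLESS_DEFAULT", "1")),
    )
```
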

- Virtualenv: The script re-execs itself under `./.venv` and installs dependencies: `Flask`, `Flask-Cors`, `python-dotenv`, `requests`, `beautifulsoup4`, `lxml`, `selenium`, `webdriver-manager`, `pillow`.
- Frames directory: Screenshots are written to `./frames/`. A background cleaner removes files older than `SCRAPE_FILE_TTL_S`.
- Sessions: A single active browser (Chrome/Chromium) per process. Starting a new session clears previous session metadata and queues.
- Logging: Timestamps and levels are printed to stderr/stdout.

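
The frames cleanup described above can be pictured as a periodic sweep over `./frames/`. This is a sketch of one sweep only, assuming file age is judged by modification time; the function name `purge_old_frames` is hypothetical and the service's actual cleaner may work differently.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_old_frames(frames_dir: Path, ttl_s: float, now: Optional[float] = None) -> List[str]:
    """Delete regular files older than ttl_s seconds; return the removed names, sorted."""
    now = time.time() if now is None else now
    removed = []
    for path in frames_dir.iterdir():
        if not path.is_file():
            continue  # leave subdirectories alone
        if now - path.stat().st_mtime > ttl_s:
            path.unlink(missing_ok=True)  # tolerate a concurrent delete
            removed.append(path.name)
    return sorted(removed)
```

A background thread could call this in a loop with a sleep between sweeps, passing `SCRAPE_FILE_TTL_S` as `ttl_s`.
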
All success responses use the shape: {"ok": true, ...}. Errors use {"ok": false, "error": "<message>"} with appropriate HTTP status.
Basic service status.
Response

    {
      "status": "ok",
      "browser_open": false,
      "sessions": 0
    }

Start or close the single browser session.

Request (start)

    { "headless": true }

Response (start)

    {
      "ok": true,
      "session_id": "<sid>",
      "message": "Browser launched (...)",
      "headless": true
    }

Response (close)

    { "ok": true, "message": "Browser closed" }

Starting a session emits an SSE `status` event with `msg="browser_started"` and the new `sid`.

POST /navigate • POST /click • POST /type • POST /scroll • POST /scroll/up • POST /scroll/down • POST /history/back • POST /history/forward
High-level browser actions.
Requests

- `POST /navigate`: `{ "sid": "<sid>", "url": "https://example.com" }`
- `POST /click`: `{ "sid": "<sid>", "selector": "a.primary" }`
- `POST /type`: `{ "sid": "<sid>", "selector": "input[name=q]", "text": "hydra\n" }`
- `POST /scroll` (down by default): `{ "sid": "<sid>", "amount": 600 }`
- `POST /scroll/up`: `{ "sid": "<sid>", "amount": 600 }`
- `POST /scroll/down`: `{ "sid": "<sid>", "amount": 600 }`
- `POST /history/back` and `POST /history/forward`: `{ "sid": "<sid>" }`

Responses

    { "ok": true, "message": "..." }

Each successful call queues a `status` event on the session SSE stream.

Click by viewport coordinates (useful when CSS selectors are difficult).
Request

    {
      "sid": "<sid>",
      "x": 512, "y": 384,                    // viewport space
      "viewportW": 1280, "viewportH": 800,   // current viewport size (required)
      "naturalW": 1280, "naturalH": 800      // page's "natural" width/height (optional)
    }

Response

    {
      "ok": true,
      "message": "click_xy",
      "detail": {
        "ok": true,
        "tag": "A",
        "rect": { "x": 100, "y": 200, "width": 120, "height": 20 }
      }
    }

The service computes a scale from `viewport*` to `natural*` and clicks the element at the transformed point. On success, a `status` event is emitted.

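
One plausible reading of that transform is a per-axis linear scale. The sketch below assumes the scale is `natural / viewport` on each axis and that omitted `natural*` values default to the viewport size; neither detail is confirmed by the text above, and `to_natural` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def to_natural(
    x: float,
    y: float,
    viewport_w: float,
    viewport_h: float,
    natural_w: Optional[float] = None,
    natural_h: Optional[float] = None,
) -> Tuple[float, float]:
    """Map a viewport-space point into the page's natural coordinate space."""
    if viewport_w <= 0 or viewport_h <= 0:
        raise ValueError("viewportW and viewportH must be positive")
    # Assumption: a missing natural size means no scaling on that axis.
    scale_x = (natural_w or viewport_w) / viewport_w
    scale_y = (natural_h or viewport_h) / viewport_h
    return x * scale_x, y * scale_y
```

With the request shown above (natural size equal to viewport size), the point is unchanged; with a natural size twice the viewport, both coordinates double.
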
Return a DOM snapshot (outerHTML) truncated to ~200,000 characters.
Query

    /dom?sid=<sid>

Response

    {
      "ok": true,
      "dom": "<!doctype html> ...",
      "length": 123456
    }

Emits an SSE `dom` event with `chars=<length>`.

Capture a PNG screenshot and return both an inline base64 and a file path under /frames.
Query

    /screenshot?sid=<sid>

Response

    {
      "ok": true,
      "file": "/frames/2a8f...c1.png",
      "width": 1920,
      "height": 1080,
      "mime": "image/png",
      "b64": "<base64 data>"
    }

Also emits an SSE `frame` event with file path, dimensions, MIME, and base64.

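
A client that wants the image on disk can decode the inline `b64` field rather than fetching the file separately. This is a small sketch using the `ok`, `file`, and `b64` fields from the response above; `save_screenshot` is a hypothetical helper name.

```python
import base64
from pathlib import Path


def save_screenshot(resp: dict, out_dir: Path) -> Path:
    """Write the base64 PNG from a screenshot response into out_dir."""
    if not resp.get("ok"):
        raise ValueError(resp.get("error", "screenshot failed"))
    # Reuse the server-side file name, dropping the /frames/ prefix.
    target = out_dir / Path(resp["file"]).name
    target.write_bytes(base64.b64decode(resp["b64"]))
    return target
```
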
Serve a previously captured image from ./frames/.
Example
GET /frames/2a8f...c1.png
Event stream for a session.
Query

    /events?sid=<sid>

Response
- Content-Type: `text/event-stream`
- Messages are emitted as `data: {...}\n\n` JSON payloads.
- Periodic `":\n\n"` comments are sent as keepalives every `SCRAPE_FRAME_KEEPALIVE_S` seconds.

Example

    curl -N -H "X-API-Key: $KEY" "$API/events?sid=$SID"

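
On the consuming side, the stream can be parsed line by line: comment lines starting with `:` are keepalives, `data:` lines carry JSON, and a blank line ends a message. The following sketch works on any iterable of text lines (for example, a decoded HTTP response body); `iter_sse_events` is a hypothetical name, and other SSE fields such as `event:` or `id:` are simply ignored here.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE message, skipping keepalive comments."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # keepalive comment
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space after the colon
            data_parts.append(value)
            continue
        if line == "" and data_parts:
            # A blank line terminates the message; multi-line data joins with "\n".
            yield json.loads("\n".join(data_parts))
            data_parts = []
```
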
If `SCRAPE_REQUIRE_AUTH=1` (or `true`), requests must include either:
- `X-API-Key: <SCRAPE_API_KEY>`
- `Authorization: Bearer <SCRAPE_API_KEY>`

Otherwise, endpoints return `401` with `{"ok": false, "error": "unauthorized"}`.
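
The check described above might look roughly like the following. This is an illustrative sketch, not the service's code: `is_authorized` is a hypothetical name, the constant-time comparison is a design choice made here, and the plain mapping lookup is case-sensitive, whereas real HTTP frameworks usually match header names case-insensitively.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], api_key: str, require_auth: bool) -> bool:
    """Accept the key from X-API-Key or from an Authorization: Bearer header."""
    if not require_auth:
        return True
    presented = headers.get("X-API-Key", "")
    if not presented:
        auth = headers.get("Authorization", "")
        if auth.startswith("Bearer "):
            presented = auth[len("Bearer "):]
    # compare_digest avoids leaking key contents through timing differences.
    return bool(presented) and hmac.compare_digest(presented.encode(), api_key.encode())
```
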
A token bucket is applied per client IP:
- Refill: `SCRAPE_RATE_LIMIT_RPS` tokens/sec
- Burst capacity: `SCRAPE_RATE_LIMIT_BURST`
- On depletion: `429` with `{"ok": false, "error":"rate limit"}` and `Retry-After: 1`.
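
For readers unfamiliar with the technique, here is a generic per-IP token bucket sketch. It follows the refill and burst semantics listed above but is not taken from the service; the class name `TokenBucketLimiter` and the injectable `clock` parameter are assumptions made for illustration and testing.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-key token bucket: refills at `rps` tokens/sec up to `burst` tokens."""

    def __init__(self, rps: float, burst: float, clock: Callable[[], float] = time.monotonic):
        self.rps = rps
        self.burst = burst
        self.clock = clock
        self._lock = threading.Lock()
        self._state: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last timestamp)

    def allow(self, ip: str) -> bool:
        """Spend one token for this IP if available; False maps to a 429 response."""
        with self._lock:
            now = self.clock()
            tokens, last = self._state.get(ip, (self.burst, now))
            # Refill for the elapsed time, never exceeding the burst capacity.
            tokens = min(self.burst, tokens + (now - last) * self.rps)
            allowed = tokens >= 1.0
            if allowed:
                tokens -= 1.0
            self._state[ip] = (tokens, now)
            return allowed
```

A new IP starts with a full bucket, so a client may send up to `burst` requests at once before being limited to the steady refill rate.
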
A global `BoundedSemaphore(SCRAPE_MAX_CONCURRENCY)` throttles Selenium operations. Each endpoint acquires a slot:
- If `SCRAPE_QUEUE_TIMEOUT_S > 0`, requests wait up to that many seconds before returning `503` with `{"ok": false, "error":"scrape at capacity"}`.
- If `SCRAPE_QUEUE_TIMEOUT_S == 0` (default in scaffold), requests block until a slot is available.
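
The two waiting modes can be sketched with `threading.BoundedSemaphore` directly. The wrapper below is illustrative only: `CapacityGuard` and its `run` method are hypothetical names, and returning a `(body, status)` tuple is an assumption about how a handler might surface the result.

```python
import threading
from typing import Callable, Tuple


class CapacityGuard:
    """Run operations under a bounded number of concurrent slots."""

    def __init__(self, max_concurrency: int, queue_timeout_s: float):
        self._sem = threading.BoundedSemaphore(max_concurrency)
        self._timeout = queue_timeout_s

    def run(self, op: Callable[[], dict]) -> Tuple[dict, int]:
        if self._timeout > 0:
            # Wait up to the timeout for a slot.
            acquired = self._sem.acquire(timeout=self._timeout)
        else:
            # A timeout of 0 means block until a slot frees up.
            acquired = self._sem.acquire()
        if not acquired:
            return {"ok": False, "error": "scrape at capacity"}, 503
        try:
            return op(), 200
        finally:
            self._sem.release()
```
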
Events are JSON objects emitted on /events for a given sid.

- Status: `{ "type": "status", "msg": "browser_started", "detail": "Browser launched (...)", "sid": "<sid>", "ts": 1710000000000 }`
  Other `msg` values include navigation and interactions (e.g., `"Clicked <selector>"`, `"Scrolled by <n>"`).
- DOM: `{ "type": "dom", "chars": 123456, "ts": 1710000000000 }`
- Frame: `{ "type": "frame", "file": "/frames/2a8f...c1.png", "width": 1920, "height": 1080, "mime": "image/png", "b64": "<base64>", "ts": 1710000000000 }`

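
A consumer will typically branch on the `type` field. As a small sketch, the hypothetical `describe_event` helper below turns each event into a one-line summary and deliberately leaves the `b64` payload out, which keeps logs short.

```python
def describe_event(event: dict) -> str:
    """Summarize a status, dom, or frame event without echoing large payloads."""
    kind = event.get("type")
    if kind == "status":
        return f"status: {event.get('msg')}"
    if kind == "dom":
        return f"dom: {event.get('chars')} chars"
    if kind == "frame":
        return f"frame: {event.get('file')} ({event.get('width')}x{event.get('height')})"
    return f"unknown: {kind}"
```
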
- Single browser: Only one browser session is managed at a time. Calling `/session/start` clears previous session metadata and queues.
- Driver selection: The service tries multiple strategies (Selenium Manager, snap/system, `webdriver-manager`, architecture-specific installers). Some paths use `sudo` and may prompt if not pre-authorized.
- Headless: Default comes from `SCRAPE_HEADLESS_DEFAULT`; can be overridden per session with `{"headless": true/false}` in `/session/start`.
- CORS: Enabled for all routes.

- `browser not open` / `no dom (browser closed?)` / `409`: Start a session first: `POST /session/start`.
- `rate limit` / `429`: Slow down requests or raise limits via `.env`.
- `scrape at capacity` / `503`: Increase `SCRAPE_MAX_CONCURRENCY`, lower request volume, or adjust `SCRAPE_QUEUE_TIMEOUT_S`.
- Chromedriver errors: Ensure Chrome/Chromium is installed and versions match. You may set `CHROME_BIN` and install a matching `chromedriver` on `PATH`. On ARM64/x86_64, the service attempts to self-install; failures here may require manual setup.

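
On the client side, the `429` and `503` cases lend themselves to automatic retries, while `409` needs a session to be started first. The sketch below is one possible policy, not something the service prescribes: `retry_delay` is a hypothetical helper, and the exponential backoff values for `503` are arbitrary choices.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str], attempt: int) -> Optional[float]:
    """Return seconds to wait before retrying, or None when a retry will not help."""
    if status == 429:
        # Honor Retry-After when it parses as a number; fall back to one second.
        try:
            return float(headers.get("Retry-After", "1"))
        except ValueError:
            return 1.0
    if status == 503:
        # Capped exponential backoff: 0.5s, 1s, 2s, ... up to 8s.
        return min(0.5 * (2 ** attempt), 8.0)
    return None  # e.g. 409: start a session instead of retrying
```
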
- Treat `SCRAPE_API_KEY` as a secret; rotate it regularly.
- When exposing beyond localhost, run behind a reverse proxy with TLS.
- Headless automation can interact with arbitrary websites; restrict access to trusted clients and consider network egress controls.
- SSE responses may include base64 screenshots; ensure consumers handle sensitive content appropriately.
- Be cautious with endpoints that execute page JavaScript (e.g., `click_xy` logic uses `document.elementFromPoint` and `el.click()`).