Browser agents are hill climbing in the wrong direction.
This repository contains the full collection → replay → evaluation stack described in the research notes here: https://joan.so/learning/ml/research/browser-automation/0+main. The goal is to make it trivial to capture thousands of long-horizon, economically valuable browser trajectories, replay them offline, and grade agents with granular checkpoints.
A sample dataset of captured environments is available on Hugging Face:
🤗 josancamon/trace-environments – 6 real-world tasks with full trajectories, DOM states, and replay bundles.
Each task includes:
- Golden human trajectory with step-by-step actions
- DOM snapshots at each interaction point
- HAR recordings for offline replay
- Screenshots and video recordings
- Checkpoint annotations for partial-credit evaluation
- Record real human browsing with stealth Playwright, HAR capture, DOM diffs, screenshots, and screen video.
- Package every task as a reproducible sandbox (`data/captures/...`) that can be replayed without touching the live internet.
- Run Browser-Use agents or OpenAI Computer Use (CUA) through the same evaluation harness.
- Judge outputs with binary + checkpointed LLM graders powered by DSPy.
- Ship the workflow to non-technical collectors through a Tk desktop app or PyInstaller bundle.
- Upload datasets to GCS / Hugging Face with ready-made scripts.
Airbnb – Find an apartment under $200 in Ho Chi Minh City
Multi-step booking flow: search location → select dates → apply price filter → browse results → select listing.
Amazon – Complete a checkout with saved payment method
Full e-commerce flow: login → navigate to cart → proceed to checkout → select payment → place order.
GitHub – Search and navigate repositories
Developer workflow: login → search repositories → filter results → navigate to user profile → explore projects.
📹 See the collection process in action:
readme/collection.mp4
Warning
🚧 Work in Progress – This pipeline is functional but not battle-tested. Expect rough edges, undocumented edge cases, and breaking changes. If you run into issues, please open an issue – your bug reports help shape the roadmap and will be documented as we go.
- TRACE: Trajectory Recording and Capture Environments
- Dataset
- Highlights
- Example Tasks
- Table of Contents
- Repository Map
- Getting Started
- Record a Task (CLI)
- What Gets Captured
- Post-Processing Pipeline
- Replay Captured Environments
- Evaluation Runners
- Evaluation Dashboard
- Judges & Metrics
- Workflow Overview
- Desktop Task Collector
- Data Review & Sharing
- Configuration Reference
- Resources
web-environments/
├── src/
│   ├── main.py              # Typer CLI for recording (`web-envs`)
│   ├── browser/             # Stealth Playwright capturer
│   ├── environments/        # Offline replay + sandbox plumbing
│   ├── eval/                # Agent runners, harness, judges
│   ├── scripts/
│   │   ├── postprocessing/  # Tool calls, creds, checkpoints, ignore list
│   │   └── collection/      # Merge/view/upload scripts
│   ├── db/, config/, utils/ # Storage + helper modules
│   └── models.py            # Shared data models
├── app/                     # Tk Task Collector + PyInstaller packaging
├── data/                    # Default recording/output root (configurable)
├── readme/                  # Documentation assets (images, videos)
├── results/                 # Agent evaluation dumps (per run)
├── paper.md                 # Draft write-up / appendix
├── setup.sh                 # macOS bootstrap for collectors
├── pyproject.toml / uv.lock # Dependency + CLI entry points
└── README.md                # This file
- macOS or Linux (recording is built/tested on macOS 14+, replay works anywhere Playwright does).
- Python 3.13+ (enforced in `pyproject.toml`).
- uv package manager.
- Chrome/Chromium installed (the recorder launches the `chrome` channel by default).
curl -LsSf https://astral.sh/uv/install.sh | sh
cd web-environments
uv sync
Install Playwright browsers once per machine:
uv run playwright install chromium
Create/update .env next to pyproject.toml and set at minimum:
OPENAI_API_KEY=sk-...
# Optional, used when falling back to Kernel live browsers during eval
KERNEL_API_KEY=...
# Optional override for data root (defaults to <repo>/data)
TASK_COLLECTOR_DATA_ROOT=/absolute/path/to/storage
uv run automatically loads .env via python-dotenv.
Run:
uv run web-envs
The CLI prompts for:
- Source (where the task came from: self, mind2web seed, etc.).
- Task type (information retrieval vs. action).
- Natural-language description and target website.
Flags:
- `--dev` – auto-fill prompts with defaults for faster debugging.
- `--dev-url https://example.com` – open a URL on launch (requires `--dev`).
- `--no-browser-console` – silence console logs.
What happens:
- A stealth Playwright Chromium context launches with custom args and anti-detection scripts (`playwright-stealth` + custom JS).
- Every DOM mutation, click, scroll, navigation, screenshot, video frame, HAR entry, console line, and storage snapshot is recorded.
- Hitting Enter in the terminal (or Ctrl+C) stops recording. Information-retrieval tasks additionally prompt for the final answer before closing.
Outputs (per task):
- SQLite rows in `data/tasks.db`.
- Raw steps in `data/steps/task_<id>.jsonl`.
- Bundle in `data/captures/task_<id>/`.
- `data/screenshots/task_<id>/*.png`, `data/videos/task_<id>/*.webm`, `data/doms/task_<id>/step_<n>.txt`.
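For a quick sanity check of a recording, you can skim the raw step log directly. A minimal sketch, assuming only that each JSONL line is one event (it prints keys rather than relying on any particular field names, since the exact per-step schema is the recorder's):

```python
import json
from pathlib import Path

steps_path = Path("data/steps/task_10.jsonl")  # adjust to a task id you recorded

with steps_path.open() as f:
    for line in f:
        event = json.loads(line)
        # Print a trimmed view of each event to get a feel for the real schema.
        print(list(event.keys())[:6], str(event)[:100])
```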
The recorder performs a lossless export of everything needed for offline replay:
- Human trajectory: ordered tool-call log with DOM context inline.
- Network layer: HAR file + per-request JSON dumps + LM-assisted ignore lists.
- Storage: cookies, localStorage, IndexedDB snapshots.
- Visuals: high-rate screenshots and a 720p video stream per session.
- Console + metadata: enriched with event timings, selectors, coordinates, and page-level events (tab opened/closed, focus changes, etc.).
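Because the network layer is a standard HAR file, you can inspect what was captured with nothing more than the `json` module. A small sketch (the `recording.har` path follows the layout shown below; adjust the task id):

```python
import json
from collections import Counter
from pathlib import Path

har = json.loads(Path("data/captures/task_10/recording.har").read_text())

# HAR is a standard format: recorded requests live under log.entries.
hosts = Counter(entry["request"]["url"].split("/")[2] for entry in har["log"]["entries"])
print(hosts.most_common(10))  # hosts that dominate the capture (analytics often show up here)
```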
Default layout (configurable through TASK_COLLECTOR_DATA_ROOT):
data/
├── captures/task_<id>/manifest.json, recording.har, resources/, storage/
├── screenshots/task_<id>/*.png
├── videos/task_<id>/*.webm
├── doms/task_<id>/step_<n>.txt
├── tasks.db / tasks.db-shm / tasks.db-wal
├── tasks.jsonl
└── debug/, steps/, ...
The TRACE Environments dataset contains real examples with full capture bundles. Download and point TASK_COLLECTOR_DATA_ROOT to experiment without collecting new data.
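If you would rather start from the published bundles than record your own, one way to fetch them with the standard `huggingface_hub` client looks like this (a sketch; whether the dataset layout maps 1:1 onto the `data/` tree above is worth verifying after download):

```python
# Fetch the published TRACE environments bundle from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="josancamon/trace-environments",
    repo_type="dataset",
)
# Point the pipeline at the downloaded data in your shell, e.g.:
print(f"export TASK_COLLECTOR_DATA_ROOT={local_dir}")
```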
Run these after collecting tasks. Each command is defined in pyproject.toml and executed with uv run <script>.
- Command: `uv run postprocess-toolcalls`
- What it does: Reads `tasks.db`, converts low-level events into structured tool calls, snapshots DOM states, and stores everything in `data/tasks.jsonl`.
- Command: `uv run postprocess-credentials`
- Requires: `OPENAI_API_KEY`
- What it does: Uses DSPy + GPT-5 to detect login flows (email/password/phone, MFA, etc.) and associates them with domains so evaluations can inject credentials safely.
- Output: Augments each task entry in `tasks.jsonl` with a `credentials` array.
- Command: `uv run postprocess-set-checkpoints`
- Requires: `OPENAI_API_KEY`
- What it does: Generates at least two semantic checkpoints (index + reasoning) per task, enabling partial-credit scoring.
- Command: `uv run postprocess-determine-ignore [--force]`
- Requires: `OPENAI_API_KEY`
- What it does: Cleans the HAR recordings by removing analytics/tracking requests via pattern matching + batched LLM classification. Produces `ignored.json` and `matches.json` per capture so the replay system can stay fully offline and cache LM matches when needed.
Batch runner: `src/scripts/postprocessing/run.sh` executes all steps sequentially.
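The LM-assisted steps above share one pattern: a DSPy signature plus a predictor running against GPT-5. As a rough sketch of that pattern only (the signature and field names here are illustrative, not the repository's actual modules under `src/scripts/postprocessing/`), checkpoint generation might look like:

```python
import dspy

# Illustrative only: the real signatures live in the postprocessing scripts.
class GenerateCheckpoints(dspy.Signature):
    """Propose semantic checkpoints (step index + reasoning) for a recorded task."""
    task_description: str = dspy.InputField()
    tool_calls: str = dspy.InputField(desc="ordered human tool-call log, as text")
    checkpoints: list[str] = dspy.OutputField(desc="at least two 'index: reasoning' entries")

dspy.configure(lm=dspy.LM("openai/gpt-5"))  # reads OPENAI_API_KEY from the environment
predict = dspy.Predict(GenerateCheckpoints)

result = predict(
    task_description="Find an apartment under $200 in Ho Chi Minh City",
    tool_calls="1. goto airbnb.com\n2. search 'Ho Chi Minh City'\n...",
)
print(result.checkpoints)
```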
Use launch-environment to inspect or debug a bundle entirely offline.
uv run launch-environment data/captures/task_10 \
  --channel chrome \
  --run-human-trajectory \
  --ignore-cache
Key options:
- `--headless` / `--no-headless`
- `--channel chrome|chromium|msedge`
- `--allow-network-fallback` (default `False`) lets missing resources hit the live web.
- `--include-storage-state` restores cookies/session storage.
- `--run-human-trajectory` replays the recorded actions with original pacing.
- `--ignore-cache` disables LM match caching if you want fresh matches.
Behind the scenes the Typer app:
- Loads the capture manifest + HAR.
- Builds a Chromium context with HAR replay routing.
- Aborts any URL matched in `ignored.json`.
- Uses LM-based request matching when multiple HAR candidates resemble the same URL.
- Optionally executes the human tool-call trajectory (`TaskStepExecutor`) for visual debugging.
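Playwright's built-in HAR routing is the core primitive behind this. A minimal standalone sketch of offline replay follows; the real routing code additionally applies the ignore list and LM-based matching, and the start URL would normally come from the capture manifest rather than being hard-coded:

```python
from playwright.sync_api import sync_playwright

# Minimal offline replay: serve every request from the recorded HAR and abort
# anything without a recorded response, so nothing reaches the live web.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    context.route_from_har("data/captures/task_10/recording.har", not_found="abort")
    page = context.new_page()
    page.goto("https://www.airbnb.com/")  # in practice, taken from the manifest
    page.wait_for_timeout(5_000)          # leave the window open to poke around
    browser.close()
```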
uv run eval-run-browseruse --model gpt-5-nano
- Reads `data/tasks.jsonl`.
- Uses the sandbox bundles under `DATA_DIR/captures` by default; if a capture fails to start it falls back to Kernel live browsers (requires `KERNEL_API_KEY`).
- Captures agent DOM dumps per step in `results/<run>/doms/task_<id>/step_<n>.txt`.
- Writes JSON per task under `results/<run>/results/task_<id>.json` plus a summary metadata file.
- Limits concurrency to 1 when sandboxing to avoid cross-talk; uses up to 4 concurrent live sessions otherwise.
Result folder layout:
results/browseruse-gpt-5-nano-2025年11月18日_04-12-27/
├── doms/task_<id>/step_<n>.txt
├── logs/
├── results/task_<id>.json
└── metadata.json
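For a quick look at a run before grading, you can walk the per-task JSON files directly; a small sketch that only lists keys, since the exact schema comes from the runner:

```python
import json
from pathlib import Path

run_dir = Path("results/browseruse-gpt-5-nano-2025年11月18日_04-12-27")  # point at your own run
for task_file in sorted((run_dir / "results").glob("task_*.json")):
    result = json.loads(task_file.read_text())
    # Open one file to see what the runner actually writes before relying on keys.
    print(task_file.stem, "->", list(result.keys())[:5])
```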
All of the benchmarks referenced in this repo feed into a shared dashboard at https://web-evals.streamlit.app that standardizes ingestion, filtering, and grading outputs across datasets, making it easier to explore them side by side.
- GAIA (466 tasks) – general AI assistant benchmark with three difficulty tiers.
- Mind2Web (1,009 tasks) – real-world website interaction tasks.
- Mind2Web2 (130 tasks) – updated version with domain categorization.
- BrowseComp (1,266 tasks) – web browsing comprehension.
- WebArena (812 tasks) – realistic web navigation scenarios.
- WebVoyager (643 tasks) – long-horizon web navigation.
- REAL (113 tasks) – real-world web agent challenges with difficulty ratings.
- Bearcubs (111 tasks) – web agent evaluation tasks.
- Agent-Company (175 tasks) – domain-specific company workflows.
- OSWorld (400+ tasks) – desktop automation across Chrome, GIMP, LibreOffice, VS Code, etc.
- Interactive Streamlit dashboard for filtering, sorting, and exploration.
- REST API with full filtering, pagination, and schema stability guarantees.
- Unified schema that normalizes metadata across every benchmark.
- Advanced filtering by benchmark, difficulty, domain, website/app, and other tags.
- Global task search with full-text indexing over descriptions and instructions.
Two Typer CLIs grade agent outputs after a run.
uv run eval-judge-binary --results-dir results/browseruse-gpt-5-nano-2025年11月18日_04-12-27 --judge-model gpt-5
- Compares each model trajectory against the human one.
- Uses DSPy (`JudgeCompletion`) to decide correctness and produce reasoning + confidence.
- Writes `grade.json` inside the results directory with accuracy and failure breakdowns.
uv run eval-judge-checkpointed --results-dir results/browseruse-gpt-5-nano-2025年11月18日_04-12-27 --judge-model gpt-5
- Loads `grade.json`, finds failed tasks, and evaluates each checkpoint sequentially with partial credit (default 2 checkpoints, 0.33 score each).
- Adds `checkpoint_evaluation` and summary stats back into `grade.json`.
Both graders expect OPENAI_API_KEY to be configured and will stream multiple LLM calls, so budget accordingly.
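The partial-credit arithmetic is straightforward. Assuming the defaults above (an outright pass scores 1.0, otherwise each cleared checkpoint earns 0.33), a per-task score could be computed like this; the judges' exact aggregation may differ:

```python
def task_score(passed: bool, checkpoints_hit: int, per_checkpoint: float = 0.33) -> float:
    """Binary pass dominates; otherwise each cleared checkpoint earns partial credit."""
    if passed:
        return 1.0
    return round(checkpoints_hit * per_checkpoint, 2)

# e.g. a failed task that cleared 1 of its 2 checkpoints
print(task_score(passed=False, checkpoints_hit=1))  # 0.33
```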
- Collect: `uv run web-envs`
- Process: `uv run postprocess-toolcalls`, `uv run postprocess-credentials`, `uv run postprocess-set-checkpoints`, `uv run postprocess-determine-ignore`
- Replay / QA (optional): `uv run launch-environment data/captures/task_42`
- Evaluate agents: `uv run eval-run-browseruse --model gpt-5-nano`, `uv run python -m eval.run.openai_cua` (optional runner)
- Grade + analyze: `uv run eval-judge-binary --results-dir <results_path>`, `uv run eval-judge-checkpointed --results-dir <results_path>`
- Upload / share (see below).
The app/ directory ships a Tkinter UI for non-technical collectors:
./app/launch_task_collector.sh
# or
PYTHONPATH="$(pwd)/src:$(pwd):$PYTHONPATH" uv run python -m app.task_collector_app
Features:
- Mirrors the CLI prompts in a GUI form.
- Launches the stealth browser when the collector clicks Start Task.
- Provides an Upload Data button wired to the GCS scripts.
- Logs to the same `data/` directory configured by the backend.
uv run python app/build_release.py --target macos
This wraps the recorder with PyInstaller, bundles Playwright, and drops a ZIP under `app/dist/`. Replace `macos` with `windows` to build on Windows.
Helper scripts live in src/scripts/collection/ (run with uv run python -m ...):
- `scripts.collection.view` – inspect tasks with a simple TUI.
- `scripts.collection.merge` – merge multiple `data/` folders.
- `scripts.collection.upload_gcp_data` – upload the entire `data/` directory to `gs://mind2web-subset/data/` (requires `google-credentials.json` in the repo root).
- `scripts.collection.upload_gcp_results` – push the `results/` directory.
- `scripts.collection.upload_hf` – publish a curated subset to Hugging Face (configure dataset + token inside the script).
Example:
uv run python -m scripts.collection.upload_gcp_data
The desktop app exposes the same functionality through its UI for collectors.
- `OPENAI_API_KEY` – required for post-processing, grading, and LLM-powered evaluations.
- `KERNEL_API_KEY` – required if you disable the sandbox or allow Kernel fallbacks during agent runs.
- `RECORDER_BROWSER_CHANNEL` – Chromium channel (default `chrome`). Set to `chromium` or `msedge` if Chrome is not installed.
- `RECORDER_USER_DATA_DIR` – persistent Chrome profile (defaults to `data/user-data`).
- `TASK_COLLECTOR_DATA_ROOT` – override storage root (useful when shipping the desktop app).
- `PLAYWRIGHT_BROWSERS_PATH` – optional custom Playwright cache location.
- `DSPY_CACHE_DIR`, `UV_LINK_MODE`, etc. – advanced knobs documented inline in the respective modules.
- TRACE Environments Dataset: https://huggingface.co/datasets/josancamon/trace-environments
- Research notes & build logs: https://joan.so/learning/ml/research/browser-automation/0+main
- Benchmark tracker: https://web-evals.streamlit.app/
- See `paper.md` in this repo for the in-progress write-up and additional context.
Questions or ideas? Open an issue or drop feedback in the discussions – the roadmap is driven by making browser agents genuinely useful, not just benchmark-good.