OS AI Computer Use
Local agent for desktop automation. It currently integrates Anthropic Computer Use (Claude) but is architected to be provider‐agnostic: the LLM layer is abstracted behind LLMClient, so OpenAI Computer Use (and others) can be added with minimal changes.
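As an illustration of that abstraction, a provider client could look roughly like the sketch below; the interface name matches the one mentioned above, but the method names and signatures are assumptions, not the repository's actual code.

# Illustrative sketch only; the real LLMClient in this repo may differ.
from typing import Any, Protocol

import anthropic  # third-party SDK; reads ANTHROPIC_API_KEY from the environment


class LLMClient(Protocol):
    """Provider-agnostic chat interface assumed by the agent loop."""

    def send(self, messages: list[dict[str, Any]], tools: list[dict[str, Any]]) -> dict[str, Any]:
        """Send conversation history plus tool definitions; return the raw model response."""
        ...


class AnthropicClient:
    """Hypothetical adapter mapping the generic interface onto the Anthropic SDK."""

    def __init__(self, model: str, max_tokens: int = 4096) -> None:
        self._client = anthropic.Anthropic()
        self._model = model
        self._max_tokens = max_tokens

    def send(self, messages: list[dict[str, Any]], tools: list[dict[str, Any]]) -> dict[str, Any]:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=self._max_tokens,
            messages=messages,
            tools=tools,
        )
        return response.model_dump()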
What this project is:
- A provider‐agnostic Computer Use agent with a stable tool interface
- An OS‐agnostic execution layer using ports/drivers (macOS and Windows today)
- A CLI you can bundle into a single executable for local use
What it is not (yet):
- A remote SaaS; this is a local agent
- A finished set of drivers for every OS/desktop (Linux Wayland has limits for synthetic input)
Highlights:
- Smooth mouse movement, clicks, drag‐and‐drop with easing and timing controls
- Reliable keyboard input (robust Enter on macOS), hotkeys and hold sequences
- Screenshots (Quartz on macOS or PyAutoGUI fallback), on‐disk saving and base64 tool_result
- Detailed logs and running cost estimation per iteration and total
- Multiple chats
- Image upload
- Voice input
- AI API Agnostic
See provider architecture in docs/architecture-universal-llm.md, OS ports/drivers in docs/os-architecture.md, and packaging notes in docs/ci-packaging.md.
Requirements:
- macOS 13+ or Windows 10/11 (unit tests run on any OS; GUI tests require macOS or a self-hosted Windows runner)
- Python 3.12+
- Anthropic API key: ANTHROPIC_API_KEY (for now; OpenAI planned)
Install:
# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate
# install dependencies
make install
# (optional) install local packages in editable mode (mono-repo dev)
make dev-install
macOS permissions (for GUI automation):
make macos-perms  # opens System Settings → Privacy & Security panels
Grant permissions to Terminal/iTerm and your venv Python under: Accessibility, Input Monitoring, Screen Recording.
Run the agent (CLI):
export ANTHROPIC_API_KEY=sk-ant-...
python main.py --provider anthropic --debug --task "Open Safari, search for 'macOS automation', scroll, make a screenshot"
# 1) Open Chrome, search in Google, take a screenshot
python main.py --provider anthropic --task "Open Chrome, focus the address bar, type google.com, search for 'computer use AI', open first result, scroll down and take a screenshot"
# 2) Copy/paste workflow in a text editor
python main.py --provider anthropic --task "Open TextEdit, create a new document, type 'Hello world!', select all and copy, create another document and paste"
# 3) Window management + hotkeys
python main.py --provider anthropic --task "Open System Settings, search for 'Privacy', navigate to Privacy & Security, disable GEO"
# 4) Precise drag operations
python main.py --provider anthropic --task "In Finder, open Downloads, switch to icon view, drag the first file to Desktop"
Useful make targets:
make install                     # install top-level dependencies
make test                        # unit tests
RUN_CURSOR_TESTS=1 make itest    # GUI integration tests (macOS; requires permissions)
make itest-local-keyboard        # run keyboard harness
make itest-local-click           # run click/drag harness
For development with backend + frontend (Flutter UI):
# (optional) create and activate venv
python -m venv .venv && source .venv/bin/activate
# install Python dependencies
make install
# install local packages in editable mode for mono-repo dev
make dev-install
# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
# (optional) enable debug mode
export OS_AI_BACKEND_DEBUG=1
# Start backend on 127.0.0.1:8765
os-ai-backend
# Or run directly via Python module
# python -m os_ai_backend.app
Backend environment variables (optional):
- OS_AI_BACKEND_HOST - host address (default: 127.0.0.1)
- OS_AI_BACKEND_PORT - port number (default: 8765)
- OS_AI_BACKEND_DEBUG - enable debug logging (default: 0)
- OS_AI_BACKEND_TOKEN - authentication token (optional)
- OS_AI_BACKEND_CORS_ORIGINS - allowed CORS origins (default: http://localhost,http://127.0.0.1)
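For reference, the backend presumably resolves these variables with straightforward defaults, along the lines of this sketch (variable names and defaults come from the list above; the actual os_ai_backend code may differ):

import os

# Sketch of settings resolution; defaults mirror the documented values.
HOST = os.environ.get("OS_AI_BACKEND_HOST", "127.0.0.1")
PORT = int(os.environ.get("OS_AI_BACKEND_PORT", "8765"))
DEBUG = os.environ.get("OS_AI_BACKEND_DEBUG", "0") == "1"
TOKEN = os.environ.get("OS_AI_BACKEND_TOKEN")  # optional; None means no auth token
CORS_ORIGINS = os.environ.get(
    "OS_AI_BACKEND_CORS_ORIGINS", "http://localhost,http://127.0.0.1"
).split(",")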
Backend endpoints:
- GET /healthz - health check
- WS /ws - WebSocket for JSON-RPC commands
- POST /v1/files - file upload
- GET /v1/files/{file_id} - file download
- GET /metrics - metrics snapshot
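A quick smoke test of the REST endpoints from Python, using only the standard library (the endpoint paths come from the list above; the response bodies are printed as opaque text since their exact format is not documented here):

import urllib.request

BASE = "http://127.0.0.1:8765"

# Health check: GET /healthz should respond once os-ai-backend is running.
with urllib.request.urlopen(f"{BASE}/healthz", timeout=5) as resp:
    print("healthz:", resp.status, resp.read().decode())

# Metrics snapshot: GET /metrics, printed as raw text.
with urllib.request.urlopen(f"{BASE}/metrics", timeout=5) as resp:
    print("metrics:", resp.read().decode())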
cd frontend_flutter
# Install Flutter dependencies
flutter pub get
# Run on macOS
flutter run -d macos
# Or run on other platforms
# flutter run -d chrome   # web
# flutter run -d windows  # Windows
Frontend config (in code):
- Default backend WebSocket: ws://127.0.0.1:8765/ws
- Default REST base: http://127.0.0.1:8765
See frontend_flutter/README.md for more details on the Flutter app architecture and features.
- Smooth mouse motion: easing, distance‐based durations
- Clicks with modifiers: modifiers: "cmd+shift" for click/down/up
- Drag control: hold_before_ms, hold_after_ms, steps, step_delay
- Keyboard input: key, hold_key; robust Enter on macOS via Quartz
- Screenshots: Quartz (macOS) or PyAutoGUI fallback; optional downscale for model display
- Logging and cost: per‐iteration and total usage/cost with 429 retry logic
- OS‐agnostic execution: core depends only on OS ports; drivers are loaded per OS (see docs/os-architecture.md).
- macOS (supported):
- Full driver set with overlay (AppKit), robust Enter (Quartz), screenshots (Quartz/PyAutoGUI), sounds (NSSound).
- Integration tests available; requires Accessibility, Input Monitoring, Screen Recording.
- Single‐file CLI bundle via make build-macos-bundle.
 
- Windows (implemented, not yet integration‐tested):
- Drivers for mouse/keyboard/screen via PyAutoGUI; overlay and sound are baseline no-ops.
- Unit contract tests exist; for GUI tests use a self‐hosted Windows runner (see docs/windows-integration-testing.md).
- Single‐file CLI bundle via make build-windows-bundle (build on Windows).
 
- Linux: not provided out‐of‐the‐box. X11 can support synthetic input (XTest), while Wayland often restricts it. Contributions welcome.
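To make the ports/drivers split concrete, a mouse port and per-OS driver selection could be sketched as follows; the interface and names are illustrative assumptions, not the repository's actual API (see docs/os-architecture.md for the real design):

import sys
from typing import Protocol

import pyautogui  # cross-platform fallback used by this sketch


class MousePort(Protocol):
    """What the core agent needs from a mouse driver (illustrative)."""

    def move(self, x: int, y: int, duration: float = 0.3) -> None: ...
    def click(self, x: int, y: int, button: str = "left") -> None: ...


class PyAutoGuiMouse:
    """Cross-platform baseline driver built on PyAutoGUI."""

    def move(self, x: int, y: int, duration: float = 0.3) -> None:
        pyautogui.moveTo(x, y, duration=duration)

    def click(self, x: int, y: int, button: str = "left") -> None:
        pyautogui.click(x, y, button=button)


def load_mouse_driver() -> MousePort:
    """Pick a driver per OS; the real macOS driver is Quartz-backed rather than this fallback."""
    if sys.platform == "darwin":
        pass  # a Quartz/AppKit driver would be wired in here in the actual project
    return PyAutoGuiMouse()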
Key options (partial list):
- Coordinates/calibration
- COORD_X_SCALE, COORD_Y_SCALE, COORD_X_OFFSET, COORD_Y_OFFSET
- Post‐move correction: POST_MOVE_VERIFY, POST_MOVE_TOLERANCE_PX, POST_MOVE_CORRECTION_DURATION
 
- Screenshots
- SCREENSHOT_MODE (native|downscale)
- VIRTUAL_DISPLAY_ENABLED, VIRTUAL_DISPLAY_WIDTH_PX, VIRTUAL_DISPLAY_HEIGHT_PX
- SCREENSHOT_FORMAT (PNG|JPEG), SCREENSHOT_JPEG_QUALITY
 
- Overlay
- PREMOVE_HIGHLIGHT_ENABLED, PREMOVE_HIGHLIGHT_DEFAULT_DURATION, PREMOVE_HIGHLIGHT_RADIUS, colors
 
- Model/tool
- MODEL_NAME, COMPUTER_TOOL_TYPE, COMPUTER_BETA_FLAG, MAX_TOKENS
- ALLOW_PARALLEL_TOOL_USE
 
See the configuration file for the full list and inline comments.
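As a worked example of the calibration options above, the scale/offset settings presumably map model-space coordinates onto physical screen pixels roughly as follows; the exact formula and the example values are assumptions:

# Assumed mapping: screen = model * scale + offset, per axis.
COORD_X_SCALE, COORD_Y_SCALE = 1.0, 1.0
COORD_X_OFFSET, COORD_Y_OFFSET = 0, 0


def model_to_screen(x: float, y: float) -> tuple[int, int]:
    """Convert a coordinate produced by the model into a screen coordinate."""
    return (
        round(x * COORD_X_SCALE + COORD_X_OFFSET),
        round(y * COORD_Y_SCALE + COORD_Y_OFFSET),
    )


# With the defaults above this is a no-op; on a 2x Retina display using
# SCREENSHOT_MODE=downscale, both scales would plausibly be 2.0.
print(model_to_screen(640, 400))  # -> (640, 400)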
The agent expects blocks with action and parameters:
- Mouse movement
{"action":"mouse_move","coordinate":[x,y],"coordinate_space":"auto|screen|model","duration":0.35,"tween":"linear"}- Clicks
{"action":"left_click","coordinate":[x,y],"modifiers":"cmd+shift"}- Key press / hold
{"action":"key","key":"cmd+l"}
{"action":"hold_key","key":"ctrl+shift+t"}- Drag‐and‐drop
{
 "action":"left_click_drag",
 "start":[x1,y1],
 "end":[x2,y2],
 "modifiers":"shift",
 "hold_before_ms":80,
 "hold_after_ms":80,
 "steps":4,
 "step_delay":0.02
}- Scroll
{"action":"scroll","coordinate":[x,y],"scroll_direction":"down|up|left|right","scroll_amount":3}- Typing
{"action":"type","text":"Hello, world!"}- Screenshot
{"action":"screenshot"}Responses are returned as a list of tool_result content blocks (text/image). Screenshots are base64‐encoded.
Unit tests (no real GUI):
make test
Integration (real OS tests, macOS; Windows via self‐hosted runner):
export RUN_CURSOR_TESTS=1
make itest
If macOS blocks automation, tests are skipped. Grant permissions with make macos-perms and retry.
Windows integration testing options are described in docs/windows-integration-testing.md.
Recommended setup: Flutter as a pure UI over a local Python service:
- Transport: WebSocket + JSON‐RPC for chat/commands, REST for files
- Streams: screenshots (JPEG/PNG), logs, events
- Example notes: docs/flutter.md
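As a rough illustration of that transport, a client sends JSON-RPC frames over the WebSocket; the method name below is hypothetical and the example assumes the third-party websockets package:

import asyncio
import json

import websockets  # pip install websockets


async def main() -> None:
    async with websockets.connect("ws://127.0.0.1:8765/ws") as ws:
        # Hypothetical JSON-RPC request; see the backend docs for the real method names.
        await ws.send(json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "task.start",  # assumed method name
            "params": {"text": "Open Safari and take a screenshot"},
        }))
        print(json.loads(await ws.recv()))


asyncio.run(main())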
To run backend + frontend in development mode, see the Development Mode section above.
Note: project code and docs use English.
- Fork → feature branch → PR
- Code style: readable, explicit names, avoid deep nesting
- Tests: add unit tests and integration tests when applicable
- Before PR:
make test
RUN_CURSOR_TESTS=1 make itest  # optional if GUI interactions changed
- Commit messages: clear and atomic
Architecture, packaging and testing docs:
- OS Ports & Drivers: docs/os-architecture.md
- Packaging & CI: docs/ci-packaging.md
- Windows integration testing: docs/windows-integration-testing.md
- Code style: CODE_STYLE.md
- Contributing: CONTRIBUTING.md
Packaging (single executable bundles):
- macOS: make build-macos-bundle → dist/agent_core/agent_core
- Windows: make build-windows-bundle → dist/agent_core/agent_core.exe
Apache License 2.0. Preserve NOTICE when distributing.
- See LICENSE and NOTICE at the repository root.
- Cursor/keyboard don’t work (macOS): grant permissions in System Settings → Privacy & Security (Accessibility, Input Monitoring, Screen Recording) for Terminal and current Python.
- Integration tests skipped: restart the terminal, ensure the same interpreter is used (which python, python -c 'import sys; print(sys.executable)').
- Screenshots empty/missing overlay: enable Screen Recording; check screenshot mode settings.
Open issues and PRs in this repository. Attribution is listed in NOTICE.