Instead of guessing pixels, the assistant can ask the operating system for the UI tree:
list windows -> focus app -> find input -> set value -> send Enter -> read text
That means it can find a textbox by control type, set its value through the accessibility API, invoke a button, and read visible text. It only falls back to screenshots when the app does not expose useful accessibility metadata.
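To make that flow concrete, here is a minimal Python sketch. It is not CliGate's actual code: `UINode`, `find_first`, `read_text`, and `type_into_app` are hypothetical names, and the tree is an in-memory stand-in rather than a real OS accessibility API. It only illustrates the shape of the logic: look up the window, find an input by control type, set its value, read text back, and signal a screenshot fallback when no usable element is exposed. Focusing the app and sending Enter are left out because they depend on the platform API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class UINode:
    """Hypothetical stand-in for one element of an accessibility tree."""
    control_type: str                      # e.g. "Window", "Edit", "Button", "Text"
    name: str = ""
    value: str = ""
    children: list[UINode] = field(default_factory=list)

    def walk(self) -> Iterator[UINode]:
        """Yield this node and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def find_first(root: UINode, control_type: str) -> Optional[UINode]:
    """Return the first node with the given control type, or None."""
    return next((n for n in root.walk() if n.control_type == control_type), None)


def read_text(root: UINode) -> list[str]:
    """Collect the names of all non-empty Text nodes under root."""
    return [n.name for n in root.walk() if n.control_type == "Text" and n.name]


def type_into_app(windows: list[UINode], app_name: str, text: str) -> dict:
    """list windows -> find input -> set value -> read text, with a fallback."""
    window = next((w for w in windows if w.name == app_name), None)
    if window is None:
        return {"status": "window_not_found"}

    textbox = find_first(window, "Edit")
    if textbox is None:
        # No useful accessibility metadata: the caller should use screenshots.
        return {"status": "fallback_to_screenshot"}

    textbox.value = text                   # stands in for a "set value" API call
    return {"status": "ok", "visible_text": read_text(window)}


# Illustrative data: one app that exposes its tree, one that does not.
notes = UINode("Window", "Notes", children=[
    UINode("Text", "Untitled"),
    UINode("Edit", "Body"),
])
canvas = UINode("Window", "Canvas")        # exposes no child elements

windows = [notes, canvas]
```

Calling `type_into_app(windows, "Notes", "hello")` takes the accessibility path, while the same call against `"Canvas"` reports that a screenshot fallback is needed. Returning a status instead of raising keeps the fallback decision with the caller, which is one reasonable design among several.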
This is the bridge I wanted: a coding assistant that can work in repos, but also operate the desktop applications that surround the repo.
Where This Is Going
The current shape is:
- CliGate routes AI coding tools through one local server.
- Runtime sessions keep Codex and Claude Code work alive.
- The assistant watches, coordinates, and summarizes.
- Skills give it reusable procedures.
- Desktop control gives it a path into native apps and GUI workflows.
That combination changes the product from "proxy for AI tools" into "local operator for developer workflows."
I think the desktop-control layer deserves its own post, because "AI can operate any app through the OS accessibility tree" is a deeper topic than I can fit here.
The project is open source here: CliGate on GitHub
How are you handling the boundary between coding agents and the desktop apps they still need to interact with?