-
Notifications
You must be signed in to change notification settings - Fork 62
feat: add coordinate space abstraction for open weights LLM support#282
feat: add coordinate space abstraction for open weights LLM support #282philipph-askui wants to merge 10 commits into
Conversation
Replace fixed SCREENSHOT_RESOLUTION constant with per-provider image scalers. Each VlmProvider now owns an ImageScaler callable and exposes max_image_edge (also via ASKUI_VLM_MAX_IMAGE_EDGE env var). Facades derive target_resolution dynamically from scaler output.
@programminx-askui
programminx-askui
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello,
I've check the PR. Overall nice. We should introduce a ClickableCapability/ClickableTarget.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything what is images related should start with image_. Similare like model_. Otherwise it's hard to understand the zusammenhänge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but is this not a property of the image scaler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. This value is only used if users do not specify an ImageScaler explicitly but the default one is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure, that we can use the lambda? I had bad experience with lambda in python, that they caach the computation. But can't remember
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored this to that we now have a proper ImageScaler base class with subclasses that are used here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why so complicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored it to use a proper imagescaler class i/o lambdas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove it if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean the env variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The screen resolution can change over time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this value will be update with every new screenshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of lambda in python. What is the reason to use them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the lambdas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a note.
The "CoordinateScaler" should be on the DisplayCapability/DisplayTarget/WindowTarget/WindowCapability.
But here we assume, that we have only one ClickableTarget.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at the moment it does not matter after all what target we are scaling for. The llm sees a screenshot and predicts a position on it. All of what we are doing is scaling the predicted coordinates to the correct absolute image coordinates of the screenshot we got
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this function? Should this not replaced from the ScaleCoordinate Layer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the scaleCoordinate Layer :-D or what do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this? can we move this to a own function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code was copied directly from an example from anthropic, so I would keep it here as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you checked the tests? I would add here some dynamic tests. with different resolutions and negative tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Introduces VlmCoordinateSpace strategy (pixel, scaled, normalized) so agentOS facades can map model-emitted coordinates to screen pixels.
Thereby adds auto-detection for Qwen, Holo (0-1000 grid) and Kimi (0.0-1.0 floats) in OllamaVlmProvider.
Appends coordinate info to system prompts for OpenAI-compatible providers.