Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

feat: add coordinate space abstraction for open weights LLM support#282

Open
philipph-askui wants to merge 10 commits into
main from
feat/llm_coordinate_system
Open

feat: add coordinate space abstraction for open weights LLM support #282
philipph-askui wants to merge 10 commits into
main from
feat/llm_coordinate_system

Conversation

@philipph-askui

@philipph-askui philipph-askui commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Introduces VlmCoordinateSpace strategy (pixel, scaled, normalized) so agentOS facades can map model-emitted coordinates to screen pixels.
Thereby adds auto-detection for Qwen, Holo (0-1000 grid) and Kimi (0.0-1.0 floats) in OllamaVlmProvider.
Appends coordinate info to system prompts for OpenAI-compatible providers.

Replace fixed SCREENSHOT_RESOLUTION constant with per-provider image
scalers. Each VlmProvider now owns an ImageScaler callable and exposes
max_image_edge (also via ASKUI_VLM_MAX_IMAGE_EDGE env var). Facades
derive target_resolution dynamically from scaler output.

@programminx-askui programminx-askui left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello,

I've check the PR. Overall nice. We should introduce a ClickableCapability/ClickableTarget.

Comment on lines +81 to +82
image_scaler: ImageScaler | None = None,
max_image_edge: int | None = None,

@programminx-askui programminx-askui Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything what is images related should start with image_. Similare like model_. Otherwise it's hard to understand the zusammenhänge

Suggested change
image_scaler: ImageScaler | None = None,
max_image_edge: int | None = None,
image_scaler: ImageScaler | None = None,
image_edge_max: int | None = None,

@programminx-askui programminx-askui Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but is this not a property of the image scaler?

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. This value is only used if users do not specify an ImageScaler explicitly but the default one is used.

if self._image_scaler_override is not None:
return self._image_scaler_override
max_edge = self._max_edge
return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)

@programminx-askui programminx-askui Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure, that we can use the lambda? I had bad experience with lambda in python, that they caach the computation. But can't remember

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this to that we now have a proper ImageScaler base class with subclasses that are used here instead.

Comment on lines +74 to +92
self._image_scaler_override = image_scaler
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)

@property
@override
def model_id(self) -> str:
return self._model_id_value

@property
@override
def image_scaler(self) -> ImageScaler:
if self._image_scaler_override is not None:
return self._image_scaler_override
max_edge = self._max_edge
return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)

@programminx-askui programminx-askui Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why so complicated?

Suggested change
self._image_scaler_override = image_scaler
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)
@property
@override
def model_id(self) -> str:
return self._model_id_value
@property
@override
def image_scaler(self) -> ImageScaler:
if self._image_scaler_override is not None:
return self._image_scaler_override
max_edge = self._max_edge
return lambda image: compute_patch_optimized_image(image, max_edge=max_edge)
self._image_scaler = image_scaler if image_scaler else lambda image: compute_patch_optimized_image(image, max_edge=max_edge)
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)
@property
@override
def model_id(self) -> str:
return self._model_id_value

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored it to use a proper imagescaler class i/o lambdas

Comment on lines +75 to +79
self._max_edge = (
max_image_edge
or int(os.environ.get("ASKUI_VLM_MAX_IMAGE_EDGE", "0"))
or _DEFAULT_MAX_IMAGE_EDGE
)

@programminx-askui programminx-askui Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove it if possible

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean the env variable?

Comment on lines +44 to 46
self._scaler.real_screen_resolution = self._agent_os.screenshot(
report=False
).size

@programminx-askui programminx-askui Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The screen resolution can change over time

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this value will be update with every new screenshot

self._scaler = CoordinateScaler(
coordinate_space=coordinate_space,
image_scaler=image_scaler,
fetch_real_resolution=lambda: self._agent_os.screenshot(report=False).size,

@programminx-askui programminx-askui Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of lambda in python. What is the reason to use them?

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the lambdas

self._agent_os: AndroidAgentOs = agent_os
self._target_resolution: Tuple[int, int] = (1024, 768)
self._real_screen_resolution: Optional[Tuple[int, int]] = None
self._scaler = CoordinateScaler(

@programminx-askui programminx-askui Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a note.

The "CoordinateScaler" should be on the DisplayCapability/DisplayTarget/WindowTarget/WindowCapability.

But here we assume, that we have only one ClickableTarget.

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the moment it does not matter after all what target we are scaling for. The llm sees a screenshot and predicts a position on it. All of what we are doing is scaling the predicted coordinates to the correct absolute image coordinates of the screenshot we got

self.target_resolution = scaled.size
return scaled

def scale_coordinates(

@programminx-askui programminx-askui Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this function? Should this not replaced from the ScaleCoordinate Layer?

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the scaleCoordinate Layer :-D or what do you mean?

Comment on lines +71 to +80
# Binary search for largest scale that fits within token budget
lo, hi = 0.0, scale
for _ in range(50):
mid = (lo + hi) / 2
w = max(1, int(width * mid))
h = max(1, int(height * mid))
if count_image_tokens(w, h, patch_size) <= max_tokens:
lo = mid
else:
hi = mid

@programminx-askui programminx-askui Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? can we move this to a own function?

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code was copied directly from an example from anthropic, so I would keep it here as is

PixelCoordinateSpace,
ScaledCoordinateSpace,
)
from askui.tools.android.agent_os_facade import AndroidAgentOsFacade

@programminx-askui programminx-askui Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you checked the tests? I would add here some dynamic tests. with different resolutions and negative tests

@philipph-askui philipph-askui Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@programminx-askui programminx-askui programminx-askui left review comments

@mlikasam-askui mlikasam-askui Awaiting requested review from mlikasam-askui

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

AltStyle によって変換されたページ (->オリジナル) /