fix(hid): retry the one-shot enumerate() through a transient probe miss (#218)#237

Open

buliwyf42 wants to merge 1 commit into

AprilNEA:master from

buliwyf42:fix/oneshot-enumerate-retry-and-watcher-classify

Open

fix(hid): retry the one-shot enumerate() through a transient probe miss (#218) #237
buliwyf42 wants to merge 1 commit into
AprilNEA:master from
buliwyf42:fix/oneshot-enumerate-retry-and-watcher-classify

Conversation

@buliwyf42

@buliwyf42 buliwyf42 commented Jun 13, 2026

Copy link

Copy Markdown

Follow-up to #222 (merged), as invited in its review.

Gap

#222's per-node NodeLedger replays a node's last good inventory across watcher ticks — that fixes the flapping device list. But the one-shot enumerate() free fn (used by the CLI list / diag commands) builds a fresh Enumerator per call, so its ledger is empty: a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two back-to-back isolated runs read 3 devices and 0" case in #218.

Fix

openlogi-hid — split the enumerate pass into enumerate_reporting_health, which also returns whether every probed node answered cleanly this pass (a probe timeout, an open failure, or a receiver read short of its pairing count is unhealthy).

The polling watcher keeps calling enumerate() and ignores the flag — the ledger already handles its per-node replay, so nothing changes there.
The one-shot enumerate() retries while a node is unhealthy, reusing the same Enumerator so its ledger accumulates a snapshot a later attempt can replay and the opened channel stays warm. Bounded to 4 attempts, 300 ms apart. fix: add a request timeout inside HidppChannel::send (fix hang root at the right layer) #226 's 5 s HidppChannel::send timeout makes a dead probe fail fast, so the retry is cheap.

A legitimately-skipped node (a receiver's secondary HID interface) reports healthy: true after a completed feature walk, so it doesn't trigger needless retries; zero candidates returns immediately (no spin).

openlogi-agent-core — factor the watcher's tick→event decision into WatchState::classify and unit-test it (invited in the #222 review; the watcher was untouched there). No behaviour change — the Err arm already kept the last snapshot. Tests assert: an enumerate failure after a success keeps the last snapshot; a clean empty Ok is still forwarded as a genuine disconnect; persistent initial failure reports Unavailable once, then recovers. (Adds async-hid as a dev-dependency only, to build an InventoryError::Hid in those tests.)

Verification

cargo fmt --all -- --check # clean
cargo clippy -p openlogi-hid -p openlogi-agent-core -p openlogi-cli --all-targets -- -D warnings # clean
cargo test -p openlogi-hid -p openlogi-agent-core # 61 + 23 pass (3 new watcher tests)

I can't build openlogi-gui locally (its Metal shaders need a full Xcode toolchain I don't have), but nothing here touches the GUI — InventoryError stays a single-variant enum with no exhaustive match anywhere, and CI covers the GUI.

Refs #218, #222.

@greptile-apps

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading

Copy link

Copy Markdown

Greptile Summary

This PR fixes the one-shot enumerate() path (used by the CLI list/diag commands) so a transient probe miss no longer surfaces as an empty or partial device list. It also refactors the watcher's per-tick state into a testable WatchState::classify method and adds 3 unit tests for it.

openlogi-hid: enumerate() now retries up to 4 times (300 ms apart) by reusing the same Enumerator so its NodeLedger can accumulate a snapshot across attempts. A new enumerate_reporting_health() method returns whether every probed node answered cleanly, driving the retry decision. The polling watcher is unaffected.
openlogi-agent-core: Tick-to-event logic is extracted into WatchState::classify; watcher behaviour is unchanged. Three new unit tests assert the three documented scenarios.

Confidence Score: 5/5

Safe to merge. The one-shot retry loop is tightly bounded (4 attempts, 300 ms apart), the polling watcher path is untouched, and the WatchState refactor is behaviour-preserving with unit-test coverage.

No defects introduced. The health flag is correctly initialised and accumulated; zero-candidates returns immediately; open failures correctly mark the pass unhealthy; Unavailable fires exactly once for the current INITIAL_FAILURE_LIMIT value; and Enumerator reuse across retries makes the ledger replay work for one-shot callers.

No files require special attention.

Important Files Changed

Filename	Overview
crates/openlogi-hid/src/inventory.rs	Adds enumerate_reporting_health() and wires the one-shot enumerate() free fn to retry up to 4 times on unhealthy nodes. Health tracking is correctly initialised to true and accumulated with &=; zero-candidates returns immediately; open failures mark healthy=false.
crates/openlogi-agent-core/src/watchers/inventory.rs	Extracts WatchState and classify() from the poll loop; watcher behaviour is unchanged. Three new unit tests cover all documented scenarios. saturating_add ensures Unavailable fires exactly once for INITIAL_FAILURE_LIMIT=3.
crates/openlogi-agent-core/Cargo.toml	Adds async-hid as a dev-dependency only, scoped to constructing InventoryError::Hid in tests. No production dependency change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
 A[enumerate free fn] --> B[fresh Enumerator]
 B --> C[enumerate_reporting_health]
 C --> D{OS-level error?}
 D -- Err --> E[Propagate, no retry]
 D -- Ok --> F{all_healthy?}
 F -- true --> G[Return inventories]
 F -- false --> H{attempt >= 4?}
 H -- yes --> G
 H -- no --> I[sleep 300ms, attempt++]
 I --> C

_{Reviews (2): Last reviewed commit: "fix(hid): retry the one-shot enumerate()..." | Re-trigger Greptile}

greptile-apps[bot]

greptile-apps Bot reviewed

Jun 13, 2026

View reviewed changes

crates/openlogi-agent-core/src/watchers/inventory.rs Outdated

@buliwyf42 buliwyf42 mentioned this pull request

Jun 13, 2026

[Bug]: Device list flaps (appears then vanishes ~2s) + device images never load — Bolt receiver, macOS 27 beta #218

Closed

15 tasks

@buliwyf42 @claude


 fix(hid): retry the one-shot enumerate() through a transient probe mi...

f682591

...ss (AprilNEA#218)
Follow-up to AprilNEA#222. The per-node `NodeLedger` from AprilNEA#222 replays a node's
last good inventory across watcher ticks, but the one-shot `enumerate()`
free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is
empty — so a transient probe miss has nothing to replay and still surfaces
as an empty/partial list. That's exactly the "two isolated runs read 3
devices and 0" case from AprilNEA#218.
- openlogi-hid: split the enumerate pass into `enumerate_reporting_health`,
 which also reports whether every probed node answered cleanly this pass
 (a probe timeout, open failure, or a receiver read short of its pairing
 count is unhealthy). The polling watcher ignores the flag (the ledger
 handles replay); the one-shot `enumerate()` retries on it, reusing the
 same enumerator so its ledger accumulates a snapshot to replay and the
 channel stays warm. AprilNEA#226's 5 s `HidppChannel::send` timeout keeps a dead
 probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap.
- openlogi-agent-core: factor the watcher tick->event decision into
 `WatchState::classify` (no behaviour change — `Err` already kept the last
 snapshot) and unit-test it: an enumerate failure after a success keeps the
 last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and
 persistent initial failure reports `Unavailable` once then recovers.
 (Invited in the AprilNEA#222 review; the watcher was untouched by AprilNEA#222.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@buliwyf42 buliwyf42 force-pushed the fix/oneshot-enumerate-retry-and-watcher-classify branch from e08443d to f682591 Compare

June 13, 2026 16:29

@buliwyf42

buliwyf42 commented Jun 13, 2026

Copy link

Copy Markdown

Author

Re: the Unifying health-signal P2 — good catch, but that coarseness is by design from #222, not introduced here. The Unifying path only surfaces online devices (via arrival events) — it doesn't poll offline slots the way Bolt does — so paired.len() == pairing_count can't be its health signal: a legitimately-offline Unifying device would fall short of the count every cycle and wedge it into a permanent "unhealthy" retry. Hence the coarser pairing_count.is_some().

This PR's one-shot retry just keys off whatever healthy the probe already reports, so it inherits that asymmetry rather than changing it. A tighter Unifying signal (e.g. an online-count baseline) is a reasonable follow-up, but it has to account for offline devices and arrival-drain timing, so I'd keep it out of this small retry-focused change. (The #218 repro is a Bolt receiver, where the signal is already tight.)

Addressed the loop-range nit, and rebased onto master to resolve the #223 conflict — back to MERGEABLE.

FWIW: I re-tested this branch against the real repro rig (macOS 27 beta, MX Master 4 on Bolt). The one-shot retry demonstrably works — openlogi list hit 3 deep-probe timeouts but returned all 3 devices on the 4th attempt. But the deeper #218 failure (a sustained deep-probe hang that only a receiver replug clears) is separate and survives; I've written it up on #218.

@buliwyf42 buliwyf42 mentioned this pull request

Jun 13, 2026

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218) #251

Open

buliwyf42 added a commit to buliwyf42/OpenLogi that referenced this pull request

Jun 13, 2026

@buliwyf42 @claude


 fix(hid): bound the Bolt per-slot probe so one hung device can't drop...

1d42ddf

... the whole receiver (AprilNEA#218)
On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.
The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.
Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).
Refs AprilNEA#218, AprilNEA#222.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

buliwyf42 added a commit to buliwyf42/OpenLogi that referenced this pull request

Jun 13, 2026

@buliwyf42 @claude


 fix(hid): bound the Bolt per-slot probe so one hung device can't drop...

34314ba

... the whole receiver (AprilNEA#218)
On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.
The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.
Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).
Refs AprilNEA#218, AprilNEA#222.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Labels

None yet

1 participant

@buliwyf42

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(hid): retry the one-shot enumerate() through a transient probe miss (#218)#237

fix(hid): retry the one-shot enumerate() through a transient probe miss (#218) #237
buliwyf42 wants to merge 1 commit into
AprilNEA:master from
buliwyf42:fix/oneshot-enumerate-retry-and-watcher-classify

Conversation

@buliwyf42 buliwyf42 commented Jun 13, 2026

Gap

Fix

Verification

Uh oh!

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

@buliwyf42 buliwyf42 commented Jun 13, 2026

Gap

Fix

Verification

Uh oh!

greptile-apps Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading