Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

fix(hid): retry the one-shot enumerate() through a transient probe miss (#218)#237

Open
buliwyf42 wants to merge 1 commit into
AprilNEA:master from
buliwyf42:fix/oneshot-enumerate-retry-and-watcher-classify
Open

fix(hid): retry the one-shot enumerate() through a transient probe miss (#218) #237
buliwyf42 wants to merge 1 commit into
AprilNEA:master from
buliwyf42:fix/oneshot-enumerate-retry-and-watcher-classify

Conversation

@buliwyf42

@buliwyf42 buliwyf42 commented Jun 13, 2026

Copy link
Copy Markdown

Follow-up to #222 (merged), as invited in its review.

Gap

#222's per-node NodeLedger replays a node's last good inventory across watcher ticks — that fixes the flapping device list. But the one-shot enumerate() free fn (used by the CLI list / diag commands) builds a fresh Enumerator per call, so its ledger is empty: a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two back-to-back isolated runs read 3 devices and 0" case in #218.

Fix

openlogi-hid — split the enumerate pass into enumerate_reporting_health, which also returns whether every probed node answered cleanly this pass (a probe timeout, an open failure, or a receiver read short of its pairing count is unhealthy).

  • The polling watcher keeps calling enumerate() and ignores the flag — the ledger already handles its per-node replay, so nothing changes there.
  • The one-shot enumerate() retries while a node is unhealthy, reusing the same Enumerator so its ledger accumulates a snapshot a later attempt can replay and the opened channel stays warm. Bounded to 4 attempts, 300 ms apart. fix: add a request timeout inside HidppChannel::send (fix hang root at the right layer) #226 's 5 s HidppChannel::send timeout makes a dead probe fail fast, so the retry is cheap.

A legitimately-skipped node (a receiver's secondary HID interface) reports healthy: true after a completed feature walk, so it doesn't trigger needless retries; zero candidates returns immediately (no spin).

openlogi-agent-core — factor the watcher's tick→event decision into WatchState::classify and unit-test it (invited in the #222 review; the watcher was untouched there). No behaviour change — the Err arm already kept the last snapshot. Tests assert: an enumerate failure after a success keeps the last snapshot; a clean empty Ok is still forwarded as a genuine disconnect; persistent initial failure reports Unavailable once, then recovers. (Adds async-hid as a dev-dependency only, to build an InventoryError::Hid in those tests.)

Verification

cargo fmt --all -- --check # clean
cargo clippy -p openlogi-hid -p openlogi-agent-core -p openlogi-cli --all-targets -- -D warnings # clean
cargo test -p openlogi-hid -p openlogi-agent-core # 61 + 23 pass (3 new watcher tests)

I can't build openlogi-gui locally (its Metal shaders need a full Xcode toolchain I don't have), but nothing here touches the GUI — InventoryError stays a single-variant enum with no exhaustive match anywhere, and CI covers the GUI.

Refs #218, #222.

greptile-apps[bot] reacted with thumbs up emoji

greptile-apps Bot commented Jun 13, 2026
edited
Loading

Copy link
Copy Markdown

Greptile Summary

This PR fixes the one-shot enumerate() path (used by the CLI list/diag commands) so a transient probe miss no longer surfaces as an empty or partial device list. It also refactors the watcher's per-tick state into a testable WatchState::classify method and adds 3 unit tests for it.

  • openlogi-hid: enumerate() now retries up to 4 times (300 ms apart) by reusing the same Enumerator so its NodeLedger can accumulate a snapshot across attempts. A new enumerate_reporting_health() method returns whether every probed node answered cleanly, driving the retry decision. The polling watcher is unaffected.
  • openlogi-agent-core: Tick-to-event logic is extracted into WatchState::classify; watcher behaviour is unchanged. Three new unit tests assert the three documented scenarios.

Confidence Score: 5/5

Safe to merge. The one-shot retry loop is tightly bounded (4 attempts, 300 ms apart), the polling watcher path is untouched, and the WatchState refactor is behaviour-preserving with unit-test coverage.

No defects introduced. The health flag is correctly initialised and accumulated; zero-candidates returns immediately; open failures correctly mark the pass unhealthy; Unavailable fires exactly once for the current INITIAL_FAILURE_LIMIT value; and Enumerator reuse across retries makes the ledger replay work for one-shot callers.

No files require special attention.

Important Files Changed

Filename Overview
crates/openlogi-hid/src/inventory.rs Adds enumerate_reporting_health() and wires the one-shot enumerate() free fn to retry up to 4 times on unhealthy nodes. Health tracking is correctly initialised to true and accumulated with &=; zero-candidates returns immediately; open failures mark healthy=false.
crates/openlogi-agent-core/src/watchers/inventory.rs Extracts WatchState and classify() from the poll loop; watcher behaviour is unchanged. Three new unit tests cover all documented scenarios. saturating_add ensures Unavailable fires exactly once for INITIAL_FAILURE_LIMIT=3.
crates/openlogi-agent-core/Cargo.toml Adds async-hid as a dev-dependency only, scoped to constructing InventoryError::Hid in tests. No production dependency change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
 A[enumerate free fn] --> B[fresh Enumerator]
 B --> C[enumerate_reporting_health]
 C --> D{OS-level error?}
 D -- Err --> E[Propagate, no retry]
 D -- Ok --> F{all_healthy?}
 F -- true --> G[Return inventories]
 F -- false --> H{attempt >= 4?}
 H -- yes --> G
 H -- no --> I[sleep 300ms, attempt++]
 I --> C
Loading

Reviews (2): Last reviewed commit: "fix(hid): retry the one-shot enumerate()..." | Re-trigger Greptile

Comment thread crates/openlogi-agent-core/src/watchers/inventory.rs Outdated
...ss (AprilNEA#218)
Follow-up to AprilNEA#222. The per-node `NodeLedger` from AprilNEA#222 replays a node's
last good inventory across watcher ticks, but the one-shot `enumerate()`
free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is
empty — so a transient probe miss has nothing to replay and still surfaces
as an empty/partial list. That's exactly the "two isolated runs read 3
devices and 0" case from AprilNEA#218.
- openlogi-hid: split the enumerate pass into `enumerate_reporting_health`,
 which also reports whether every probed node answered cleanly this pass
 (a probe timeout, open failure, or a receiver read short of its pairing
 count is unhealthy). The polling watcher ignores the flag (the ledger
 handles replay); the one-shot `enumerate()` retries on it, reusing the
 same enumerator so its ledger accumulates a snapshot to replay and the
 channel stays warm. AprilNEA#226's 5 s `HidppChannel::send` timeout keeps a dead
 probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap.
- openlogi-agent-core: factor the watcher tick->event decision into
 `WatchState::classify` (no behaviour change — `Err` already kept the last
 snapshot) and unit-test it: an enumerate failure after a success keeps the
 last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and
 persistent initial failure reports `Unavailable` once then recovers.
 (Invited in the AprilNEA#222 review; the watcher was untouched by AprilNEA#222.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@buliwyf42 buliwyf42 force-pushed the fix/oneshot-enumerate-retry-and-watcher-classify branch from e08443d to f682591 Compare June 13, 2026 16:29

Copy link
Copy Markdown
Author

Re: the Unifying health-signal P2 — good catch, but that coarseness is by design from #222, not introduced here. The Unifying path only surfaces online devices (via arrival events) — it doesn't poll offline slots the way Bolt does — so paired.len() == pairing_count can't be its health signal: a legitimately-offline Unifying device would fall short of the count every cycle and wedge it into a permanent "unhealthy" retry. Hence the coarser pairing_count.is_some().

This PR's one-shot retry just keys off whatever healthy the probe already reports, so it inherits that asymmetry rather than changing it. A tighter Unifying signal (e.g. an online-count baseline) is a reasonable follow-up, but it has to account for offline devices and arrival-drain timing, so I'd keep it out of this small retry-focused change. (The #218 repro is a Bolt receiver, where the signal is already tight.)

Addressed the loop-range nit, and rebased onto master to resolve the #223 conflict — back to MERGEABLE.

FWIW: I re-tested this branch against the real repro rig (macOS 27 beta, MX Master 4 on Bolt). The one-shot retry demonstrably works — openlogi list hit 3 deep-probe timeouts but returned all 3 devices on the 4th attempt. But the deeper #218 failure (a sustained deep-probe hang that only a receiver replug clears) is separate and survives; I've written it up on #218.

buliwyf42 added a commit to buliwyf42/OpenLogi that referenced this pull request Jun 13, 2026
... the whole receiver (AprilNEA#218)
On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.
The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.
Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).
Refs AprilNEA#218, AprilNEA#222.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
buliwyf42 added a commit to buliwyf42/OpenLogi that referenced this pull request Jun 13, 2026
... the whole receiver (AprilNEA#218)
On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.
The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.
Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).
Refs AprilNEA#218, AprilNEA#222.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

@greptile-apps greptile-apps[bot] greptile-apps[bot] left review comments

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

1 participant

AltStyle によって変換されたページ (->オリジナル) /