-
-
Notifications
You must be signed in to change notification settings - Fork 91
fix(hid): retry the one-shot enumerate() through a transient probe miss (#218)#237
fix(hid): retry the one-shot enumerate() through a transient probe miss (#218) #237buliwyf42 wants to merge 1 commit into
Conversation
Greptile SummaryThis PR fixes the one-shot
Confidence Score: 5/5Safe to merge. The one-shot retry loop is tightly bounded (4 attempts, 300 ms apart), the polling watcher path is untouched, and the WatchState refactor is behaviour-preserving with unit-test coverage. No defects introduced. The health flag is correctly initialised and accumulated; zero-candidates returns immediately; open failures correctly mark the pass unhealthy; Unavailable fires exactly once for the current INITIAL_FAILURE_LIMIT value; and Enumerator reuse across retries makes the ledger replay work for one-shot callers. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[enumerate free fn] --> B[fresh Enumerator]
B --> C[enumerate_reporting_health]
C --> D{OS-level error?}
D -- Err --> E[Propagate, no retry]
D -- Ok --> F{all_healthy?}
F -- true --> G[Return inventories]
F -- false --> H{attempt >= 4?}
H -- yes --> G
H -- no --> I[sleep 300ms, attempt++]
I --> C
Reviews (2): Last reviewed commit: "fix(hid): retry the one-shot enumerate()..." | Re-trigger Greptile |
...ss (AprilNEA#218) Follow-up to AprilNEA#222. The per-node `NodeLedger` from AprilNEA#222 replays a node's last good inventory across watcher ticks, but the one-shot `enumerate()` free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is empty — so a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two isolated runs read 3 devices and 0" case from AprilNEA#218. - openlogi-hid: split the enumerate pass into `enumerate_reporting_health`, which also reports whether every probed node answered cleanly this pass (a probe timeout, open failure, or a receiver read short of its pairing count is unhealthy). The polling watcher ignores the flag (the ledger handles replay); the one-shot `enumerate()` retries on it, reusing the same enumerator so its ledger accumulates a snapshot to replay and the channel stays warm. AprilNEA#226's 5 s `HidppChannel::send` timeout keeps a dead probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap. - openlogi-agent-core: factor the watcher tick->event decision into `WatchState::classify` (no behaviour change — `Err` already kept the last snapshot) and unit-test it: an enumerate failure after a success keeps the last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and persistent initial failure reports `Unavailable` once then recovers. (Invited in the AprilNEA#222 review; the watcher was untouched by AprilNEA#222.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
e08443d to
f682591
Compare
buliwyf42
commented
Jun 13, 2026
Re: the Unifying health-signal P2 — good catch, but that coarseness is by design from #222, not introduced here. The Unifying path only surfaces online devices (via arrival events) — it doesn't poll offline slots the way Bolt does — so paired.len() == pairing_count can't be its health signal: a legitimately-offline Unifying device would fall short of the count every cycle and wedge it into a permanent "unhealthy" retry. Hence the coarser pairing_count.is_some().
This PR's one-shot retry just keys off whatever healthy the probe already reports, so it inherits that asymmetry rather than changing it. A tighter Unifying signal (e.g. an online-count baseline) is a reasonable follow-up, but it has to account for offline devices and arrival-drain timing, so I'd keep it out of this small retry-focused change. (The #218 repro is a Bolt receiver, where the signal is already tight.)
Addressed the loop-range nit, and rebased onto master to resolve the #223 conflict — back to MERGEABLE.
FWIW: I re-tested this branch against the real repro rig (macOS 27 beta, MX Master 4 on Bolt). The one-shot retry demonstrably works — openlogi list hit 3 deep-probe timeouts but returned all 3 devices on the 4th attempt. But the deeper #218 failure (a sustained deep-probe hang that only a receiver replug clears) is separate and survives; I've written it up on #218.
... the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
... the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to #222 (merged), as invited in its review.
Gap
#222's per-node
NodeLedgerreplays a node's last good inventory across watcher ticks — that fixes the flapping device list. But the one-shotenumerate()free fn (used by the CLIlist/diagcommands) builds a freshEnumeratorper call, so its ledger is empty: a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two back-to-back isolated runs read 3 devices and 0" case in #218.Fix
openlogi-hid— split the enumerate pass intoenumerate_reporting_health, which also returns whether every probed node answered cleanly this pass (a probe timeout, an open failure, or a receiver read short of its pairing count is unhealthy).enumerate()and ignores the flag — the ledger already handles its per-node replay, so nothing changes there.enumerate()retries while a node is unhealthy, reusing the sameEnumeratorso its ledger accumulates a snapshot a later attempt can replay and the opened channel stays warm. Bounded to 4 attempts, 300 ms apart. fix: add a request timeout inside HidppChannel::send (fix hang root at the right layer) #226 's 5 sHidppChannel::sendtimeout makes a dead probe fail fast, so the retry is cheap.A legitimately-skipped node (a receiver's secondary HID interface) reports
healthy: trueafter a completed feature walk, so it doesn't trigger needless retries; zero candidates returns immediately (no spin).openlogi-agent-core— factor the watcher's tick→event decision intoWatchState::classifyand unit-test it (invited in the #222 review; the watcher was untouched there). No behaviour change — theErrarm already kept the last snapshot. Tests assert: an enumerate failure after a success keeps the last snapshot; a clean emptyOkis still forwarded as a genuine disconnect; persistent initial failure reportsUnavailableonce, then recovers. (Addsasync-hidas a dev-dependency only, to build anInventoryError::Hidin those tests.)Verification
I can't build
openlogi-guilocally (its Metal shaders need a full Xcode toolchain I don't have), but nothing here touches the GUI —InventoryErrorstays a single-variant enum with no exhaustive match anywhere, and CI covers the GUI.Refs #218, #222.