Shared/mutable runners make this sneaky. At Semaphore we run each job in a fresh ephemeral VM so "same code → same CI state" is actually true rather than just assumed. The structured test reports help too — they carry stable IDs per test across runs, so the "missing test evidence" check has something real to anchor against rather than whatever happened to show up in that build's log.
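Rough sketch of what I mean by anchoring on stable IDs, assuming each run's report can be reduced to a set of test IDs. The function name and the ID format are made up for illustration; this is not Semaphore's report schema or evidenceSnapshot's actual logic.

```python
def missing_tests(baseline_ids: set[str], current_ids: set[str]) -> list[str]:
    """Return test IDs seen in the baseline run but absent from the current run."""
    # Set difference only means something if IDs are stable across runs;
    # sorting makes the result deterministic.
    return sorted(baseline_ids - current_ids)


# Hypothetical IDs, purely for illustration.
baseline = {"auth::login_ok", "auth::login_bad_password", "billing::refund"}
current = {"auth::login_ok", "billing::refund"}

print(missing_tests(baseline, current))  # ['auth::login_bad_password']
```

If the IDs were derived from log order or line numbers instead, the same diff would flag renames and reorderings as missing tests.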
Also curious: is evidenceSnapshot meant to be portable across CI platforms, or is it specific to GitHub Actions (GHA)? The permission escalation check in particular seems like it'd look pretty different outside of GHA.
(Disclosure: I work at Semaphore.)