permissions: contents: write escalation on PR #1 and, at identical risk, says nothing on PR #2. So the real test for "make this deterministic" is "can I afford this check to miss differently every run," not "is it mechanical" — and a couple of judgment-flavored checks (did this PR widen the trust boundary?) end up on the deterministic side for that reason alone.
The part I'd push hardest on: determinism buys you reproducibility, not salience. "Start in warn mode, promote low-noise findings" reads like a politeness step, but the noise level is the load-bearing variable. A check that fires on every dependency bump trains reviewers to wave the yellow banner through — and then the one real postinstall backdoor rides in under that same banner. Determinism guarantees the check runs the same way every time; it guarantees nothing about whether a human reads it. So warn-mode's real job isn't gentleness, it's measuring each signal's precision in this repo — a deterministic check with a 2% hit rate is functionally an LLM reviewer nobody trusts, just reached by a different road.
One more on #4: for an agent PR, the presence of a matching test is weaker evidence than for a human one. The same agent wrote the code and the test against the same (possibly wrong) reading of the intent, so a green matching test certifies self-consistency, not correctness. "Tests changed" should raise confidence less when the test author and the code author are the same process — which is the actual argument for keeping #4 as evidence and never a gate.
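As an added worked example (not part of the comment above, and not anyone's production check), this is the shape a deterministic "did this PR widen the trust boundary?" check can take for GitHub Actions workflows: it compares the effective GITHUB_TOKEN permissions of every job in the head workflow against the base, and also flags newly added triggers that run with a privileged token against fork-controlled input (the trigger part generalizes beyond the comment's `permissions:` example). Inputs are workflow files already parsed into dicts, the shape `yaml.safe_load` produces; the two sample PRs are constructed so that the same risk arrives in two different textual shapes, which is exactly the case where a check must not "miss differently every run". The scope list, the `read` default for an absent `permissions:` block, and the set of privileged triggers are assumptions stated in the code.

```python
"""Deterministic trust-boundary widening check for GitHub Actions workflows.

Added example. Workflows are passed as dicts (the output of yaml.safe_load);
sample workflows below are constructed, not taken from any real repository.
"""

from dataclasses import dataclass
from typing import Any

# GITHUB_TOKEN permission levels, least to most privilege.
LEVEL_RANK = {"none": 0, "read": 1, "write": 2}

# Scopes a workflow can request. Assumption: mirrors GitHub's documented
# GITHUB_TOKEN scopes; extend if new scopes are introduced.
SCOPES = ("actions", "checks", "contents", "deployments", "id-token", "issues",
          "packages", "pages", "pull-requests", "security-events", "statuses")

# Assumption: triggers that hand a privileged token to fork-controlled input.
PRIVILEGED_TRIGGERS = {"pull_request_target", "workflow_run"}


@dataclass(frozen=True)
class Finding:
    kind: str    # "permission" or "trigger"
    where: str   # job name, or "workflow" for file-level changes
    detail: str


def normalize(perms: Any, default_level: str) -> dict[str, str]:
    """Expand a `permissions:` value into an explicit scope -> level map.

    None                       -> block absent; every scope gets the repo default
                                  (assumption: caller passes the repo setting)
    'read-all' / 'write-all'   -> every scope at that level
    {}                         -> every scope 'none'
    mapping                    -> listed scopes as given, unlisted scopes 'none'
    """
    if perms is None:
        return {s: default_level for s in SCOPES}
    if isinstance(perms, str):
        level = perms.removesuffix("-all")
        return {s: level for s in SCOPES}
    return {s: perms.get(s, "none") for s in SCOPES}


def effective(wf: dict, job: str, default_level: str) -> dict[str, str]:
    """A job-level `permissions:` block replaces the workflow-level one entirely.
    A job missing from this workflow inherits the workflow-level block, which is
    the right baseline for a job the PR adds."""
    job_block = wf.get("jobs", {}).get(job, {})
    perms = job_block["permissions"] if "permissions" in job_block else wf.get("permissions")
    return normalize(perms, default_level)


def triggers(wf: dict) -> set[str]:
    """`on:` may be a string, a list, or a mapping keyed by event name.
    PyYAML reads the bare key `on` as boolean True, so accept both keys."""
    on = wf.get("on", wf.get(True))
    if on is None:
        return set()
    if isinstance(on, str):
        return {on}
    return set(on)


def trust_boundary_widenings(base: dict, head: dict, default_level: str = "read") -> list[Finding]:
    """Return every way `head` is more privileged than `base`. Empty list = no widening."""
    findings: list[Finding] = []
    for job in head.get("jobs", {}):
        b, h = effective(base, job, default_level), effective(head, job, default_level)
        for scope in SCOPES:
            if LEVEL_RANK[h[scope]] > LEVEL_RANK[b[scope]]:
                findings.append(Finding("permission", job, f"{scope}: {b[scope]} -> {h[scope]}"))
    for trig in sorted((triggers(head) - triggers(base)) & PRIVILEGED_TRIGGERS):
        findings.append(Finding("trigger", "workflow", f"added privileged trigger {trig}"))
    return findings


# Constructed base: read-only token, ordinary pull_request trigger.
BASE = {
    "on": ["pull_request"],
    "permissions": {"contents": "read"},
    "jobs": {"test": {"runs-on": "ubuntu-latest", "steps": []}},
}

# Constructed PR #1: adds a release job carrying `contents: write`.
PR1 = {
    "on": ["pull_request"],
    "permissions": {"contents": "read"},
    "jobs": {
        "test": {"runs-on": "ubuntu-latest", "steps": []},
        "release": {"runs-on": "ubuntu-latest", "permissions": {"contents": "write"}, "steps": []},
    },
}

# Constructed PR #2: the same risk in a different shape, `write-all` shorthand plus
# a privileged trigger. A check keyed on the literal text `contents: write` misses it.
PR2 = {
    "on": {"pull_request_target": {}},
    "permissions": "write-all",
    "jobs": {"test": {"runs-on": "ubuntu-latest", "steps": []}},
}

if __name__ == "__main__":
    for name, head in (("PR #1", PR1), ("PR #2", PR2)):
        for f in trust_boundary_widenings(BASE, head):
            print(f"{name}: {f.kind} [{f.where}] {f.detail}")
```

Because the comparison works on effective permissions rather than on the diff text, both PRs produce findings, and the check's output for a given pair of files is the same on every run; whether those findings are worth a reviewer's attention is the separate precision question the second paragraph raises.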