Attempt 2: the scoring challenge.
Stop grading free-form answers. Fix a token sequence and ask the model to score it: the log-probability it assigns to that sequence, teacher-forced, one forward pass, no sampling. A model assigns higher probability to text it would itself produce, so for the same fixed sequence the genuine model is measurably more confident than a different one.
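As a minimal sketch of what "score it" means in practice: given one row of raw logits per scored position, take the log-softmax of each row, pick out the log-probability of the token that actually appears there, and average. The function names and the toy three-token vocabulary are illustrative assumptions, not part of any library. How you obtain the logit rows depends on your runtime.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def teacher_forced_score(logit_rows, token_ids):
    """Mean log-probability (nats/token) of a fixed token sequence.

    Assumption: logit_rows[i] holds the logits the model produced just
    before token_ids[i], i.e. its prediction for position i.
    """
    if not token_ids or len(logit_rows) != len(token_ids):
        raise ValueError("need one logit row per scored token")
    total = sum(log_softmax(row)[tok] for row, tok in zip(logit_rows, token_ids))
    return total / len(token_ids)

# Toy example with a 3-token vocabulary: one "model" is confident about
# the fixed sequence, the other is vague about it.
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
vague = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tokens = [0, 1]

gap = teacher_forced_score(confident, tokens) - teacher_forced_score(vague, tokens)
print(f"gap: {gap:+.2f} nats/token")
```

A positive gap means the first set of logits assigns the fixed sequence more probability than the second, which is the quantity the comparison relies on.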
Here are the numbers, scored on my own machines with the Qwen2.5 family:
| comparison | mean gap (nats/token) | genuine wins |
|---|---|---|
| honest floor (same model, q4 vs q8) | about 0.00 (std 0.07) | n/a |
| 1.5B impostor (2x cheaper) | +0.27 | 8 of 10 |
| 0.5B impostor (6x cheaper) | +0.66 | 10 of 10 |
The catch: one check is not enough.
That honest floor row is the important one. The same model at two quantizations drifts with a standard deviation of about 0.07 nats per token, centered on zero. The 2x-impostor signal of 0.27 is a little under four times that, and on short, low-entropy outputs the two distributions overlap. A single scoring challenge cannot reliably separate a 2x-cheaper impostor from an honest server running a different quant.
The means are clearly distinct though, so it works as an accumulating signal. With honest standard deviation around 0.07 and an impostor mean around 0.27, a running average over roughly 10 to 15 challenges separates them with confidence. So this is a slow background audit, not a one-shot test. Difficulty scales with how close the impostor is: a 6x downgrade falls out in a few checks, a 2x needs about a dozen, and a very close swap or a light quant downgrade may be impractical.
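One way to turn this into a background audit is sketched below: keep a running mean of the per-challenge gaps and flag the server once that mean sits several standard errors above zero. The z threshold, the minimum number of checks, and the sample gaps are illustrative assumptions, not measurements from the table. The rule is deliberately conservative in that it uses whichever spread is larger, the honest floor or the spread observed so far.

```python
import math
from statistics import mean, stdev

def running_audit(gaps, honest_std=0.07, z=3.0, min_checks=3):
    """Scan gaps in arrival order; return (verdict, checks_used, running_mean).

    The verdict is "suspect" as soon as the running mean is more than z
    standard errors above zero, and "no evidence" if that never happens.
    """
    min_checks = max(min_checks, 2)  # stdev needs at least two samples
    seen = []
    for g in gaps:
        seen.append(g)
        n = len(seen)
        if n < min_checks:
            continue
        # Conservative spread: the honest floor or the observed spread,
        # whichever is larger.
        spread = max(honest_std, stdev(seen))
        std_err = spread / math.sqrt(n)
        if mean(seen) > z * std_err:
            return "suspect", n, mean(seen)
    return "no evidence", len(seen), (mean(seen) if seen else 0.0)

# Made-up gaps for illustration only (nats/token per challenge).
honest_gaps = [0.05, -0.08, 0.02, -0.03, 0.07, -0.06, 0.01, -0.04, 0.06, -0.02]
impostor_gaps = [0.31, -0.05, 0.40, 0.22, 0.35, 0.10, 0.45, -0.02, 0.30, 0.28]

print(running_audit(honest_gaps))
print(running_audit(impostor_gaps))
```

Checking after every challenge inflates the false-positive rate compared with a single fixed-size test, so a real deployment would want a stricter threshold or a proper sequential test. This sketch only shows the accumulation idea.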
A gotcha that cost me an hour.
I first got nonsense numbers, about -10 nats for tokens like "of" and "is". That is close to what uniform-random guessing over the vocabulary would score (the natural log of a roughly 152k-token vocabulary is about 11.9 nats), which is absurd for some of the most predictable words in English. The cause was that in llama-cpp-python 0.3.23 the logprobs returned by the high-level create_completion are wrong. The fix is to read per-position logits straight from the context and compute the log-softmax yourself. Sanity-check any logprob pipeline against a known sentence first. English should land around 0.5 to 1.5 bits per byte under a decent model. If you see 5 bits per byte, your scorer is broken, not the model.
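The sanity check can be sketched as a small helper that converts per-token log-probabilities (in nats) into bits per byte of the original text. The helper names and the 3.0 cutoff are assumptions chosen to sit between the healthy 0.5 to 1.5 range and the clearly broken value of 5. The log-probabilities in the example are invented for illustration, not real model output.

```python
import math

def bits_per_byte(token_logprobs_nats, text):
    """Convert per-token log-probs (nats) into bits per UTF-8 byte of text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_bits = -sum(token_logprobs_nats) / math.log(2)
    return total_bits / n_bytes

def scorer_looks_broken(bpb, cutoff=3.0):
    """Flag a pipeline whose bits-per-byte on plain English is implausibly high."""
    return bpb > cutoff

sentence = "The capital of France is Paris."

# Invented per-token log-probs for a 7-token split of the sentence.
plausible = [-3.0] * 7   # a few nats per token
nonsense = [-10.0] * 7   # the kind of value a broken scorer returns

for name, lps in [("plausible", plausible), ("nonsense", nonsense)]:
    bpb = bits_per_byte(lps, sentence)
    print(f"{name}: {bpb:.2f} bits/byte, broken={scorer_looks_broken(bpb)}")
```

Measuring in bits per byte instead of nats per token makes the check independent of the tokenizer, so the same cutoff can be reused across model families.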
The honest limit.
This needs real logprob access to the model under test: open weights you serve, or a provider that exposes proper logprobs and lets you score an arbitrary sequence. Fully closed APIs that only return text are a harder problem, and I do not have a clean answer there yet. For open-weight serving, which covers most self-hosting and a good chunk of the hosted market, the scoring challenge is a usable audit.
The takeaway: you can verify an open-weight model is what it claims, but only statistically, over many checks, and the intuitive method does the opposite of what you want. I think that pattern, where the obvious metric is backwards and the real signal needs accumulation, shows up all over verification.