Can You Tell When an LLM API Swaps in a Cheaper Model?

DEV Community

Attempt 2: the scoring challenge.

Stop grading free-form answers. Fix a token sequence and ask the model to score it: the log-probability it assigns to that sequence, teacher-forced, one forward pass, no sampling. A model assigns higher probability to text it would itself produce, so for the same fixed sequence the genuine model is measurably more confident than a different one.
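As a rough sketch of what the scoring step computes, assuming you already have the per-position logits from a teacher-forced forward pass: the helper names below (`log_softmax`, `mean_logprob`, `gap`) are illustrative, not from any library, and the toy three-token vocabulary stands in for a real one.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mean_logprob(logits_per_position, token_ids):
    """Mean log-probability (nats/token) of a fixed sequence.

    logits_per_position[i] holds the logits the model produced for
    position i given the true tokens before it (teacher forcing);
    token_ids[i] is the token that actually sits at position i.
    """
    total = 0.0
    for logits, tok in zip(logits_per_position, token_ids):
        total += log_softmax(logits)[tok]
    return total / len(token_ids)

def gap(reference_logits, served_logits, token_ids):
    """Reference score minus served score; positive means the served
    model is less confident on the fixed sequence than the reference."""
    return (mean_logprob(reference_logits, token_ids)
            - mean_logprob(served_logits, token_ids))

# Toy example: a sharp "reference" model versus a flatter "served" model.
tokens = [0, 1]
reference = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
served = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
print(round(gap(reference, served, tokens), 3))
```

No sampling is involved: the sequence is fixed, so the only thing that varies between servers is the probability each one assigns to it.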

Here are the numbers, scored on my own machines with the Qwen2.5 family:

comparison	mean gap (nats/token)	genuine wins
honest floor (same model, q4 vs q8)	about 0.00 (std 0.07)	n/a
1.5B impostor (2x cheaper)	+0.27	8 of 10
0.5B impostor (6x cheaper)	+0.66	10 of 10

The catch: one check is not enough.

That honest floor row is the important one. The same model at two quantizations drifts by about 0.07 nats per token (standard deviation), centered on zero. The 2x-impostor signal of 0.27 is only about four times that, and on short, low-entropy outputs the two distributions overlap. A single scoring challenge cannot reliably separate a 2x-cheaper impostor from an honest server running a different quant.

The means are clearly distinct though, so it works as an accumulating signal. With honest standard deviation around 0.07 and an impostor mean around 0.27, a running average over roughly 10 to 15 challenges separates them with confidence. So this is a slow background audit, not a one-shot test. Difficulty scales with how close the impostor is: a 6x downgrade falls out in a few checks, a 2x needs about a dozen, and a very close swap or a light quant downgrade may be impractical.
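One way to picture the accumulation is a running mean compared against a shrinking threshold. This is a simplified sketch, not the exact procedure behind the numbers above: the function name `audit`, the cutoff `z=3.0`, and `min_checks=3` are assumptions chosen for illustration, and only the honest standard deviation of 0.07 comes from the table.

```python
import math

def audit(gaps, honest_std=0.07, z=3.0, min_checks=3):
    """Accumulate per-challenge gaps (nats/token) and flag a server once
    the running mean exceeds z standard errors of the honest floor.

    Returns (verdict, checks_used, running_mean).
    """
    total = 0.0
    for n, g in enumerate(gaps, start=1):
        total += g
        mean = total / n
        # The standard error of the mean shrinks as 1/sqrt(n).
        threshold = z * honest_std / math.sqrt(n)
        if n >= min_checks and mean > threshold:
            return "suspect", n, mean
    mean = total / len(gaps) if gaps else 0.0
    return "no evidence", len(gaps), mean

# Hypothetical gap sequences, made up for the example.
honest_gaps = [0.05, -0.06, 0.02, -0.03, 0.04, -0.01]
impostor_gaps = [0.30, -0.10, 0.40, 0.25, 0.35]

print(audit(honest_gaps))
print(audit(impostor_gaps))
```

Checking the threshold after every challenge inflates the false-alarm rate compared with a single fixed-size test, and real impostor gaps are noisier than the honest floor, which is part of why a dozen or so challenges is a safer budget than this toy run suggests. A proper sequential test would set the stopping rule more carefully.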

A gotcha that cost me an hour.

My first numbers were nonsense: about -10 nats for tokens like "of" and "is", which is about as bad as guessing uniformly at random over a roughly 150k-token vocabulary (about -12 nats). The cause was that in llama-cpp-python 0.3.23 the logprobs returned by the high-level create_completion are wrong. The fix is to read per-position logits straight from the context and compute the log-softmax yourself. Sanity-check any logprob pipeline against a known sentence first. English should land around 0.5 to 1.5 bits per byte under a decent model. If you see 5 bits per byte, your scorer is broken, not the model.
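The sanity check is a short unit conversion. Here is a minimal sketch, assuming you already have the per-token log-probabilities in nats for a known sentence; the function names are illustrative, and the vocabulary size of 151,936 is my assumption for the Qwen2.5 tokenizer rather than a figure from the measurements.

```python
import math

def bits_per_byte(token_logprobs_nats, text):
    """Convert per-token log-probs (nats) into bits per UTF-8 byte."""
    total_nats = -sum(token_logprobs_nats)
    n_bytes = len(text.encode("utf-8"))
    return total_nats / math.log(2) / n_bytes

def uniform_baseline_nats(vocab_size):
    """Log-prob per token if the model guessed uniformly at random."""
    return -math.log(vocab_size)

def looks_sane(bpb, low=0.5, high=1.5):
    """True if bits per byte falls in the rough range for English."""
    return low <= bpb <= high

# Made-up log-probs for a 12-byte sentence, for illustration only.
sentence = "The cat sat."
logprobs = [-1.2, -3.1, -2.0, -0.9]
bpb = bits_per_byte(logprobs, sentence)
print(round(bpb, 2), looks_sane(bpb))
print(round(uniform_baseline_nats(151936), 2))
```

If common function words score near the uniform baseline, suspect the scoring code before you suspect the model.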

The honest limit.

This needs real logprob access to the model under test: open weights you serve, or a provider that exposes proper logprobs and lets you score an arbitrary sequence. Fully closed APIs that only return text are a harder problem, and I do not have a clean answer there yet. For open-weight serving, which covers most self-hosting and a good chunk of the hosted market, the scoring challenge is a usable audit.

The takeaway: you can verify an open-weight model is what it claims, but only statistically, over many checks, and the intuitive method does the opposite of what you want. I think that pattern, where the obvious metric is backwards and the real signal needs accumulation, shows up all over verification.

Top comments (3)

Max Quimby

• Jun 21

The perplexity result being backwards is a genuinely good counterintuitive finding — "flag the improbable answer" rewarding the dumber model is the kind of thing you only learn by running it. The teacher-forced scoring-challenge approach is clean, and framing it as an accumulating signal rather than a one-shot test is exactly right; what you've basically built is a sequential hypothesis test, and you could likely formalize it with an SPRT to stop as soon as you hit a confidence threshold instead of fixing 10–15 challenges. Two practical thoughts: (1) the gap-per-challenge depends on entropy, so a deliberately high-entropy canary set (rare continuations, mixed languages, code) should widen the genuine-vs-impostor separation and cut the number of challenges you need; (2) the real-world snag is that most hosted APIs no longer return logprobs or support echo/teacher-forcing, so this is strongest when you control the endpoint. Have you thought about a sampled-only fallback — e.g. agreement rate against a reference distribution — for the black-box case where logprobs aren't available?

Storm Engine Technology.

• Jun 16

It seems like detecting when an LLM API swaps in a cheaper model is quite challenging. The initial approach of evaluating output quality didn't work as expected, with cheaper models producing lower perplexity scores due to more predictable text. However, a more effective method involves fixed sequence scoring challenges. By asking the model to score a fixed token sequence, you can detect discrepancies between the expected model and a cheaper substitute. This method works best when you accumulate evidence over multiple checks, making it a slow background audit rather than a one-shot test.
For instance, using Qwen2.5, a 1.5B model impostor showed an average gap of +0.27 nats/token, while a 0.5B model impostor showed +0.66 nats/token. Accumulating around 10 to 15 such checks can reliably distinguish between the models.
However, this method requires access to the model's log probabilities, which might not be available for fully closed APIs. Nonetheless, for open-weight models, this scoring challenge provides a usable audit mechanism.

Rob

• Jun 17

Exactly right. Great comment.