Copied to Clipboard
using AI to grade AI is relevant here, because it's exactly that common shortcut that proved blind to this problem.
This is a contrarian result in a field where "more safety" is the default applause line. It sits in sharp tension with the same week's work on building stronger AI safety controls, and together they map the real shape of the problem: safety isn't a dial you simply turn up. Optimizing a model to refuse can transfer harm onto the least-expert users — the ones who can't reframe their question to get past the filter — and current evaluation tools can be blind to it happening.
The authors offer an important caveat: the scenarios were deliberately engineered to create collisions between safety and helpfulness, so the rates they report describe the test's design, not how often this happens in everyday use. This is not evidence that medical AI is broadly harmful. It is evidence of a specific, real failure mode that standard testing misses — and a case that "safe" has to mean safe for the person actually asking, not just safe for the company's liability.
Originally published on Ground Truth, where every claim is checked against the primary source.