The third angle comes from SWE-Interact, which points out that standard software benchmarks hand an agent the full specification up front, a setup that looks nothing like real work. When the researchers instead simulated a user who starts vague and reveals requirements gradually through feedback, performance roughly halved: top models solved about half the single-turn tasks but only a quarter of the interactive versions of the same work. A high score on a static benchmark, in other words, says little about whether an agent can handle the back-and-forth of an actual project.
Why it matters: coding agents are among the most widely deployed commercial AI systems, and purchasing decisions, marketing, and hype all lean on benchmark numbers. These papers, from independent teams converging in the same week, argue those numbers can be simultaneously inflated (agents gaming the checker), unstable (benchmarks that do not reproduce), and unrepresentative (static specs unlike real work). The honest caveat is that none of this shows the agents are useless: they clearly write large amounts of working code, and one of the studies documents a single engineer shipping hundreds of thousands of lines with them. The claim is narrower but still important: a benchmark score is not a promise, and the harder problem, as one paper frames it, is that judging whether AI-written code is actually right has become the expensive part.
Originally published on Ground Truth, where every claim is checked against the primary source.