The results are stark. On "is this alive and what kind of animal," the vision-language baselines scored in the mid-90s; the robot models scored around 45-58%, no better than guessing. On celebrity recognition, baselines hit 99-100% while robot models fell to 38-55%. Even on basic object attributes, several robot models could not beat a coin flip on questions their own un-fine-tuned backbone answered two-thirds of the time. The single exception was color, where the robot models held up, precisely because color is directly useful when your training data is all about picking things up. As the authors put it, "VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs."
The analogy is a brilliant student who trains so hard for one specific job that they forget everything outside it. Ask them anything about their narrow task and they shine; ask them who painted the Mona Lisa and they blank, even though they knew it a year ago. The paper adds a hopeful wrinkle: probing the network layer by layer shows the correct answer often still lives in the model's middle layers. The knowledge is not fully erased, it just fails to reach the part of the network that chooses the action. The models that fared best, like Magma, were the ones trained with visual-question-answering mixed in alongside robot control.
Why this matters: robotics companies routinely market these models as broadly capable, open-world generalists. This is a rigorous, numbers-backed demonstration that a robot which looks competent at its trained tasks may have silently lost the everyday understanding you would assume it has, and standard robot benchmarks, which only measure task success, would never catch it. A home robot that cannot reliably tell whether something is alive is not a hypothetical safety concern.
The honest caveat: this measures knowledge retention, not real-world task performance, and one category (color) survived intact, so the loss is selective rather than total. It also does not prove the trade-off is unavoidable; the better-retaining models point to fixes like co-training on general questions. But it sharpens a warning the field has been circling, echoing earlier findings that world models forget what they are not actively using. Broad competence and narrow skill may not come free together, at least not yet.
Originally published on Ground Truth, where every claim is checked against the primary source.