Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias

Vittoria Dentella et al. Proc Natl Acad Sci U S A. 2023 Dec 19;120(51):e2309583120. doi: 10.1073/pnas.2309583120. Epub 2023 Dec 13.

Abstract

Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition helps the Models converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.
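The elicitation design described above can be made concrete with a short sketch. The code below is a minimal illustration and not the authors' materials: `query_model` and the prompt wording are hypothetical placeholders, and the summary measures (accuracy per condition and an overall yes-rate as a proxy for the yes-response bias) simply follow the abstract's description.

```python
# Minimal sketch of the judgment-elicitation design described in the abstract:
# 8 phenomena x 10 sentences (5 grammatical, 5 ungrammatical) x 10 repetitions
# = 800 judgments per model. `query_model` and PROMPT are hypothetical stand-ins.
from collections import defaultdict

PROMPT = "Is the following sentence grammatically correct in English? {sentence}"

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around an LM API; returns a 'yes'/'no' answer."""
    raise NotImplementedError

def elicit_judgments(model_name, stimuli, n_repetitions=10):
    """stimuli: iterable of (phenomenon, sentence, is_grammatical) triples."""
    records = []
    for phenomenon, sentence, is_grammatical in stimuli:
        for rep in range(n_repetitions):
            answer = query_model(model_name, PROMPT.format(sentence=sentence))
            says_yes = answer.strip().lower().startswith("yes")
            records.append({
                "phenomenon": phenomenon,
                "sentence": sentence,
                "condition": "grammatical" if is_grammatical else "ungrammatical",
                "repetition": rep,
                "says_yes": says_yes,
                "correct": says_yes == is_grammatical,
            })
    return records

def summarize(records):
    """Accuracy per condition plus the overall yes-rate (yes-response bias proxy)."""
    by_condition = defaultdict(list)
    for r in records:
        by_condition[r["condition"]].append(r["correct"])
    accuracy = {cond: sum(vals) / len(vals) for cond, vals in by_condition.items()}
    yes_rate = sum(r["says_yes"] for r in records) / len(records)
    return accuracy, yes_rate
```

In this balanced yes/no setup, random responding yields 0.5 accuracy per condition, which is the chance baseline against which the above-chance (grammatical) and below-chance (ungrammatical) results are read.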

Keywords: Language Models; bias; cognitive models; language.


Conflict of interest statement

Competing interests statement: The authors declare no competing interest.

Figures

Fig. 1. (A) Mean accuracy by condition and model: (A1) individual responses; (A2) preferred responses per sentence. (B) Mean accuracy by phenomenon and condition. The dashed black line indicates the mean accuracy for each phenomenon across both conditions.

Fig. 2. Response instability by model and condition. (A) Instability measured as likelihood of oscillations. (B) Instability measured as the number of deviations.

Fig. 3. The effect of repetitions on mean accuracy by model and condition. The transparent points represent the observed data; the opaque points represent the predictions of the GLMM including a three-way interaction between the factors.

Fig. 4. Response instability by model and repetitions, measured as the likelihood of oscillations.

Fig. 5. (A): (A1) Mean accuracy by type of responding agent and condition. (A2) Mean accuracy by type of responding agent and phenomenon. (B) Response instability by type of responding agent and condition: (B1) Instability measured as likelihood of oscillations; (B2) Instability measured as the number of deviations. (C) The impact of repetitions on accuracy by type of responding agent and condition. (D) The impact of repetitions on stability by type of responding agent.
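Fig. 2 names two instability measures, the likelihood of oscillations and the number of deviations. The sketch below shows one plausible way to compute them over a sentence's repeated judgments; these operationalizations are assumptions for illustration, not definitions taken from the paper.

```python
# Assumed operationalizations of the two instability measures named in Fig. 2.
# Input: the list of repeated yes/no judgments (here, 10) for a single sentence.
def oscillates(judgments: list[bool]) -> bool:
    """A sentence 'oscillates' if its repeated judgments are not all identical."""
    return len(set(judgments)) > 1

def n_deviations(judgments: list[bool]) -> int:
    """Count responses that deviate from the sentence's majority (preferred) response."""
    n_yes = sum(judgments)
    majority_is_yes = n_yes >= len(judgments) - n_yes
    return sum(1 for j in judgments if j != majority_is_yes)

# Example: 7 'yes' and 3 'no' answers -> oscillation present, 3 deviations.
example = [True] * 7 + [False] * 3
assert oscillates(example) and n_deviations(example) == 3
```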
