Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias

Vittoria Dentella et al. Proc Natl Acad Sci U S A. 2023 Dec 19;120(51):e2309583120. doi: 10.1073/pnas.2309583120. Epub 2023 Dec 13.

Abstract

Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are showcased by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we found no evidence that repetition helps the Models converge on a processing strategy that culminates in stable answers, either accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns is in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not motivated at their current stage of development.
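The elicitation design described above can be made concrete with a short sketch. The code below is a minimal illustration and not the authors' materials: `query_model` and the prompt wording are hypothetical placeholders, and the summary measures (accuracy per condition and an overall yes-rate as a proxy for the yes-response bias) simply follow the abstract's description.

```python
# Minimal sketch of the judgment-elicitation design described in the abstract:
# 8 phenomena x 10 sentences (5 grammatical, 5 ungrammatical) x 10 repetitions
# = 800 judgments per model. `query_model` and PROMPT are hypothetical stand-ins.
from collections import defaultdict

PROMPT = "Is the following sentence grammatically correct in English? {sentence}"

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around an LM API; returns a 'yes'/'no' answer."""
    raise NotImplementedError

def elicit_judgments(model_name, stimuli, n_repetitions=10):
    """stimuli: iterable of (phenomenon, sentence, is_grammatical) triples."""
    records = []
    for phenomenon, sentence, is_grammatical in stimuli:
        for rep in range(n_repetitions):
            answer = query_model(model_name, PROMPT.format(sentence=sentence))
            says_yes = answer.strip().lower().startswith("yes")
            records.append({
                "phenomenon": phenomenon,
                "sentence": sentence,
                "condition": "grammatical" if is_grammatical else "ungrammatical",
                "repetition": rep,
                "says_yes": says_yes,
                "correct": says_yes == is_grammatical,
            })
    return records

def summarize(records):
    """Accuracy per condition plus the overall yes-rate (yes-response bias proxy)."""
    by_condition = defaultdict(list)
    for r in records:
        by_condition[r["condition"]].append(r["correct"])
    accuracy = {cond: sum(vals) / len(vals) for cond, vals in by_condition.items()}
    yes_rate = sum(r["says_yes"] for r in records) / len(records)
    return accuracy, yes_rate
```

In this balanced yes/no setup, random responding yields 0.5 accuracy per condition, which is the chance baseline against which the above-chance (grammatical) and below-chance (ungrammatical) results are read.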

Keywords: Language Models; bias; cognitive models; language.


Conflict of interest statement

Competing interests statement: The authors declare no competing interest.

Figures

Fig. 1. (A) Mean accuracy by condition and model: (A1) individual responses; (A2) preferred responses per sentence. (B) Mean accuracy by phenomenon and condition. The dashed black line indicates the mean accuracy for each phenomenon across both conditions.

Fig. 2. Response instability by model and condition. (A) Instability measured as likelihood of oscillations. (B) Instability measured as the number of deviations.

Fig. 3. The effect of repetitions on mean accuracy by model and condition. The transparent points represent the observed data; the opaque points represent the predictions of the GLMM including a three-way interaction between the factors.

Fig. 4. Response instability by model and repetitions, measured as the likelihood of oscillations.

Fig. 5. (A): (A1) Mean accuracy by type of responding agent and condition. (A2) Mean accuracy by type of responding agent and phenomenon. (B) Response instability by type of responding agent and condition: (B1) Instability measured as likelihood of oscillations; (B2) Instability measured as the number of deviations. (C) The impact of repetitions on accuracy by type of responding agent and condition. (D) The impact of repetitions on stability by type of responding agent.
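Fig. 2 names two instability measures, the likelihood of oscillations and the number of deviations. The sketch below shows one plausible way to compute them over a sentence's repeated judgments; these operationalizations are assumptions for illustration, not definitions taken from the paper.

```python
# Assumed operationalizations of the two instability measures named in Fig. 2.
# Input: the list of repeated yes/no judgments (here, 10) for a single sentence.
def oscillates(judgments: list[bool]) -> bool:
    """A sentence 'oscillates' if its repeated judgments are not all identical."""
    return len(set(judgments)) > 1

def n_deviations(judgments: list[bool]) -> int:
    """Count responses that deviate from the sentence's majority (preferred) response."""
    n_yes = sum(judgments)
    majority_is_yes = n_yes >= len(judgments) - n_yes
    return sum(1 for j in judgments if j != majority_is_yes)

# Example: 7 'yes' and 3 'no' answers -> oscillation present, 3 deviations.
example = [True] * 7 + [False] * 3
assert oscillates(example) and n_deviations(example) == 3
```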
