Neural Networks as Cognitive Models of the Processing of Syntactic Constraints

Suhas Arehalli et al. Open Mind (Camb). 2024 May 6;8:558-614. doi: 10.1162/opmi_a_00137. eCollection 2024.

Abstract

Languages are governed by syntactic constraints: structural rules that determine which sentences are grammatical in the language. In English, one such constraint is subject-verb agreement, which dictates that the number of a verb must match the number of its corresponding subject: "the dogs run", but "the dog runs". While this constraint appears to be simple, in practice speakers make agreement errors, particularly when a noun phrase near the verb differs in number from the subject (for example, a speaker might produce the ungrammatical sentence "the key to the cabinets are rusty"). This phenomenon, referred to as agreement attraction, is sensitive to a wide range of properties of the sentence; no single existing model is able to generate predictions for the wide variety of materials studied in the human experimental literature. We explore the viability of neural network language models (broad-coverage systems trained to predict the next word in a corpus) as a framework for addressing this limitation. We analyze the agreement errors made by Long Short-Term Memory (LSTM) networks and compare them to those of humans. The models successfully simulate certain results, such as the so-called number asymmetry and the difference in attraction strength between grammatical and ungrammatical sentences, but fail to simulate others, such as the effect of syntactic distance or notional (conceptual) number. We further evaluate networks trained with explicit syntactic supervision, and find that this form of supervision does not always lead to more human-like syntactic behavior. Finally, we show that the corpus used to train a network significantly affects the pattern of agreement errors produced by the network, and discuss the strengths and limitations of neural networks as a tool for understanding human syntactic processing.

Keywords: agreement attraction; computational modeling; neural networks; psycholinguistics; syntactic processing.


Conflict of interest statement

Competing Interests: The authors declare no conflicts of interest.

Figures

Figure 1.
In our language modeling setup, each word is mapped to a word vector. Each of those representations is combined with a representation of all previous words (h_{i-1}) using a recurrent neural network model (RNN) to create a representation h_i for all words up to word i. To generate a prediction for word i, h_i is fed into a linear decoder (L) to generate a distribution over word i. During training, model weights (which determine RNN and L) are adjusted to maximize the probability of the word that actually occurred in the sentence at position i.
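The recurrence-then-decode pipeline in this caption can be sketched in a few lines. The sketch below is an illustrative stand-in only: it uses a plain Elman-style recurrent step with random, untrained weights and toy dimensions, rather than the trained LSTMs the paper actually studies, and all parameter names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 10-word vocabulary, 4-dim embeddings, 8-dim hidden state.
V, d_emb, d_hid = 10, 4, 8

# Randomly initialized parameters; training would adjust these.
E = rng.normal(size=(V, d_emb))        # word embedding matrix
W_x = rng.normal(size=(d_hid, d_emb))  # input-to-hidden weights
W_h = rng.normal(size=(d_hid, d_hid))  # hidden-to-hidden weights
b = np.zeros(d_hid)
L = rng.normal(size=(V, d_hid))        # linear decoder

def rnn_step(word_id, h_prev):
    """Combine the current word's vector with h_{i-1} to produce h_i."""
    x = E[word_id]
    return np.tanh(W_x @ x + W_h @ h_prev + b)

def next_word_distribution(h):
    """Feed h_i through the decoder L and softmax to get a distribution over words."""
    logits = L @ h
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Run the recurrence over a short "sentence" of word ids, then predict.
h = np.zeros(d_hid)
for w in [3, 1, 7]:
    h = rnn_step(w, h)
p = next_word_distribution(h)
```

During training, the cross-entropy loss -log p[gold_word] would be minimized, which is equivalent to maximizing the probability of the word that actually occurred.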
Figure 2.
An example sequence of CCG supertags for the sentence The key to the cabinets is rusty. Each supertag encodes how the corresponding word composes with its syntactic neighborhood. The label Y/X denotes that the word it labels merges with a constituent of type X on its right to form a constituent of type Y (as with the and key), and Y\X denotes the same, but with the constituent of type X on its left (as with to the cabinets and the key). To predict supertags successfully, models must learn to represent something akin to the underlying structure of the sentence. In many cases, knowing the sequence of supertags makes it possible to deterministically reconstruct the full parse of the sentence.
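The slash notation in the caption can be made concrete with a toy combinator. This sketch handles only flat (non-nested) categories; real CCG categories nest (e.g. (S\NP)/NP), which it deliberately ignores, and the category strings are illustrative.

```python
def combine(left, right):
    """Toy CCG application for flat categories.

    Forward application:  Y/X + X  -> Y  (the functor takes X on its right)
    Backward application: X + Y\\X -> Y  (the functor takes X on its left)
    Returns None if the two categories do not compose.
    """
    if "/" in left:
        y, x = left.rsplit("/", 1)
        if x == right:
            return y
    if "\\" in right:
        y, x = right.rsplit("\\", 1)
        if x == left:
            return y
    return None

# "the" (NP/N) merges with "key" (N) on its right to form an NP.
np_result = combine("NP/N", "N")
# A subject NP merges with a verb phrase (S\NP) on its right edge to form S.
s_result = combine("NP", "S\\NP")
```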
Figure 3.
An outline of the architecture used for the LM+CCG models. Using the internal representation h_5 constructed by an RNN encoder, classifier L1 generates a probability distribution over possible next words w* and classifier L2 generates a probability distribution over possible supertags c* for the current word.
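The shared-encoder, two-decoder setup can be illustrated as follows. The weights, dimensions, and gold labels below are random stand-ins, not the paper's trained parameters, and the joint loss shown is one standard way to combine the two objectives, offered here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
d_hid, n_words, n_tags = 8, 10, 5

h = rng.normal(size=d_hid)               # shared encoder state for the current position
L1 = rng.normal(size=(n_words, d_hid))   # next-word decoder
L2 = rng.normal(size=(n_tags, d_hid))    # supertag decoder

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p_word = softmax(L1 @ h)  # distribution over possible next words w*
p_tag = softmax(L2 @ h)   # distribution over possible supertags c*

# In training, the two cross-entropy losses are summed, so gradients from both
# objectives flow back into the shared encoder that produced h.
gold_word, gold_tag = 3, 2
joint_loss = -np.log(p_word[gold_word]) - np.log(p_tag[gold_tag])
```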
Figure 4.
To simulate a sentence completion experiment, a language model is given each preamble as input, producing a probability distribution over the following word (a). The probabilities of a candidate singular and plural verb are extracted from this distribution (b) and renormalized (c); this new distribution is taken to represent the probability with which the model would produce a singular or plural verb.
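The extract-and-renormalize procedure (steps b and c) amounts to conditioning the next-word distribution on the choice being one of the two candidate verb forms. A minimal sketch with a made-up toy distribution (the vocabulary and probabilities are invented for illustration):

```python
import numpy as np

# Hypothetical next-word distribution after a preamble such as
# "The key to the cabinets ..." (step a).
vocab = ["is", "are", "was", "were", "rusty", "the"]
p_next = np.array([0.30, 0.20, 0.05, 0.05, 0.10, 0.30])

# (b) Extract the probabilities of the candidate singular and plural verbs.
p_sing = p_next[vocab.index("is")]
p_plur = p_next[vocab.index("are")]

# (c) Renormalize so the two candidates sum to one; this is taken as the
# probability that the model would produce singular vs. plural agreement.
p_singular_agreement = p_sing / (p_sing + p_plur)
p_plural_agreement = p_plur / (p_sing + p_plur)
print(p_singular_agreement)  # prints 0.6
```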
Figure 5.
Human and simulation results for Bock and Cutting (1992). Vertical bars represent the size of the attraction effect: the difference between the subject-attractor number match condition (the lower, circular endpoints) and mismatch condition (the higher, square endpoints). Error bars represent standard errors across the five randomly initialized models trained for each model architecture and training set. If the models simulate the relevant result from Bock and Cutting (1992), the attraction effect in RCs (the length of the solid red bar) is smaller than that in PPs (the length of the dashed blue-green bar). This pattern is reversed in LM-Only models trained on the WSJ Corpus, and no significant difference is found between modifier types in all other models.
Figure 6.
A simplified syntactic representation of Example 11. Even though the first attractor, the president(s), is more distant from the eventual position of the verb (within the T′) than the second attractor, the company(s), it is closer to the verb in the syntactic structure: fewer nodes need to be crossed to reach T′ from president(s).
Figure 7.
Human and simulation results for Franck et al. (2002). Vertical bars represent the size of the attraction effect: the difference between the subject-attractor number match condition (the lower, square endpoints) and mismatch condition (the higher, circular endpoints). These attraction effects are shown for the syntactically closer attractor (to the left of each facet) and the linearly closer attractor (to the right of each facet), marginalizing over the condition of the other attractor. Error bars for the LSTMs represent standard errors across the five randomly initialized models trained for each model training objective and training set. Crucially, in humans, the attraction effect from syntactically closer attractors is greater than that of linearly closer attractors. The reverse is true for all of the models with the exception of GPT-2.
Figure 8.
Human and simulation results for Haskell and MacDonald (2005). Vertical bars represent the size of the linear distance effect: the difference between plural agreement rates when the singular subject is closer to the verb position (the square endpoints) and when the plural subject is closer to the verb position (the circular endpoints). Error bars represent standard errors across the five randomly initialized models trained for each model architecture and training set. The size of the linear distance effect is represented by the length of the bar (all models had higher rates of plural agreement when the noun closer to the verb was plural than when it was singular). While all of the models exhibited some linear distance effect, the magnitude of the effect in humans was much larger than in any of the models.
Figure 9.
Human and simulation results for Humphreys and Bock (2005). Endpoints represent the rate of plural agreement in the distributive-biased condition (circular endpoints) or the collective-biased condition (square endpoints). Error bars represent standard errors across the five randomly initialized models trained for each model architecture and training set. In humans, Humphreys and Bock (2005) observed higher rates of plural agreement when the reading of the collective subject was biased toward a distributive reading. We observe no such difference in any of the models’ results.
Figure 10.
Word-by-word surprisals from our simulations and corresponding reading times from Exp. 1 of Parker and An (2018). Error bars are standard errors. Since effects in self-paced reading typically spill over into the reading times of the next few words, we provide two additional words for the human results. The relevant effect is found at unhappy in the human data, with the attraction effect in the oblique argument condition (the difference between dashed lines) being significantly larger than the attraction effect in the core argument condition (the difference between solid lines). We see no such difference in models other than GPT-2.
Figure 11.
Surprisals for models in our simulation of Exp. 3 of Wagers et al. (2009) at the verb praise(s), where the grammaticality of the agreement relation within the RC becomes clear, compared to the human data from that experiment (right). Error bars are standard errors. We see a grammaticality asymmetry in both humans and models, reflected in the fact that attraction in ungrammatical sentences (the difference between the dashed lines) is stronger than in grammatical sentences (the difference between the solid lines).
Figure 12.
Example (simplified) syntactic trees corresponding to the PP and RC conditions in Bock and Cutting (1992). Crucially, the attractor NP is embedded more deeply in the subject’s structure in the RC-modifier condition (12b) than in the PP-modifier condition (12a), resulting in a longer syntactic distance from the attractor to the inflected verb’s position.
Figure 13.
The language modeling and CCG supertagging losses over the test set of one of our LM+CCG models with the output of one neuron in the final layer set to 0. Each dot represents the performance of the model ablating a particular final-layer neuron. Dashed lines represent the model’s performance with no neurons ablated. Lower losses indicate better performance.
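The ablation procedure behind this figure can be sketched as follows. The activations, decoder weights, and test tokens below are random stand-ins for the model's actual values, only a single (language-modeling) loss is computed rather than both losses, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d_hid, V, n_tokens = 8, 10, 20

W_out = rng.normal(size=(V, d_hid))       # decoder weights (illustrative)
H = rng.normal(size=(n_tokens, d_hid))    # final-layer activations on a toy test set
gold = rng.integers(0, V, size=n_tokens)  # the words that actually occurred

def mean_ce_loss(acts):
    """Mean cross-entropy of the decoder's predictions over the test tokens."""
    logits = acts @ W_out.T
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(n_tokens), gold].mean()

baseline = mean_ce_loss(H)  # performance with no neurons ablated (the dashed line)

ablated_losses = []
for unit in range(d_hid):
    H_ablated = H.copy()
    H_ablated[:, unit] = 0.0              # set one final-layer neuron's output to 0
    ablated_losses.append(mean_ce_loss(H_ablated))  # one dot per ablated neuron
```

Comparing each ablated loss to the baseline shows how much the model's performance depends on that individual neuron.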
Figure A1.
Error rates from our simulations of Bock and Cutting (1992) averaging over 557 singular and plural verb pairs extracted from the WSJ Corpus.
Figure A2.
Agreement Attraction effects (Subject-Attractor Mismatch minus Match Error Rates) from our simulations of Bock and Cutting (1992) for each of the 557 singular and plural verb pairs extracted from the WSJ Corpus.
Figure D1.
Word-by-word surprisals for models in our simulation of grammatical materials from Parker and An (2018). Error bars are standard errors. Since models were given no context prior to the first word, no surprisal is given for the first word of the sentence (The). Since near only appears in the oblique argument condition, no surprisal is provided for the token in the core argument condition. The critical region here is at the verb was/were, where the grammaticality of the agreement relation becomes clear. If an attraction effect manifests in grammatical sentences, surprisal will be higher in the mismatch condition than in the match condition.
Figure D2.
Word-by-word surprisals for models in our simulation of ungrammatical sentences from Parker and An (2018). Error bars are standard errors. Since models were given no context prior to the first word, no surprisal is given for the first word of the sentence (The). Since near only appears in the oblique argument condition, no surprisal is provided for the token in the core argument condition. The critical region here is at the verb was/were, where the grammaticality of the agreement relation becomes clear. If such an effect manifests in ungrammatical sentences, surprisal will be lower in the mismatch condition than in the match condition.
Figure D3.
Word-by-word surprisals for models in our simulation of sentences with a singular subject from Wagers et al. (2009). Error bars are standard errors. Since models were given no context prior to the first word, no surprisal is given for the first word of the sentence (The). The critical region here is at the verb praise(s), where the grammaticality of the agreement relation becomes clear. If an attraction effect manifests in grammatical sentences, surprisal will be higher in the mismatch condition than in the match condition. If such an effect manifests in ungrammatical sentences, surprisal will be lower in the mismatch condition than in the match condition.
Figure D4.
Word-by-word surprisals for models in our simulation of sentences with a plural subject from Wagers et al. (2009). Error bars are standard errors. Since models were given no context prior to the first word, no surprisal is given for the first word of the sentence (The). The critical region here is at the verb praise(s), where the grammaticality of the agreement relation becomes clear. If an attraction effect manifests in grammatical sentences, surprisal will be higher in the mismatch condition than in the match condition. If such an effect manifests in ungrammatical sentences, surprisal will be lower in the mismatch condition than in the match condition.

References

    1. Badecker, W., & Kuminiak, F. (2007). Morphology, agreement and working memory retrieval in sentence production: Evidence from gender and case in Slovak. Journal of Memory and Language, 56(1), 65–85. 10.1016/j.jml.2006.08.004 - DOI
    2. Bangalore, S., & Joshi, A. K. (1999). Supertagging: An approach to almost parsing. Computational Linguistics, 25(2), 237–265.
    3. Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. In Jurafsky D., Chai J., Schluter N., & Tetreault J. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5185–5198). Association for Computational Linguistics. 10.18653/v1/2020.acl-main.463 - DOI
    4. Bhatt, G., Bansal, H., Singh, R., & Agarwal, S. (2020). How much complexity does an RNN architecture need to learn syntax-sensitive dependencies? In Rijhwani S., Liu J., Wang Y., & Dror R. (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (pp. 244–254). Association for Computational Linguistics. 10.18653/v1/2020.acl-srw.33 - DOI
    5. Bock, K., & Cutting, J. C. (1992). Regulating mental energy: Performance units in language production. Journal of Memory and Language, 31(1), 99–127. 10.1016/0749-596X(92)90007-K - DOI
