Evaluating Syntactic Abilities of Language Models

December 2, 2021

Posted by Jason Wei, AI Resident and Dan Garrette, Research Scientist, Google Research

In recent years, pre-trained language models, such as BERT and GPT-3, have seen widespread use in natural language processing (NLP). By training on large volumes of text, language models acquire broad knowledge about the world, achieving strong performance on various NLP benchmarks. These models, however, are often opaque in that it may not be clear why they perform so well, which limits further hypothesis-driven improvement of the models. Hence, a new line of scientific inquiry has arisen: what linguistic knowledge is contained in these models?

While there are many types of linguistic knowledge that one may want to investigate, a topic that provides a strong basis for analysis is the subject–verb agreement grammar rule in English, which requires that the grammatical number of a verb agree with that of the subject. For example, the sentence “The dogs run.” is grammatical because “dogs” and “run” are both plural, but “The dogs runs.” is ungrammatical because “runs” is a singular verb.

One framework for assessing the linguistic knowledge of a language model is targeted syntactic evaluation (TSE), in which minimally different pairs of sentences, one grammatical and one ungrammatical, are shown to a model, and the model must determine which one is grammatical. TSE can be used to test knowledge of the English subject–verb agreement rule by having the model judge between two versions of the same sentence: one where a particular verb is written in its singular form, and the other in which the verb is written in its plural form.

With the above context, in “Frequency Effects on Syntactic Rule-Learning in Transformers”, published at EMNLP 2021, we investigated how a BERT model’s ability to correctly apply the English subject–verb agreement rule is affected by the number of times the words are seen by the model during pre-training. To test specific conditions, we pre-trained BERT models from scratch using carefully controlled datasets. We found that BERT achieves good performance on subject–verb pairs that do not appear together in the pre-training data, which indicates that it does learn to apply subject–verb agreement. However, the model tends to predict the incorrect form when it is much more frequent than the correct form, indicating that BERT does not treat grammatical agreement as a rule that must be followed. These results help us to better understand the strengths and limitations of pre-trained language models.

Prior Work

Previous work used TSE to measure English subject–verb agreement ability in a BERT model. In this setup, BERT performs a fill-in-the-blank task (e.g., “the dog _ across the park”) by assigning probabilities to both the singular and plural forms of a given verb (e.g., “runs” and “run”). If the model has correctly learned to apply the subject–verb agreement rule, then it should consistently assign higher probabilities to the verb forms that make the sentences grammatically correct.

This previous work evaluated BERT using both natural sentences (drawn from Wikipedia) and nonce sentences, which are artificially constructed to be grammatically valid but semantically nonsensical, such as Noam Chomsky’s famous example “colorless green ideas sleep furiously”. Nonce sentences are useful when testing syntactic abilities because the model cannot just fall back on superficial corpus statistics: for example, while “dogs run” is much more common than “dogs runs”, “dogs publish” and “dogs publishes” will both be very rare, so a model is not likely to have simply memorized the fact that one of them is more likely than the other.

BERT achieves an accuracy of more than 80% on nonce sentences (far better than the random-chance baseline of 50%), which was taken as evidence that the model had learned to apply the subject–verb agreement rule. In our paper, we went beyond this previous work by pre-training BERT models under specific data conditions, allowing us to dig deeper into these results to see how certain patterns in the pre-training data affect performance.

Unseen Subject–Verb Pairs

We first looked at how well the model performs on subject–verb pairs that were seen during pre-training, versus examples in which the subject and verb were never seen together in the same sentence:

BERT’s error rate on natural and nonce evaluation sentences, stratified by whether a particular subject–verb (SV) pair was seen in the same sentence during training or not. BERT’s performance on unseen SV pairs is far better than simple heuristics such as picking the more frequent verb or picking the more frequent SV pair.

BERT’s error rate increases slightly for unseen subject–verb (SV) pairs, for both natural and nonce evaluation sentences, but it is still much better than naïve heuristics, such as picking the verb form that occurred more often in the pre-training data or picking the verb form that occurred more frequently with the subject noun. This tells us that BERT is not just reflecting back the things that it sees during pre-training: making decisions based on more than just raw frequencies and generalizing to novel subject–verb pairs are indications that the model has learned to apply some underlying rule concerning subject–verb agreement.

Frequency of Verbs

Next, we went beyond just seen versus unseen, and examined how the frequency of a word affects BERT’s ability to use it correctly with the subject–verb agreement rule. For this study, we chose a set of 60 verbs, and then created several versions of the pre-training data, each engineered to contain the 60 verbs at a specific frequency, ensuring that the singular and plural forms appeared the same number of times. We then trained BERT models from these different datasets and evaluated them on the subject–verb agreement task:

BERT’s ability to follow the subject–verb agreement rule depends on the frequency of verbs in the training set.

These results indicate that although BERT is able to model the subject–verb agreement rule, it needs to see a verb about 100 times before it can reliably use it with the rule.

Relative Frequency Between Verb Forms

Finally, we wanted to understand how the relative frequencies of the singular and plural forms of a verb affect BERT’s predictions. For example, if one form of the verb (e.g., “combat”) appeared in the pre-training data much more frequently than the other verb form (e.g., “combats”), then BERT might be more likely to assign a high probability to the more frequent form, even when it is grammatically incorrect. To evaluate this, we again used the same 60 verbs, but this time we created manipulated versions of the pre-training data where the frequency ratio between verb forms varied from 1:1 to 100:1. The figure below shows BERT’s performance for these varying levels of frequency imbalance:

As the frequency ratio between verb forms in training data becomes more imbalanced, BERT’s ability to use those verbs grammatically decreases.

These results show that BERT achieves good accuracy at predicting the correct verb form when the two forms are seen the same number of times during pre-training, but the results become worse as the imbalance between the frequencies increases. This implies that even though BERT has learned how to apply subject–verb agreement, it does not necessarily use it as a “rule”, instead preferring to predict high-frequency words regardless of whether they violate the subject–verb agreement constraint.


Using TSE to evaluate the performance of BERT reveals its linguistic abilities on syntactic tasks. Moreover, studying its syntactic ability in relation to how often words appear in the training dataset reveals the ways that BERT handles competing priorities — it knows that subjects and verbs should agree and that high frequency words are more likely, but doesn’t understand that agreement is a rule that must be followed and that the frequency is only a preference. We hope this work provides new insight into how language models reflect properties of the datasets on which they are trained.


It was a privilege to collaborate with Tal Linzen and Ellie Pavlick on this project.