Jeremy Cole

Jeremy Cole is a Software Engineer at Google AI Language. He previously completed his Ph.D. at the Pennsylvania State University, advised by David Reitter. His dissertation focused on the role of working memory in language production processes. He presently works in the area of executable semantic parsing. He has a broad interest in combining rule-based and neural approaches, as well as in approaches inspired by human cognition.
Authored Publications
    NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 (to appear)
    Neural document rerankers are extremely effective in terms of accuracy. However, the best models require dedicated hardware for serving, which is costly and often not feasible. To avoid this serving-time requirement, we present a method of capturing up to 86% of the gains of a Transformer cross-attention model with a lexicalized scoring function that only requires 10⁻⁶% of the Transformer's FLOPs per document and can be served using commodity CPUs. When combined with a BM25 retriever, this approach matches the quality of a state-of-the-art dual encoder retriever that still requires an accelerator for query encoding. We introduce NAIL (Non-Autoregressive Indexing with Language models) as a model architecture that is compatible with recent encoder-decoder and decoder-only large language models, such as T5, GPT-3 and PaLM. This model architecture can leverage existing pre-trained checkpoints and can be fine-tuned to efficiently construct document representations that do not require neural processing of queries.
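    The serving-time side of this idea can be pictured with a small sketch: per-document term weights are produced offline by a neural model, and at query time a document is scored by simply summing the weights of the query's terms, with no neural processing of the query. The weights, documents, and query below are invented for illustration; this is not the NAIL implementation.

```python
# Toy sketch of serving-time scoring with precomputed lexical weights.
# In NAIL-like setups, the per-document weights would be produced offline
# by a neural model; here they are hard-coded for illustration.
from typing import Dict, List


def score(query_terms: List[str], doc_weights: Dict[str, float]) -> float:
    """Score a document by summing its precomputed weights for the query terms."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


# Hypothetical index: one sparse weight vector per document, built ahead of time.
index = {
    "doc1": {"neural": 1.2, "reranker": 2.1, "accuracy": 0.7},
    "doc2": {"cpu": 1.5, "serving": 1.8, "cost": 0.9},
}

query = ["serving", "cost", "cpu"]
ranked = sorted(index, key=lambda d: score(query, index[d]), reverse=True)
print(ranked)  # ['doc2', 'doc1']
```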
    DIFFQG: Generating Questions to Summarize Factual Changes
    Palak Jain
    Michael Zhang
    Eunsol Choi
    Bhuwan Dhingra
    European Chapter of the Association for Computational Linguistics (EACL), 2023
    Question generation has been emerging as a new method to improve QA systems and to represent factual information in text. However, despite the surge of new work on the topic, there is still no obvious method to evaluate such systems. Here we present DiffQG, a method to evaluate the precision and recall of question generation systems. DiffQG consists of expert-labeled annotations, focusing on the particularly challenging task of generating questions from similar pieces of text. Given an edit to a Wikipedia passage and a noun phrase, annotators wrote questions that are answered by one passage but answered differently or not at all by the other. These questions are intended to be both unambiguous and information-seeking, pushing the bounds of current question generation systems' capabilities. Moreover, as annotators also marked when no such question exists, the dataset serves as a new evaluation for difference detection, a task that likewise lacks evaluations as diverse as DiffQG. We hope that this dataset will be of value to researchers as they seek to improve such systems for a variety of purposes.
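    A minimal sketch of the precision/recall evaluation described above, assuming a simple normalized-string match as a stand-in for whatever matching criterion DiffQG actually uses; the example questions are invented.

```python
# Sketch: precision/recall of generated questions against expert-annotated ones.
# Matching here is naive string normalization, used only for illustration.
def normalize(q: str) -> str:
    return " ".join(q.lower().strip("?!. ").split())


def precision_recall(generated, reference):
    gen = {normalize(q) for q in generated}
    ref = {normalize(q) for q in reference}
    matched = gen & ref
    precision = len(matched) / len(gen) if gen else 1.0  # empty output is vacuously precise
    recall = len(matched) / len(ref) if ref else 1.0     # covers "no question exists" cases
    return precision, recall


gen = ["Which team won the 2022 final?", "Who wrote the passage?"]
ref = ["which team won the 2022 final"]
print(precision_recall(gen, ref))  # (0.5, 1.0)
```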
    Language models (LMs) trained on raw text have no access to the real physical world. Gordon and Van Durme (2013) argue that they therefore suffer from reporting bias: texts rarely report commonsensical facts about the world and more frequently describe non-commonsensical facts and events. If LMs naively overfit to the co-occurrence statistics of their training corpora, they learn a biased view of the physical world. While prior studies have repeatedly verified that LMs of smaller scale (e.g., RoBERTa, GPT-2) amplify reporting bias, it remains unknown whether such trends continue when models are scaled up. We investigate reporting bias in larger language models (LLMs) such as PaLM and GPT-3. Specifically, we query LLMs for the colour of objects, using colour as a representative property for visual commonsense. Surprisingly, we find that LLMs significantly outperform smaller LMs on answering queries about an object's typical colour. We also find that LLMs' predictions deviate from corpus co-occurrence statistics induced from resources such as Google Books Ngram and are closer to human judgement. We believe this serves as evidence that larger LMs can overcome reporting bias, rather than showing an inverse scaling function as previously suggested.
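    The comparison described above can be illustrated with a toy sketch: a co-occurrence baseline picks the colour that most often appears next to an object in a corpus, and its choice is checked against a human-judged typical colour; an LLM's answers would be scored the same way. All counts and judgements below are invented for illustration.

```python
# Toy illustration of reporting bias: corpus co-occurrence vs. human judgement.
# Counts and judgements are fabricated for the example.
cooccurrence_counts = {            # e.g. from an n-gram corpus
    "banana": {"yellow": 120, "green": 340, "black": 60},
    "snow": {"white": 900, "yellow": 15},
}
human_typical = {"banana": "yellow", "snow": "white"}


def corpus_prediction(obj: str) -> str:
    """Pick the colour that co-occurs most often with the object in the corpus."""
    counts = cooccurrence_counts[obj]
    return max(counts, key=counts.get)


agreement = sum(corpus_prediction(o) == human_typical[o] for o in human_typical)
print(f"corpus baseline agrees with humans on {agreement}/{len(human_typical)} objects")
# An LLM's answers to "What colour is a banana?" would be scored against
# human_typical in the same way.
```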
    By nature of the cost and time required to train Large Language Models (LLMs), the knowledge embedded within them is usually frozen at the moment their training data is collected. As a result, LLMs have been shown to suffer from diachronic degradation. The in-context learning paradigm can provide a workaround for this limitation by supplying relevant information at inference time. We introduce a new benchmark to evaluate LLMs on one particular but critical aspect of diachronic change: language acquisition. To that end, we rewrite Winograd-style co-reference resolution problems by replacing a word with a new synthetic but plausible English word. The meaning of the word is given to the model in the prompt via a dictionary definition. We show that the accuracy of LLMs on our benchmark decreases radically compared to the original Winograd tasks, and we believe this serves as a measure of progress for future models.
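    A hypothetical sketch of the prompt construction described above: a Winograd-style sentence whose key word is replaced by a made-up English word, with a dictionary-style definition supplied in the prompt. The word "florn" and the template wording are assumptions for illustration, not the benchmark's actual format.

```python
# Sketch of a Winograd-style item with a synthetic word defined in the prompt.
# The word, definition, and template are invented for illustration.
definition = "florn (noun): a container used for carrying liquids."
sentence = ("The delivery man poured the water from the bottle into the "
            "florn until it was empty. What was empty?")

prompt = f"Definition: {definition}\n\n{sentence}\nAnswer:"
print(prompt)
# The model's answer ("the bottle") is compared against the gold referent,
# exactly as in the original Winograd co-reference setup.
```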
    It is only a matter of time before facts become out of date: from the name of the POTUS to the basketball team LeBron James plays for. This continuously limits the usefulness of previously collected datasets and of the language models (LMs) trained on them. The problem is exacerbated when LMs are used in the closed-book question answering setting, where the pretraining data must contain the facts for the model to remember within its fixed parameters. A frequent paradigm is to update or refresh the dataset every so often and then retrain models on the new data: this is costly, but does it work? In this paper, we introduce a diagnostic dataset for probing LMs for factual knowledge that changes over time. Using it, we show that models trained only on the most recent slice of data perform worse on questions about the past than models trained on data sampled uniformly across time, while performing better on current and future questions. Moreover, we propose jointly modeling text with the time it was created and show that this improves memorization of previous facts, as well as reasoning about the uncertainty around future facts. We also show that models trained with temporal context allow for efficient refreshes as new data arrives, without the need to retrain from scratch.
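    One way to picture "jointly modeling text with the time it was created" is to prepend the document's timestamp to the input string, as in the minimal sketch below; the exact prefix format is an assumption made for illustration.

```python
# Minimal sketch of time-conditioned inputs: prepend the creation year so the
# model can condition on it during training and querying. Prefix format is an
# assumption, not the paper's exact scheme.
def add_time_prefix(text: str, year: int) -> str:
    return f"year: {year} text: {text}"


train_example = add_time_prefix("The president of the US is Barack Obama.", 2013)
query = add_time_prefix("Who is the president of the US?", 2021)
print(train_example)
print(query)
```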
    This paper examines the effects of working memory size on incremental grammatical encoding during language production. Our experiment tests different variants of a computational-cognitive model that combines an empirically validated framework of general cognition, ACT-R, with a linguistic theory, Combinatory Categorial Grammar. The model is induced from a corpus of spoken dialogue. This methodology facilitates comparison of different strategies and working memory capacities according to the similarity of the model's produced sentences to the corpus sentences. The experiment shows that while having more working memory available improves performance, using less working memory during realization does as well, even after controlling for sentence length. Sentences realized with a more incremental strategy also appear to track the naturalistic data more closely. As high incrementality is correlated with low working memory usage, this study offers a possible mechanism by which syntactic incrementality can be explained. Finally, this paper proposes a multi-disciplinary modeling- and simulation-based approach to empirical psycholinguistic inquiry.
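    The evaluation idea (comparing model-produced realizations to the corpus sentences the model was induced from) can be sketched roughly as below, with a simple token-overlap measure standing in for the study's actual similarity metric; the sentences and condition labels are invented.

```python
# Rough sketch: score realizations produced under different model settings by
# their similarity to the original corpus sentence. Token-overlap (Jaccard) is
# a stand-in metric; sentences and condition names are invented.
def overlap(produced: str, reference: str) -> float:
    p, r = set(produced.lower().split()), set(reference.lower().split())
    return len(p & r) / len(p | r) if p | r else 1.0


reference = "well i think we could meet on tuesday"
realizations = {
    "high_capacity": "i think we could meet on tuesday",
    "low_capacity": "we could meet tuesday i think",
}
for condition, sentence in realizations.items():
    print(condition, round(overlap(sentence, reference), 2))
```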