Michael Collins
Michael Collins's research interests are in natural language processing and machine learning, with a focus on problems such as statistical parsing, structured prediction in machine learning, and applications including machine translation, dialog systems, and speech recognition. Michael is a fellow of the Association for Computational Linguistics, and has received various awards including a Sloan fellowship, an NSF CAREER award, and best paper awards at EMNLP (2002, 2004 and 2010), UAI (2004 and 2005), and CoNLL (2008).
Authored Publications
Measuring Attribution in Natural Language Generation Models
Iulia Turc
Computational Linguistics, vol. 49 (2023), pp. 777-840
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline that allows annotators to evaluate model output appropriately according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.
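As a concrete illustration, here is a minimal Python sketch of how the two-stage AIS decision could be recorded and aggregated. The field names are hypothetical; the paper's actual annotation interface and scoring are more involved.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AISJudgment:
    """One annotator's decision for a single (output, source) pair.
    Field names are illustrative, not the paper's schema."""
    interpretable: bool            # Stage 1: is the output understandable on its own?
    attributable: Optional[bool]   # Stage 2: is all of it supported by the source?

def ais_score(judgments: list[AISJudgment]) -> float:
    """Fraction of judgments that pass both stages; Stage 2 is only
    defined when Stage 1 passes, mirroring the two-stage pipeline."""
    passing = sum(1 for j in judgments if j.interpretable and j.attributable)
    return passing / len(judgments) if judgments else 0.0
```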
Coreference Resolution through a seq2seq Transition-based System
Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly, simplifying coreference resolution by eliminating both the search for mentions and the search for links. We implement the coreference system as a transition system and use multilingual T5 as the underlying language model. We obtain state-of-the-art accuracy with an 83.3 F1-score on the CoNLL-2012 data set. We use the SemEval-2010 data sets to evaluate on languages other than English, obtaining substantially higher zero-shot F1-scores than previous approaches for 3 out of 4 languages, and significantly exceeding previous supervised state-of-the-art results for all five tested languages.
A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation
Yao Zhao
Mirella Lapata
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Association for Computational Linguistics, pp. 21
We propose Composition Sampling, a simple but effective method for generating diverse outputs of higher quality in conditional generation than previous stochastic decoding strategies. It builds on recently proposed plan-based neural generation models (Narayan et al., 2021) that are trained to first create a composition of the output and then generate by conditioning on it and the input. Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to this entity chain. Experiments on summarization (CNN/DailyMail and XSum) and question generation (SQuAD), using existing and newly proposed automatic metrics together with human-based evaluation, demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful outputs.
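A minimal sketch of the two-stage decode, assuming a generic Hugging Face seq2seq checkpoint ("t5-small" as a placeholder) and an illustrative "[CONTENT]" separator; the actual plan-based models of Narayan et al. (2021) differ in detail.

```python
# Stage 1 samples entity-chain plans stochastically; Stage 2 beam-searches
# text grounded to each plan. Checkpoint and separator are assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def composition_sample(document: str, num_outputs: int = 4) -> list[str]:
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    # Stage 1: *sample* diverse entity-chain compositions (the stochastic step).
    chains = model.generate(**inputs, do_sample=True, top_p=0.95,
                            num_return_sequences=num_outputs, max_new_tokens=32)
    summaries = []
    for chain in chains:
        plan = tokenizer.decode(chain, skip_special_tokens=True)
        # Stage 2: beam-search the best text grounded to the sampled chain.
        grounded = tokenizer(f"{plan} [CONTENT] {document}",
                             return_tensors="pt", truncation=True)
        out = model.generate(**grounded, num_beams=4, max_new_tokens=128)
        summaries.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return summaries
```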
Query Refinement Prompts for Closed-Book Long-Form Question Answering
Reinald Kim Amplayo
arXiv submission (2022)
Large language models (LLMs) have been shown to perform well at answering questions and at producing long-form texts such as stories and explanations, both in few-shot closed-book settings. While the former can be validated using well-known evaluation metrics, the latter is difficult to evaluate. We therefore investigate the ability of LLMs to do both tasks at once -- to answer questions that require long-form answers. Such questions tend to be multifaceted, i.e., they may have ambiguities and/or require information from multiple sources. To this end, we define query refinement prompts that encourage LLMs to explicitly express the multifacetedness in questions and generate long-form answers covering multiple facets of the question. Our experiments on two long-form question answering datasets, ASQA and AQuAMuSe, show that using our prompts allows us to outperform fully finetuned models in the closed-book setting, as well as achieve results comparable to retrieve-then-generate open-book models.
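A minimal sketch of what such a prompt could look like; the exact wording used in the paper is not reproduced here.

```python
def query_refinement_prompt(question: str) -> str:
    """An illustrative refinement prompt (assumed wording, not the paper's):
    ask the LLM to surface the question's facets before answering."""
    return (
        f"Question: {question}\n"
        "This question may be ambiguous or have multiple facets. "
        "First, list the distinct sub-questions it could be asking. "
        "Then write a long-form answer that covers each sub-question.\n"
        "Sub-questions:"
    )

print(query_refinement_prompt("Who invented the telephone?"))
```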
The paper presents an approach to semantic grounding of language models (LMs) that conceptualizes the LM as a conditional model generating text given a desired semantic message. It embeds the LM in an auto-encoder by feeding its output to a semantic parser whose output is in the same representation domain as the input message.
Compared to a baseline that generates text using greedy search, we demonstrate two techniques that improve the fluency and semantic accuracy of the generated text. The first technique samples multiple candidate text sequences from which the semantic parser chooses. The second trains the language model while keeping the semantic parser frozen to improve the semantic accuracy of the auto-encoder.
We carry out experiments on the English WebNLG 3.0 data set, using BLEU to measure the fluency of generated text and standard parsing metrics to measure semantic accuracy. We show that our proposed approaches significantly improve on the greedy search baseline. Human evaluation corroborates the results of the automatic evaluation experiments.
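A minimal sketch of the sample-and-rerank technique, with `lm_sample` and `semantic_parse` as stand-in callables rather than the actual models used in the paper.

```python
# The semantic parser closes the auto-encoder loop: keep the candidate whose
# parse best reconstructs the input message. Stub comparison for illustration.

def exact_match(parsed, message) -> bool:
    return parsed == message  # real work would compare graphs/triples

def generate_grounded(message, lm_sample, semantic_parse, n: int = 8) -> str:
    candidates = [lm_sample(message) for _ in range(n)]
    for text in candidates:
        if exact_match(semantic_parse(text), message):
            return text
    return candidates[0]  # fall back if nothing parses back exactly
```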
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Pat Verga
Jianmo Ni
arXiv (2022)
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
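For intuition, a minimal sketch of an NLI-style automatic attribution check in the spirit of a correlated metric; the checkpoint and threshold below are illustrative assumptions, not the metric validated in the paper.

```python
# Score whether the cited passage entails the answer-bearing statement.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def is_attributed(passage: str, statement: str, threshold: float = 0.5) -> bool:
    batch = tok(passage, statement, return_tensors="pt", truncation=True)
    probs = torch.softmax(nli(**batch).logits, dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item() >= threshold  # threshold is an assumption
```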
QED: A Linguistically Principled Framework for Explainable Question Answering
Eunsol Choi
TACL (2021)
A question answering system that in addition to providing an answer provides an explanation of the reasoning that leads to that answer has potential advantages in terms of debuggability, extensibility, and trust. To this end, we propose QED, a linguistically informed, extensible framework for explanations in question answering. A QED explanation specifies the relationship between a question and answer according to formal semantic notions such as referential equality, sentencehood, and entailment. We describe and publicly release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset, and report baseline models on two tasks: post-hoc explanation generation given an answer, and joint question answering and explanation generation. In the joint setting, a promising result suggests that training on a relatively small amount of QED data can improve question answering. In addition to describing the formal, language-theoretic motivations for the QED approach, we describe a large user study showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.
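A minimal sketch of the structure a QED explanation commits to, with hypothetical field names; the released dataset's exact schema may differ.

```python
from dataclasses import dataclass

@dataclass
class ReferentialLink:
    question_phrase: str   # referring expression in the question
    passage_phrase: str    # coreferent expression in the passage

@dataclass
class QEDExplanation:
    sentence: str                  # selected passage sentence (sentencehood)
    links: list[ReferentialLink]   # referential equalities
    entailment_holds: bool         # the sentence entails the answer
```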
Evaluating Explanations: How much do explanations from teachers aid students?
Danish Pruthi
Rachit Bansal
Bhuwan Dhingra
Zachary Chase Lipton
Graham Neubig
Transactions of the Association for Computational Linguistics (TACL) (2021)
While many methods purport to explain predictions by highlighting salient features, what aims these explanations serve and how they ought to be evaluated often go unstated. In this work, we introduce a framework to quantify the value of explanations via the accuracy gains that they confer on a student model trained to simulate a teacher model. Crucially, the explanations are available to the student during training, but are not available at test time. Compared to prior proposals, our approach is less easily gamed, enabling principled, automatic, model-agnostic evaluation of attributions. Using our framework, we compare numerous attribution methods for text classification and question answering, and observe quantitative differences that are consistent (to a moderate to high degree) across different student model architectures and learning strategies.
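A minimal sketch of the protocol, with `teacher`, `explainer`, `train_student`, and `accuracy` as opaque stand-ins for the paper's models and training loops.

```python
# Explanations are available only at training time and withheld at test time,
# so the accuracy gain measures what the explanations actually taught.

def explanation_value(train_set, test_set, teacher, explainer,
                      train_student, accuracy) -> float:
    # Students learn to *simulate the teacher*, so teacher outputs are labels.
    labeled = [(x, teacher(x)) for x, _ in train_set]
    explained = [(x, teacher(x), explainer(x)) for x, _ in train_set]
    plain_student = train_student(labeled)
    aided_student = train_student(explained)
    return accuracy(aided_student, test_set) - accuracy(plain_student, test_set)
```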
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Eunsol Choi
Transactions of the Association for Computational Linguistics (2020)
Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA, a question answering dataset covering 11 typologically diverse languages. Until recently, most multilingual research in natural language processing has been limited to machine translation or to technical tasks such as tagging and parsing. Question answering offers a scenario that is natural in that non-technical users intuitively understand the task, allowing high quality data collection in the absence of abundant annotators with expertise in both linguistics and the language of interest. This allows us to select languages that are diverse with regard to their typology -- the set of linguistic features that each language expresses. We expect that models that can perform well on this set will generalize across a large number of the languages in the world. To encourage a more realistic distribution, the data is collected entirely in each native language without the use of translation (human or otherwise) and question creation is performed without seeing the answers. We present a quantitative analysis of the data quality, provide example-level linguistic analyses and glosses of language phenomena that would not be found in English-only corpora, and measure the performance of baseline systems.
Kernel Approximation Methods for Speech Recognition
Alireza Bagheri Garakani
Aurélien Bellet
Avner May
Brian Kingsbury
Daniel Hsu
Dong Guo
Fei Sha
Kuan Liu
Linxi Fan
Michael Picheny
Zhiyun Lu
Journal of Machine Learning Research (2019)
We study large-scale kernel methods for acoustic modeling in speech recognition and compare their performance to deep neural networks (DNNs). We perform experiments on four speech recognition datasets, including the TIMIT and Broadcast News benchmark tasks, and compare these two types of models on frame-level performance metrics (accuracy, cross-entropy), as well as on recognition metrics (word/character error rate). In order to scale kernel methods to these large datasets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection. The method is able to explore a large number of non-linear features while maintaining a compact model more efficiently than existing approaches. Second, we present a number of frame-level metrics which correlate very strongly with recognition performance when computed on the heldout set; we take advantage of these correlations by monitoring these metrics during training in order to decide when to stop learning. This technique can noticeably improve the recognition performance of both DNN and kernel models, while narrowing the gap between them. Additionally, we show that the linear bottleneck method of Sainath et al. (2013a) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Together, these three methods dramatically improve the performance of kernel acoustic models, making their performance comparable to DNNs on the tasks we explored.
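For reference, a minimal NumPy sketch of the random Fourier feature map of Rahimi and Recht (2007) for the Gaussian kernel, at toy scale rather than the acoustic-model scale used in the paper.

```python
import numpy as np

def random_fourier_features(X, num_features=1024, sigma=1.0, seed=0):
    """Map X (n, d) to Z (n, D) so that Z @ Z.T approximates the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))  # omega ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

X = np.random.randn(5, 40)   # toy stand-in for a few acoustic feature frames
Z = random_fourier_features(X)
print(Z @ Z.T)               # approximate kernel matrix
```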
Fusion of Detected Objects in Text for Visual Question Answering
Jeffrey Ling
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019), pp. 2131-2140
To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The “Bounding Boxes in Text Transformer” (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark (visualcommonsense.org), achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard. A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture. A reference implementation of our models is provided as supplementary material.
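A minimal PyTorch sketch of the early-fusion idea the ablation points to: projected box features are added to the embeddings of the tokens bound to them before the transformer runs. Shapes and names here are illustrative, not the B2T2 implementation.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Illustrative fusion layer; dimensions are assumptions."""
    def __init__(self, visual_dim=2048, hidden_dim=768):
        super().__init__()
        self.project = nn.Linear(visual_dim, hidden_dim)

    def forward(self, token_embeddings, box_features, bindings):
        # bindings: list of (token_index, box_index) referential links
        fused = token_embeddings.clone()
        visual = self.project(box_features)
        for tok_i, box_i in bindings:
            fused[tok_i] += visual[box_i]
        return fused  # feed into the text transformer as usual
```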
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark
Ming-Wei Chang
NAACL 2019
In this paper we study yes/no questions that are naturally occurring---meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve.
We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy, compared to 90% accuracy for human annotators (and a 62% majority baseline), leaving a significant gap for future work.
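A minimal sketch of the transfer recipe (an MNLI-finetuned encoder re-trained on BoolQ as a binary classifier), using a stand-in checkpoint; the paper's exact models and hyperparameters are not reproduced.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any BERT-family checkpoint already finetuned on MultiNLI would slot in here;
# this checkpoint name is a stand-in, not the one used in the paper.
mnli_checkpoint = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(mnli_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    mnli_checkpoint, num_labels=2, ignore_mismatched_sizes=True)

def encode(question: str, passage: str):
    # BoolQ pairs a yes/no question with a passage; labels are {0: no, 1: yes}.
    return tok(question, passage, truncation=True, return_tensors="pt")
# ...then fine-tune `model` on the BoolQ training set as a binary classifier.
```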
Natural Questions: a Benchmark for Question Answering Research
Olivia Redfield
Danielle Epstein
Illia Polosukhin
Matthew Kelcey
Jacob Devlin
Llion Jones
Ming-Wei Chang
Jakob Uszkoreit
Transactions of the Association for Computational Linguistics (2019)
We present the Natural Questions corpus, a question answering dataset. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 5-way annotated examples sequestered as test data. We present experiments validating the quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
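For intuition, a minimal sketch of a robust multi-annotator evaluation rule of this kind; the 2-of-5 answerability threshold below is an illustrative assumption, and the official definitions are in the paper.

```python
# With 5-way annotations, only examples where enough annotators found an
# answer count as answerable, and a prediction is credited if it matches
# any annotator's answer. None stands for a null annotation.

def is_answerable(annotations, min_votes: int = 2) -> bool:
    return sum(a is not None for a in annotations) >= min_votes

def is_correct(prediction, annotations) -> bool:
    return prediction is not None and prediction in annotations
```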
Synthetic QA Corpora Generation with Roundtrip Consistency
Emily Pitler
Jacob Devlin
Association for Computational Linguistics (ACL), Florence, Italy (2019)
We introduce a novel method of generating synthetic question answering corpora by combining models of question generation and answer extraction, and by filtering the results to ensure roundtrip consistency. By pretraining on the resulting corpora we obtain significant improvements on SQuAD2 and NQ, establishing a new state-of-the-art on the latter. Our synthetic data generation models, for both question generation and answer extraction, can be fully reproduced by finetuning a publicly available BERT model on the extractive subsets of SQuAD2 and NQ. We also describe a more powerful variant that does full sequence-to-sequence pretraining for question generation, obtaining exact match and F1 scores within 0.1% and 0.4% of human performance on SQuAD2.
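A minimal sketch of the roundtrip filter, with `generate_question` and `extract_answer` as stand-ins for the finetuned models described above.

```python
def roundtrip_filter(pairs, generate_question, extract_answer):
    """Keep only (context, answer) pairs whose generated question, when
    answered from the same context, recovers the original answer."""
    kept = []
    for context, answer in pairs:
        question = generate_question(context, answer)
        if extract_answer(context, question) == answer:
            kept.append((context, question, answer))
    return kept
```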
A Polynomial-Time Dynamic Programming Algorithm for Phrase-Based Decoding with a Fixed Distortion Limit
Transactions of the Association for Computational Linguistics (TACL), vol. 5 (2017), pp. 59-71
Decoding of phrase-based translation models in the general case is known to be NP-complete, by a reduction from the traveling salesman problem (Knight, 1999). In practice, phrase-based systems often impose a hard distortion limit that limits the movement of phrases during translation. However, the impact on complexity after imposing such a constraint is not well studied. In this paper, we describe a dynamic programming algorithm for phrase-based decoding with a fixed distortion limit. The runtime of the algorithm is O(n d! l h^{d+1}), where n is the sentence length, d is the distortion limit, l is a bound on the number of phrases starting at any position in the sentence, and h is related to the maximum number of target language translations for any source word. The algorithm makes use of a novel representation that gives a new perspective on decoding of phrase-based models.
Source-Side Left-to-Right or Target-Side Left-to-Right? An Empirical Comparison of Two Phrase-Based Decoding Algorithms
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1496-1500
This paper describes an empirical study of the phrase-based decoding algorithm proposed by Chang and Collins (2017). The algorithm produces a translation by processing the source-language sentence in strictly left-to-right order, differing from commonly used approaches that build the target-language sentence in left-to-right order. Our results show that the new algorithm is competitive with Moses (Koehn et al., 2007) in terms of both speed and BLEU scores.
Transforming Dependency Structures to Logical Forms for Semantic Parsing
Siva Reddy
Oscar Täckström
Mark Steedman
Mirella Lapata
Transactions of the Association for Computational Linguistics, vol. 4 (2016)
The strongly typed syntax of grammar formalisms such as CCG, TAG, LFG and HPSG offers a synchronous framework for deriving syntactic structures and semantic logical forms. In contrast, partly due to the lack of a strong type system, dependency structures are easy to annotate and have become a widely used form of syntactic analysis for many languages. However, the lack of a type system makes a formal mechanism for deriving logical forms from dependency structures challenging. We address this by introducing a robust system based on the lambda calculus for deriving neo-Davidsonian logical forms from dependency trees. These logical forms are then used for semantic parsing of natural language to Freebase. Experiments on the Free917 and WebQuestions datasets show that our representation is superior to the original dependency trees and that it outperforms a CCG-based representation on this task. Compared to prior work, we obtain the strongest result to date on Free917 and competitive results on WebQuestions.
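For a toy illustration of the target representation (the example sentence is ours, not from the paper): the dependency tree of "John saw Mary", with nsubj(saw, John) and dobj(saw, Mary), maps to a neo-Davidsonian logical form with an explicit event variable:

```latex
\exists e.\; \mathit{see}(e) \wedge \mathit{arg}_1(e, \mathit{John}) \wedge \mathit{arg}_2(e, \mathit{Mary})
```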
Globally Normalized Transition-Based Neural Networks
Association for Computational Linguistics (2016)
We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.
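In notation summarizing that contrast (ours, following the abstract): with per-step decision scores $\rho$, a locally normalized model applies a softmax at every step of the decision sequence $d_{1:n}$, while a globally normalized model has a single sequence-level partition function:

```latex
% locally normalized: a softmax at every step
p_L(d_{1:n}) = \prod_{j=1}^{n} \frac{\exp \rho(d_{1:j-1}, d_j)}{\sum_{d'} \exp \rho(d_{1:j-1}, d')}
% globally normalized: one partition function over all decision sequences
p_G(d_{1:n}) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j)}{\sum_{d'_{1:n}} \exp \sum_{j=1}^{n} \rho(d'_{1:j-1}, d'_j)}
```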
Structured Training for Neural Network Transition-Based Parsing
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL '15) (2015)
Case-Factor Diagrams for Structured Probabilistic Modeling
David McAllester
Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (2004)
Answer Extraction
AT&T at TREC-8
Amit Singhal
Steven P. Abney
Donald Hindle
TREC (1999)