William W. Cohen

William Cohen received his bachelor's degree in Computer Science from Duke University in 1984, and a PhD in Computer Science from Rutgers University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T Bell Labs and later AT&T Labs-Research, and from April 2000 to May 2002 he worked at Whizbang Labs, a company specializing in extracting information from the web. From 2002 to 2018, Dr. Cohen worked at Carnegie Mellon University in the Machine Learning Department, with a joint appointment in the Language Technologies Institute, as an Associate Research Professor, a Research Professor, and a Professor. Dr. Cohen was also the Director of the Undergraduate Minor in Machine Learning at CMU and co-Director of the Master of Science in ML Program.

Dr. Cohen is a past president of the International Machine Learning Society. He has also served as an action editor for the AI and Machine Learning series of books published by Morgan & Claypool, for the journal Machine Learning, the journal Artificial Intelligence, the Journal of Machine Learning Research, and the Journal of Artificial Intelligence Research. He was General Chair of the 2008 International Machine Learning Conference, held July 6-9 at the University of Helsinki in Finland; Program Co-Chair of the 2006 International Machine Learning Conference; and Co-Chair of the 1994 International Machine Learning Conference. Dr. Cohen was also co-Chair of the 3rd Int'l AAAI Conference on Weblogs and Social Media, held May 17-20, 2009 in San Jose, and co-Program Chair of the 4th Int'l AAAI Conference on Weblogs and Social Media. He is an AAAI Fellow, and was a winner of the 2008 SIGMOD "Test of Time" Award for the most influential SIGMOD paper of 1998, and of the 2014 SIGIR "Test of Time" Award for the most influential SIGIR paper of 2002-2004.

Dr. Cohen's research interests include information integration and machine learning, particularly information extraction, text categorization, and learning from large datasets. He has a long-standing interest in statistical relational learning and in learning models of, or learning from, data that displays non-trivial structure. He holds seven patents related to learning, discovery, information retrieval, and data integration, and is the author of more than 200 publications.

Authored Publications
    Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this paper, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer while mixing factual quoted spans, copied verbatim from given input sources, with non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder-to-attribute fully abstractive answers. In particular, it enables a new mode for language models that leverages their advanced language generation capabilities while also producing fine-grained in-line attributions by design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of our work for developing and studying such consolidation capabilities.
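    The semi-extractive format makes span-level verification straightforward. Below is a minimal sketch of that check in Python, assuming a hypothetical markup in which each quoted span is written as "[2: quoted text]" with the number indexing the cited source; this is an illustration, not the paper's actual answer format or evaluation metric.

        import re

        def verify_semi_extractive(answer, sources):
            """Check that every quoted span in a semi-extractive answer appears
            verbatim in the source it cites. Assumes the hypothetical markup
            "[2: quoted text]", where the number is a 1-based source index."""
            results = []
            for src_id, span in re.findall(r"\[(\d+):\s*(.+?)\]", answer):
                idx = int(src_id) - 1
                supported = 0 <= idx < len(sources) and span in sources[idx]
                results.append((span, supported))
            return results

        sources = [
            "The Amazon is the largest rainforest on Earth.",
            "The Congo Basin hosts the second-largest rainforest.",
        ]
        answer = ("The largest is [1: the largest rainforest on Earth], while "
                  "[2: the second-largest rainforest] lies in the Congo Basin.")
        print(verify_semi_extractive(answer, sources))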
    In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.
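    A minimal sketch of the retrieval-as-generation idea, assuming a HuggingFace T5 checkpoint: "index" documents by training the model to map document text to its docid string, then answer a query by generating a docid. The toy corpus, docid scheme, and single optimization loop are placeholders, not the paper's training recipe (which, among other things, also fine-tunes on labeled query-to-docid pairs).

        import torch
        from transformers import T5ForConditionalGeneration, T5Tokenizer

        tok = T5Tokenizer.from_pretrained("t5-small")
        model = T5ForConditionalGeneration.from_pretrained("t5-small")

        # Toy corpus; DSI studies several ways of representing docids.
        corpus = {"doc_17": "the eiffel tower is in paris",
                  "doc_42": "mount fuji is the highest peak in japan"}

        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        model.train()
        for _ in range(10):  # tiny illustrative loop, not a real schedule
            for docid, text in corpus.items():
                # "Indexing": learn to map document text to its docid string.
                inputs = tok(text, return_tensors="pt")
                labels = tok(docid, return_tensors="pt").input_ids
                loss = model(**inputs, labels=labels).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

        # "Retrieval": the model generates a docid directly from the query.
        model.eval()
        query = tok("where is the eiffel tower", return_tensors="pt")
        pred = model.generate(**query, max_new_tokens=8)
        print(tok.decode(pred[0], skip_special_tokens=True))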
    Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
    Michiel de Jong
    Yury Zemlyanskiy
    10th International Conference on Learning Representations, ICLR 2022, Virtual Conference, April 25-29, 2022, OpenReview.net
    Natural language understanding tasks such as open-domain question answering often require retrieving and assimilating factual information from multiple sources. We propose to address this problem by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge. Specifically, our method represents knowledge as a "mention memory" containing a dense vector representation of every entity mention in a corpus. The Transformer model accesses the information through internal memory layers in which each entity mention in the passage being read attends to the mention memory. This approach enables synthesis of and reasoning over many disparate sources of information within a single Transformer model. In experiments using a memory of ~150 million Wikipedia mentions, our model provides strong improvements in performance on several open-domain knowledge-intensive tasks, including the claim verification benchmarks FEVER and HoVer and several entity-based QA benchmarks. We also show that the model learns to attend to informative mentions without any direct supervision. Finally, we show that the model can be adapted to generalize to new unseen entities by updating the memory, without retraining.
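    A toy sketch of the core memory-attention step: contextual query vectors from the passage attend over a precomputed table of dense mention encodings and read back a weighted mixture of their values. The dimensions and random vectors below are placeholders; the actual model uses a learned mention encoder, key/value projections inside Transformer layers, and approximate top-k search over the ~150M-entry memory.

        import torch
        import torch.nn.functional as F

        d_key, d_val, num_mentions, num_tokens = 64, 128, 1000, 12

        # Precomputed mention memory: one key and one value vector per entity
        # mention in the corpus (random here, produced by a trained encoder
        # in the paper).
        memory_keys = torch.randn(num_mentions, d_key)
        memory_values = torch.randn(num_mentions, d_val)

        # Query vectors for the entity mentions in the passage being read.
        passage_queries = torch.randn(num_tokens, d_key)

        # Dense attention over the whole memory.
        scores = passage_queries @ memory_keys.T / d_key ** 0.5  # (tokens, mentions)
        weights = F.softmax(scores, dim=-1)
        retrieved = weights @ memory_values                      # (tokens, d_val)
        print(retrieved.shape)  # torch.Size([12, 128])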
    Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution? and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
    Language models have been shown to store massive amounts of world knowledge implicitly in their parameters. However, even with ever-larger networks, models often fail to encode infrequent information such as rare entities and events, while paying the price of massively increasing computational costs. Recently, retrieval-augmented models such as REALM, RAG, and RETRO were proposed to incorporate world knowledge into language models by leveraging an external non-parametric index, achieving impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images, much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language model pre-training. MuRAG is pre-trained with a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. In experiments, we evaluate MuRAG's performance on two downstream datasets that require retrieving and reasoning over both images and text to answer a given query, WebQA and MultimodalQA. Our results show that MuRAG outperforms competitive baselines by more than 10% accuracy, achieving the best-known performance on those tasks.
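    The retrieval step can be pictured as a single maximum inner product search over a memory whose entries, whether they came from images or from text, live in one shared embedding space. A schematic sketch with random embeddings standing in for a trained multimodal encoder, and brute-force search standing in for an approximate nearest-neighbor index.

        import numpy as np

        dim, k = 256, 3
        rng = np.random.default_rng(0)

        # Mixed non-parametric memory: embeddings of image-text and text-only
        # items, all produced by the same (here imaginary) multimodal encoder.
        memory = rng.standard_normal((10_000, dim)).astype(np.float32)
        memory /= np.linalg.norm(memory, axis=1, keepdims=True)

        query = rng.standard_normal(dim).astype(np.float32)
        query /= np.linalg.norm(query)

        # Maximum inner product search (brute force for illustration).
        scores = memory @ query
        top_k = np.argsort(-scores)[:k]
        print(top_k, scores[top_k])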
    By nature of the cost and time required to train Large Language Models (LLMs), the embedded knowledge within them is usually frozen at the moment their training data is collected. As a result, LLMs have been shown to suffer from diachronic degradation. The in-context learning paradigm can provide a workaround for this limitation by supplying relevant information at inference time. We introduce a new benchmark to evaluate LLMs for one particular but critical aspect of diachronic change: language acquisition. To that end, we rewrite Winograd-style co-reference resolution problems by replacing a word with a new synthetic but plausible English word. The meaning of the word is given to the model in the prompt via a dictionary definition. We show that the accuracy of LLMs decreases radically on our benchmark compared to the original Winograd tasks, and we believe this serves as a measure of progress for future models.
    MATE: Multi-view Attention for Table Transformer Efficiency
    Maharshi Gor
    Thomas Müller
    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics
    This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly with respect to speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state of the art for three table reasoning datasets. For HybridQA (Chen et al., 2020b), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
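    A small sketch of the row/column sparsity pattern: for a table linearized row by row, a "row head" may only attend within the same row and a "column head" only within the same column, which is what lets attention cost grow linearly rather than quadratically with table size. The one-token-per-cell bookkeeping below is a simplification of the real tokenization.

        import numpy as np

        def table_attention_masks(n_rows, n_cols):
            """Boolean masks for a table linearized row-major, one token per cell.
            row_mask[i, j] is True iff tokens i and j share a row; col_mask
            likewise for columns."""
            n = n_rows * n_cols
            rows = np.arange(n) // n_cols
            cols = np.arange(n) % n_cols
            row_mask = rows[:, None] == rows[None, :]
            col_mask = cols[:, None] == cols[None, :]
            return row_mask, col_mask

        row_mask, col_mask = table_attention_masks(n_rows=3, n_cols=4)
        # 48 and 36 allowed pairs versus 144 for full quadratic attention.
        print(row_mask.sum(), col_mask.sum())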
    Adaptable and Interpretable Neural Memory Over Symbolic Knowledge
    Haitian Sun
    Pat Verga
    Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (2021), pp. 3678-3691
    Past research has demonstrated that large neural language models (LMs) encode surprising amounts of factual information; however, augmenting or modifying this information requires modifying a corpus and retraining, which is computationally expensive. To address this problem, we develop a neural LM that includes an interpretable neuro-symbolic KB in the form of a “fact memory”. Each element of the fact memory is formed from a triple of vectors, where each vector corresponds to a KB entity or relation. Our LM improves performance on knowledge-intensive question-answering tasks, sometimes dramatically, including a 27-point increase in one setting of WebQuestionsSP over a state-of-the-art open-book model, despite using 5% of the parameters. Most interestingly, we demonstrate that the model can be modified, without any re-training, by updating the fact memory.
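    A toy rendering of the fact-memory idea: each stored fact is a triple of vectors, a query attends over keys built from the subject and relation vectors, and the value read back is a mixture of object vectors; editing the KB then amounts to overwriting rows of these tables with no retraining, which is the property highlighted above. The embeddings and the key construction below are illustrative placeholders, not the paper's parameterization.

        import torch
        import torch.nn.functional as F

        d = 64
        entities = {"Paris": 0, "France": 1, "Berlin": 2, "Germany": 3}
        relations = {"capital_of": 0}
        entity_emb = torch.randn(len(entities), d)
        relation_emb = torch.randn(len(relations), d)

        # Fact memory: keys built from (subject, relation), values are objects.
        triples = [("Paris", "capital_of", "France"),
                   ("Berlin", "capital_of", "Germany")]
        keys = torch.stack([entity_emb[entities[s]] + relation_emb[relations[r]]
                            for s, r, _ in triples])
        values = torch.stack([entity_emb[entities[o]] for _, _, o in triples])

        # A query vector (produced by the LM from the question in the paper).
        query = entity_emb[entities["Paris"]] + relation_emb[relations["capital_of"]]
        weights = F.softmax(keys @ query / d ** 0.5, dim=0)
        answer_vec = weights @ values  # should be close to the "France" embedding
        print(weights)

        # Editing a fact without retraining: overwrite the stored value row.
        values[0] = entity_emb[entities["Germany"]]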
    Although large neural language models (LMs) like BERT can be finetuned to yield state-of-the-art results on many NLP tasks, it is often unclear what these models actually learn. Here we study using such LMs to fill in entities in comparative questions, like “Which country is older, India or ___?”; i.e., we study the ability of neural LMs to ask (not answer) reasonable questions. We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable, and that these models can be trained to achieve nearly human-level performance in completing comparative questions in three different sub-domains. However, analysis shows that what they learn fails to model any sort of broad notion of which entities are semantically comparable or similar; instead the trained models are very domain-specific, and performance is highly correlated with co-occurrences between specific entities observed in the training set. This is true both for models that are pre-trained on general text corpora, as well as models trained on a large corpus of comparison questions. Our study thus reinforces recent results on the difficulty of making claims about a deep model’s world knowledge or linguistic competence based on performance on specific benchmark problems. We make our evaluation datasets publicly available to foster future research.
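    The probing setup can be reproduced in spirit with an off-the-shelf masked LM: blank out the second entity of a comparative question and inspect the model's top completions. A minimal sketch with a public BERT checkpoint; the models, prompts, and sub-domains studied in the paper differ.

        from transformers import pipeline

        # Fill-in-the-blank over a comparative question with a masked LM.
        fill = pipeline("fill-mask", model="bert-base-uncased")
        for cand in fill("Which country is older, India or [MASK]?", top_k=5):
            print(f'{cand["token_str"]:>12}  {cand["score"]:.3f}')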
    Evaluating Explanations: How much do explanations from teachers aid students?
    Danish Pruthi
    Rachit Bansal
    Bhuwan Dhingra
    Zachary Chase Lipton
    Graham Neubig
    Transactions of the Association for Computational Linguistics (TACL) (2021)
    While many methods purport to explain predictions by highlighting salient features, what aims these explanations serve and how they ought to be evaluated often go unstated. In this work, we introduce a framework to quantify the value of explanations via the accuracy gains that they confer on a student model trained to simulate a teacher model. Crucially, the explanations are available to the student during training, but are not available at test time. Compared to prior proposals, our approach is less easily gamed, enabling principled, automatic, model-agnostic evaluation of attributions. Using our framework, we compare numerous attribution methods for text classification and question answering, and observe quantitative differences that are consistent (to a moderate to high degree) across different student model architectures and learning strategies.
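    A toy version of the protocol: a "teacher" applies a fixed rule, one student is trained to simulate the teacher from a few examples, and a second student additionally gets the teacher's explanations (here, which feature mattered) during training only; both are then scored on unexplained test inputs, and the gap is the value assigned to the explanations. The rule, the masking trick, and the data below are contrived for illustration and are not the paper's models or protocols.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n_features, n_train, n_test = 20, 15, 500

        # Teacher: a fixed rule that only looks at feature 3, so its
        # "explanation" for every prediction is the index of that feature.
        X_train = rng.integers(0, 2, (n_train, n_features)).astype(float)
        X_train[0, 3], X_train[1, 3] = 1.0, 0.0   # ensure both labels appear
        X_test = rng.integers(0, 2, (n_test, n_features)).astype(float)
        teacher = lambda X: X[:, 3].astype(int)
        y_train, y_test = teacher(X_train), teacher(X_test)
        explained_feature = 3   # available at training time only

        # Student A: plain simulation of the teacher from few examples.
        student_a = LogisticRegression().fit(X_train, y_train)

        # Student B: one crude way to exploit explanations during training is
        # to mask out features the explanation marks as irrelevant. Test
        # inputs are raw and unmasked for both students.
        mask = np.zeros(n_features)
        mask[explained_feature] = 1.0
        student_b = LogisticRegression().fit(X_train * mask, y_train)

        print("simulation accuracy, no explanations:  ", student_a.score(X_test, y_test))
        print("simulation accuracy, with explanations:", student_b.score(X_test, y_test))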
    It is only a matter of time before facts become out of date: from the name of the POTUS to the basketball team LeBron James plays for. This continuously limits the usefulness of previously collected datasets and language models (LMs) trained on them. This problem is exacerbated as LMs are used in the closed-book question answering setting, where the pretraining data must contain the facts for the model to remember within its fixed parameters. A frequent paradigm is to update or refresh the dataset every so often, then retrain models with the new data: this is costly, but does it work? In this paper, we introduce a diagnostic dataset for probing LMs for factual knowledge that changes over time. Using it, we show that models trained only on the most recent slice of data perform worse on questions about the past than models trained on uniform data across time, while being better on current and future questions. Moreover, we propose jointly modeling text with the time it was created and show that this improves memorization of previous facts, as well as reasoning about the uncertainty around future facts. We also show that models trained with temporal context allow for efficient refreshes as new data arrives, without the need to retrain from scratch.
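    The "jointly modeling text with the time it was created" idea can be illustrated as input preprocessing: prepend the creation year so the model can condition facts on when they were written. A sketch of the data side only, with a made-up prefix format; the paper's exact formatting and training setup are not reproduced here.

        def add_time_prefix(text, year):
            """Prepend the document's creation year so an LM can condition on it."""
            return f"year: {year} text: {text}"

        examples = [
            ("The president of the United States is Barack Obama.", 2014),
            ("The president of the United States is Joe Biden.", 2021),
        ]
        for text, year in examples:
            print(add_time_prefix(text, year))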
    Differentiable Multi-hop Reasoning over a Virtual Knowledge Base
    Bhuwan Dhingra
    Graham Neubig
    Ruslan Salakhutdinov
    Vidhisha Balachandran
    ICLR (2020)
    We put forward an approach for accessing text as a knowledge base which is useful for question answering (QA). This approach relies centrally on the development of a differentiable operator which allows us to traverse textual data like a "virtual" KB. The core of the approach is a neural module that inputs and outputs sets of entities: in particular, this module uses maximum inner product search (MIPS) on a special index to map a set of entities X to all entities Y related to something in X (by some specified relations), as witnessed by some text in the corpus. For multi-hop questions, the set of output entities Y can again be used recursively as the input to a second copy of the module, enabling us to answer complex questions. This module is differentiable, so the full system can be trained completely end-to-end using gradient-based methods. Thus, we name it DrKIT: Differentiable Reasoning over a virtual Knowledge base of Indexed Text. We describe a pretraining scheme for the index mention encoder that generates hard negative examples using existing knowledge bases, and we show that DrKIT improves accuracy by 9 points on 3-hop questions in the MetaQA dataset, cutting the gap between text-based and KB-based methods by 70%. DrKIT is also very efficient, processing 10x more queries per second than existing state-of-the-art multi-hop QA systems.
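    A dense toy version of one hop: a soft set of entities is expanded to the mentions that co-occur with them in the corpus, mentions are scored against the question by an inner product (the role MIPS plays in the real system), and the scores are aggregated back onto the entities those mentions refer to, producing a soft set of answers that can feed the next hop. The matrices below are tiny and random where the paper uses a pretrained mention encoder and sparse corpus statistics.

        import numpy as np

        rng = np.random.default_rng(1)
        n_entities, n_mentions, d = 5, 8, 16

        # Corpus structure: which mentions co-occur with which entity, and
        # which entity each mention refers to.
        entity_to_mention = (rng.random((n_entities, n_mentions)) < 0.4).astype(float)
        mention_to_entity = np.eye(n_entities)[rng.integers(0, n_entities, n_mentions)]

        # Dense encodings of the mention index and of the question.
        mention_enc = rng.standard_normal((n_mentions, d))
        question_enc = rng.standard_normal(d)

        def softmax(x):
            z = np.exp(x - x.max())
            return z / z.sum()

        def one_hop(entity_weights):
            """Entities -> co-occurring mentions -> scored -> answer entities."""
            candidate_mentions = entity_weights @ entity_to_mention   # soft expansion
            relevance = softmax(mention_enc @ question_enc)           # MIPS-style scores
            return (candidate_mentions * relevance) @ mention_to_entity

        start = np.zeros(n_entities)
        start[0] = 1.0                  # the question's seed entity
        hop2 = one_hop(one_hop(start))  # multi-hop: feed the output back in
        print(hop2.round(3))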
    Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, show poor correlation with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large-scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information-extraction-based evaluation proposed by Wiseman et al. (2017), and show that PARENT has comparable correlation to it, while being easier to use. We show that PARENT is also applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.
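    A heavily simplified, unigram-level rendering of the intuition: credit words in the generated text only if they are supported by the reference or by the table, and measure recall against the table values. This is an illustration of the alignment idea only, not the published PARENT formula, which works over entailed n-grams and combines reference and table recall.

        import re

        def words(text):
            return re.findall(r"\w+", text.lower())

        def simplified_parent(generated, reference, table_values):
            """Unigram sketch of PARENT-style scoring against text and table."""
            gen = words(generated)
            ref = set(words(reference))
            tab = set(w for v in table_values for w in words(v))
            supported = [w for w in gen if w in ref or w in tab]
            precision = len(supported) / max(len(gen), 1)
            recall = sum(1 for w in tab if w in set(gen)) / max(len(tab), 1)
            f1 = 2 * precision * recall / max(precision + recall, 1e-9)
            return precision, recall, f1

        table = ["Marie Curie", "born 1867", "Warsaw"]
        reference = "Marie Curie was born in Warsaw."
        generated = "Marie Curie, born in 1867 in Warsaw, was a physicist."
        print(simplified_parent(generated, reference, table))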
    Incremental Learning from Text for Question Answering
    Samira Abnar
    Continual Learning Workshop, Neural Information Processing Systems (NIPS) 2018
    Any system which performs goal-directed continual learning must not only learn incrementally but also process and absorb information incrementally. Such a system also has to understand when its goals have been achieved. In this paper, we consider these issues in the context of question answering. Current state-of-the-art question answering models reason over an entire passage, not incrementally. As we will show, naive approaches to incremental reading, such as restricting the model to a unidirectional language model, perform poorly. We present extensions to the DocQA model that allow incremental reading without loss of accuracy. The model also jointly learns to provide the best answer given the text seen so far and to predict whether this best-so-far answer is sufficient.
    Structured Literature Image Finder: Extracting Information from Text and Images in Biomedical Literature
    Luis Pedro Coelho
    Andrew Arnold
    Joshua Kangas
    Saboor Sheikh
    Eric P. Xing
    Robert Murphy
    Lecture Notes in Bioinformatics, Springer (2010)
    Structured Literature Image Finder: Parsing Text and Figures in Biomedical Literature
    Andrew Arnold
    Luis Pedro Coelho
    Joshua Kangas
    Saboor Sheikh
    Eric P. Xing
    Robert Murphy
    Journal of Web Semantics (2010)
    SLIF: Structured Literature Image Finder
    Andrew Arnold
    Luis Pedro Coelho
    Joshua Kangas
    Saboor Sheikh
    Eric P. Xing
    Robert Murphy
    The Annual Meeting of the ISMB BioLINK Special Interest Group: Text Mining, Image Analysis and the Future of Scientific Publishing, in Association with ISMB/ECCB (2009)
    Information Extraction as Link Prediction: Using Curated Citation Networks to Improve Gene Detection
    Andrew Arnold
    AAAI Conference on Weblogs and Social Media (ICWSM) (2009)
    Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
    Andrew Arnold
    Ramesh Nallapati
    46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL:HLT) (2008)
    Intra-document Structural Frequency Features for Semi-supervised Domain Adaptation
    Andrew Arnold
    Association for Computing Machinery Conference on Information and Knowledge Management (CIKM) (2008)
    A Comparative Study of Methods for Transductive Transfer Learning
    Andrew Arnold
    Ramesh Nallapati
    IEEE International Conference on Data Mining (ICDM) 2007 Workshop on Mining and Management of Biological Data (2007)