William W. Cohen

William Cohen received his bachelor's degree in Computer Science from Duke University in 1984, and a PhD in Computer Science from Rutgers University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T Bell Labs and later AT&T Labs-Research, and from April 2000 to May 2002 Dr. Cohen worked at Whizbang Labs, a company specializing in extracting information from the web. From 2002 to 2018, Dr. Cohen worked at Carnegie Mellon University in the Machine Learning Department, with a joint appointment in the Language Technologies Institute, as an Associate Research Professor, a Research Professor, and a Professor. Dr. Cohen was also the Director of the Undergraduate Minor in Machine Learning at CMU and co-Director of the Master of Science in ML Program.

Dr. Cohen is a past president of the International Machine Learning Society. In the past he has also served as an action editor for the AI and Machine Learning series of books published by Morgan & Claypool, for the journal Machine Learning, the journal Artificial Intelligence, the Journal of Machine Learning Research, and the Journal of Artificial Intelligence Research. He was General Chair for the 2008 International Machine Learning Conference, held July 6-9 at the University of Helsinki, in Finland; Program Co-Chair of the 2006 International Machine Learning Conference; and Co-Chair of the 1994 International Machine Learning Conference. Dr. Cohen was also the co-Chair for the 3rd Int'l AAAI Conference on Weblogs and Social Media, which was held May 17-20, 2009 in San Jose, and was the co-Program Chair for the 4th Int'l AAAI Conference on Weblogs and Social Media. He is a AAAI Fellow, and was a winner of the 2008 SIGMOD "Test of Time" Award for the most influential SIGMOD paper of 1998, and the 2014 SIGIR "Test of Time" Award for the most influential SIGIR paper of 2002-2004.

Dr. Cohen's research interests include information integration and machine learning, particularly information extraction, text categorization and learning from large datasets. He has a long-standing interest in statistical relational learning and learning models, or learning from data, that display non-trivial structure. He holds seven patents related to learning, discovery, information retrieval, and data integration, and is the author of more than 200 publications.

Authored Publications
    Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this paper, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer while mixing between factual quoted spans (copied verbatim from given input sources) and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of our work for developing and studying such consolidation capabilities.
    Language Models have been shown to store massive amounts of world knowledge implicitly in their parameters. However, even with ever-larger networks, models often fail to encode infrequent information such as rare entities/events, while paying the price of massively increasing computational costs. Recently, retrieval-augmented models, such as REALM, RAG, and RETRO, were proposed to incorporate world knowledge into language models by leveraging an external non-parametric index, achieving impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images - much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language model pre-training. MuRAG is pre-trained with a mixture of large-scale image-text and text-only corpora using a joint contrastive and generative loss. In experiments, we evaluate MuRAG's performance on two downstream datasets that require retrieving and reasoning over both images and text to answer a given query, WebQA and MultimodalQA. Our results show that MuRAG outperforms competitive baselines by more than 10% accuracy - achieving the best-known performance on those tasks.
    In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.
    Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
    By nature of the cost and time required to train Large Language Models (LLMs), the embedded knowledge within is usually frozen at the moment their training data is collected. As a result, LLMs have been shown to suffer from diachronic degradation. The in-context learning paradigm can provide a workaround for this limitation by supplying relevant information at inference time. We introduce a new benchmark to evaluate LLMs for one particular but critical aspect of diachronic change: language acquisition. To that end, we rewrite Winograd-style co-reference resolution problems by replacing a word with a new synthetic but plausible English word. The meaning of the word is given to the model in the prompt via a dictionary definition. We show that the accuracy of LLMs compared to the original Winograd tasks decreases radically in our benchmark and we believe this serves as a measure of progress for future models.
    Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
    Michiel de Jong
    Yury Zemlyanskiy
    10th International Conference on Learning Representations, ICLR 2022, Virtual Conference, April 25-29, 2022, OpenReview.net
    Natural language understanding tasks such as open-domain question answering often require retrieving and assimilating factual information from multiple sources. We propose to address this problem by integrating a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge. Specifically, our method represents knowledge as a "mention memory" containing a dense vector representation of every entity mention in a corpus. The Transformer model accesses the information through internal memory layers in which each entity mention in the passage being read attends to the mention memory. This approach enables synthesis of and reasoning over many disparate sources of information within a single Transformer model. In experiments using a memory of ~150 million Wikipedia mentions, our model provides strong improvements in performance on several open-domain knowledge-intensive tasks, including the claim verification benchmarks FEVER and HoVeR and several entity-based QA benchmarks. We also show that the model learns to attend to informative mentions without any direct supervision. Finally, we show that the model can be adapted to generalize to new unseen entities by updating the memory, without retraining.
    Adaptable and Interpretable Neural Memory Over Symbolic Knowledge
    Haitian Sun
    Pat Verga
    Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (2021), pp. 3678-3691
    Past research has demonstrated that large neural language models (LMs) encode surprising amounts of factual information: however, augmenting or modifying this information requires modifying a corpus and retraining, which is computationally expensive. To address this problem, we develop a neural LM that includes an interpretable neuro-symbolic KB in the form of a “fact memory”. Each element of the fact memory is formed from a triple of vectors, where each vector corresponds to a KB entity or relation. Our LM improves performance on knowledge-intensive question-answering tasks, sometimes dramatically, including a 27 point increase in one setting of WebQuestionsSP over a state-of-the-art open-book model, despite using 5% of the parameters. Most interestingly, we demonstrate that the model can be modified, without any re-training, by updating the fact memory.
    It is only a matter of time before facts become out of date: from the name of POTUS to the basketball team LeBron James plays for. This continuously limits the usefulness of previously collected datasets and language models (LMs) trained on them. This problem is exacerbated as LMs are used in the closed book question answering setting, where the pretraining data must contain the facts for the model to remember within its fixed parameters. A frequent paradigm is to update or refresh the dataset every so often, then retrain models with the new data: this is costly, but does it work? In this paper, we introduce a diagnostic dataset for probing LMs for factual knowledge that changes over time. Using it we show that models trained only on the most recent slice of data perform worse on questions about the past than models trained on uniform data across time, while being better on current and future questions. Moreover, we propose jointly modeling text with the time it was created and show that this improves memorization of previous facts, as well as reasoning about the uncertainty around future facts. We also show that models trained with temporal context allow for efficient refreshes as new data arrives without the need of retraining from scratch.
    MATE: Multi-view Attention for Table Transformer Efficiency
    Maharshi Gor
    Thomas Müller
    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics
    This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly with respect to speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state-of-the-art for three table reasoning datasets. For HybridQA (Chen et al., 2020b), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
    Although large neural language models (LMs) like BERT can be finetuned to yield state-of-the-art results on many NLP tasks, it is often unclear what these models actually learn. Here we study using such LMs to fill in entities in comparative questions, like “Which country is older, India or ___?”; that is, we study the ability of neural LMs to ask (not answer) reasonable questions. We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable, and that these models can be trained to achieve nearly human-level performance in completing comparative questions in three different sub-domains. However, analysis shows that what they learn fails to model any sort of broad notion of which entities are semantically comparable or similar; instead, the trained models are very domain-specific, and performance is highly correlated with co-occurrences between specific entities observed in the training set. This is true both for models that are pre-trained on general text corpora, as well as models trained on a large corpus of comparison questions. Our study thus reinforces recent results on the difficulty of making claims about a deep model’s world knowledge or linguistic competence based on performance on specific benchmark problems. We make our evaluation datasets publicly available to foster future research.