Sascha Rothe
Sascha Rothe is a Staff Research Scientist at Google. He received his Ph.D. from the University of Munich (LMU), where his research focused on word embeddings. Since joining Google, he has been working on various natural language generation problems, such as summarization and grammatical error correction. He is particularly interested in large language models and the new opportunities and challenges they present.
Authored Publications
Boosting Search Engines with Interactive Agents
Lasse Jesper Garding Espeholt
Leonard Adolphs
Michelle Chen Huebscher
Pier Giuseppe Sessa
Thomas Hofmann
Yannic Kilcher
Transactions on Machine Learning Research (2022)
This paper presents first successful steps in designing search agents that learn meta-strategies for iterative query refinement in information-seeking tasks. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that learns interactive search strategies from scratch. Our search agents obtain retrieval and answer quality performance comparable to recent neural methods, using only a traditional term-based BM25 ranking function and interpretable discrete reranking and filtering actions.
In this paper we introduce a Focus Attention Mechanism for two popular Seq2Seq architectures: RoBERTaS2S and Pegasus. Both use a Transformer-based encoder-decoder architecture: at each decoding step, the decoder learns a single contextual representation needed to predict the next token by attending to the input sequence and the sequence predicted so far. Focus attention takes inspiration from human-written text and augments this contextual representation through dynamic vocabulary biasing to proactively generate tokens that are similar or topical to the input sequence. When evaluated on the BBC extreme summarization task, both RoBERTaS2S and Pegasus with focus attention generate summaries that are more faithful to their input documents than their counterparts. Models with focus attention can holistically learn abstract-level properties embodied in the target texts, such as being mostly extractive, mostly abstractive, or text-editing only, without introducing any task-specific architectural priors. Finally, focus attention supports Focus Sampling, a technique for sampling topically relevant tokens to generate diverse yet topically consistent and faithful outputs.
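The core idea of dynamic vocabulary biasing can be illustrated with a toy sketch: add a topicality score to the decoder's next-token logits before the softmax, so input-topical tokens become more likely. The names, the additive form, and the fixed weight here are illustrative assumptions; the paper learns the bias jointly with the model.

```python
import numpy as np

def biased_next_token_logits(logits, topic_scores, weight=1.0):
    # Add a dynamic vocabulary bias so tokens topical to the input
    # become more likely (illustrative sketch, not the paper's exact form).
    return logits + weight * topic_scores

logits = np.array([2.0, 1.0, 0.0])  # plain decoder scores for 3 tokens
topic = np.array([0.0, 2.0, 0.0])   # token 1 is topical to the input
biased = biased_next_token_logits(logits, topic)
probs = np.exp(biased) / np.exp(biased).sum()
print(probs.argmax())  # → 1: the topical token now wins
```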
We propose a new model for grammatical error correction (GEC) which builds on a very large multilingual masked language model covering 101 languages. To adapt our model for the GEC task, we design an unsupervised, language-agnostic pretraining objective that mimics corrections typically contained in labeled data. After finetuning on gold data, we surpass the previous state-of-the-art results on the four evaluated languages (Czech, English, German, and Russian). This approach shows the power of large multilingual language models. Because these models are non-trivial to run on non-cluster infrastructure, we also employ our model to clean up the labels in the popular yet noisy Lang-8 dataset. We release this dataset and hope that the community will find it useful for further advancement of GEC.
Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
Transactions of the Association for Computational Linguistics, 8 (2020), pp. 264-280
Pre-trained neural networks have become widely successful in Natural Language Processing, but training these large models on unsupervised data is costly and often not feasible. We therefore concentrate on publicly available checkpoints. While most of them improve Natural Language Understanding, we investigate using these pre-trained models to initialize Transformer-based sequence-to-sequence models for Natural Language Understanding and Generation. Using these pre-trained models, we achieve new state-of-the-art results on machine translation, summarization, and sentence splitting/fusion.
We propose MASKER, an unsupervised text-editing method for style transfer. To tackle cases when no parallel source–target pairs are available, we train masked language models (MLMs) for both the source and the target domain. Then we find the text spans where the two models disagree the most in terms of likelihood. This allows us to identify the source tokens to delete to transform the source text to match the style of the target domain. The deleted tokens are then replaced using the target MLM, and by using a padded MLM variant we avoid having to predetermine the number of inserted tokens. Our experiments on sentence fusion and sentiment transfer demonstrate that MASKER performs competitively in a fully unsupervised setting. Moreover, in low-resource settings, it improves supervised methods' accuracy by over 10 percentage points when pre-training them on silver training data generated by MASKER.
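The span-selection step can be sketched with toy numbers: given per-token log-likelihoods under the source and target models, pick the span where the source model's likelihood most exceeds the target model's. This is a simplified illustration with precomputed scores; MASKER itself scores spans with two trained masked LMs.

```python
def most_disagreeing_span(src_ll, tgt_ll, max_len=3):
    # Find the token span (i, j) maximizing the likelihood gap between the
    # source-domain and target-domain models, i.e. the span to delete/replace.
    # src_ll, tgt_ll: per-token log-likelihoods (toy sketch, not the real MLM).
    n = len(src_ll)
    best, best_span = float("-inf"), None
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            gap = sum(src_ll[i:j]) - sum(tgt_ll[i:j])
            if gap > best:
                best, best_span = gap, (i, j)
    return best_span

# Token 2 is plausible under the source model but very unlikely
# under the target model, so it is the span to rewrite.
src = [-1.0, -1.0, -0.5, -1.0]
tgt = [-0.5, -0.5, -6.0, -0.5]
print(most_disagreeing_span(src, tgt))  # → (2, 3)
```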
We propose LaserTagger - a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
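The three edit operations are simple enough to sketch directly: a target text is reconstructed by walking over the source tokens and applying the predicted tag to each. The tag encoding below (`KEEP`, `DELETE`, and a `|phrase` suffix for an added phrase) is a hypothetical simplification of the paper's tag vocabulary; the actual model predicts these tags with a BERT encoder.

```python
def apply_tags(tokens, tags):
    # Reconstruct the target from source tokens and per-token edit tags.
    # A tag is "KEEP", "DELETE", or either of those with "|phrase" appended,
    # meaning: insert `phrase` before this (kept or deleted) token.
    out = []
    for token, tag in zip(tokens, tags):
        base, _, phrase = tag.partition("|")
        if phrase:
            out.append(phrase)
        if base == "KEEP":
            out.append(token)
    return " ".join(out)

# Sentence fusion example: merge two sentences with the phrase "and".
source = ["Turing", "was", "born", "in", "1912", ".",
          "He", "died", "in", "1954", "."]
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "DELETE",
        "DELETE|and", "KEEP", "KEEP", "KEEP", "KEEP"]
print(apply_tags(source, tags))  # → Turing was born in 1912 and died in 1954 .
```

Because the output is assembled from source tokens plus a small phrase vocabulary, decoding reduces to one tagging pass, which is where the reported inference speedup comes from.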
Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!
Katharina Kann
Proceedings of the 22nd Conference on Computational Natural Language Learning, Association for Computational Linguistics, Brussels, Belgium (2018), pp. 313-323
Motivated by recent findings on the probabilistic modeling of acceptability judgments, we propose syntactic log-odds ratio (SLOR), a normalized language model score, as a metric for referenceless fluency evaluation of natural language generation output at the sentence level. We further introduce WPSLOR, a novel WordPiece-based version, which harnesses a more compact language model. Even though word-overlap metrics like ROUGE are computed with the help of hand-written references, our referenceless methods obtain a significantly higher correlation with human fluency scores on a benchmark dataset of compressed sentences. Finally, we present ROUGE-LM, a reference-based metric which is a natural extension of WPSLOR to the case of available references. We show that ROUGE-LM yields a significantly higher correlation with human judgments than all baseline metrics, including WPSLOR on its own.
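SLOR normalizes a sentence's language model score by subtracting its unigram log-probability and dividing by its length, so rare words are not unfairly penalized. A minimal sketch, with toy log-probabilities standing in for real model scores:

```python
def slor(lm_logprob, unigram_logprobs):
    # SLOR(S) = (log p_LM(S) - log p_unigram(S)) / |S|
    # lm_logprob: total sentence log-probability under the language model.
    # unigram_logprobs: per-token unigram log-probabilities (toy values here).
    n = len(unigram_logprobs)
    return (lm_logprob - sum(unigram_logprobs)) / n

# A 4-token sentence with toy scores (not from a real model):
score = slor(lm_logprob=-12.0, unigram_logprobs=[-6.0, -2.0, -7.0, -5.0])
print(score)  # → (-12 - (-20)) / 4 = 2.0
```

A fluent sentence made of rare words gets a low LM score but also a low unigram score, so the two effects cancel and the metric isolates fluency.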
Users try to articulate their complex information needs during search sessions by reformulating their queries. To make this process more effective, search engines provide related queries to help users specify their information need during the search process.
In this paper, we propose a customized sequence-to-sequence model for session-based query suggestion. In our model, we employ a query-aware attention mechanism to capture the structure of the session context. This enables us to control the scope of the session from which we infer the suggested next query, which helps not only handle noisy data but also automatically detect session boundaries. Furthermore, we observe that, based on user query reformulation behavior, a large portion of the terms of a query in a session is retained from previously submitted queries in the same session and consists mostly of infrequent or unseen terms that are usually not included in the vocabulary. We therefore empower the decoder of our model to access the source words from the session context during decoding by incorporating a copy mechanism. Moreover, we propose evaluation metrics to assess the quality of generative models for query suggestion. We conduct an extensive set of experiments and analysis. The results suggest that our model outperforms the baselines both in generating queries and in scoring candidate queries for the task of query suggestion.
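A copy mechanism of this kind mixes two distributions at each decoding step: a generation distribution over the vocabulary and a copy distribution given by the attention over source tokens. The sketch below shows the mixture in its simplest form, assuming source tokens are in-vocabulary; the paper's exact parameterization (and its handling of out-of-vocabulary terms) may differ.

```python
import numpy as np

def copy_generate(p_vocab, attention, source_ids, p_gen):
    # p_final(w) = p_gen * p_vocab(w)
    #            + (1 - p_gen) * sum of attention over positions where w occurs.
    # p_vocab: (V,) generation probabilities; attention: (src_len,) weights;
    # source_ids: (src_len,) vocabulary ids of source tokens; p_gen in [0, 1].
    p_final = p_gen * p_vocab
    # Accumulate copy probability mass onto each source token's id
    # (np.add.at handles repeated ids correctly).
    np.add.at(p_final, source_ids, (1.0 - p_gen) * attention)
    return p_final

# Toy step: vocabulary of 5 words, session context contains words 2 and 4.
p_vocab = np.array([0.1, 0.2, 0.3, 0.3, 0.1])
attention = np.array([0.75, 0.25])
p = copy_generate(p_vocab, attention, source_ids=np.array([2, 4]), p_gen=0.8)
print(p.sum())  # still a valid distribution: sums to 1.0
```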
Learning to Learn from Weak Supervision by Full Supervision
Jaap Kamps
NIPS workshop on Meta-Learning (MetaLearn 2017)
In this paper, we propose a method for training neural networks when we have a large set of data with weak labels and a small amount of data with true labels. In our proposed model, we train two neural networks: a target network (the learner) and a confidence network (the meta-learner). The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to control the magnitude of the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model.
Making use of weak or noisy signals, such as the output of heuristic methods or user click-through data, for training deep neural networks is increasingly common, in particular for tasks where an adequate amount of data with true labels is not available. In a semi-supervised setting, we can use a large set of data with weak labels to pretrain a neural network and then fine-tune the parameters with a small amount of data with true labels. However, these two independent stages do not leverage the full capacity of the clean information from true labels during pretraining. In this paper, we propose a semi-supervised learning method where we train two neural networks in a multi-task fashion: a target network and a confidence network. The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to weight the gradient updates to the target network using the scores provided by the second confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model. We evaluate our learning strategy on two different tasks: document ranking and sentiment classification. The results demonstrate that our approach not only enhances performance compared to the baselines but also speeds up the learning process from weak labels.
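The weighting idea reduces to scaling each example's gradient contribution by the confidence score its weak label receives before the parameter update. A minimal sketch on a generic parameter vector, with hypothetical per-example gradients; the paper applies this inside full deep-network training:

```python
import numpy as np

def confidence_weighted_update(params, grads, confidences, lr=0.1):
    # Scale each example's gradient by the confidence network's score for
    # its weak label, so low-confidence (likely noisy) labels move the
    # target network less. Toy sketch, not the paper's full training loop.
    # grads: (batch, dim) per-example gradients; confidences: (batch,) in [0, 1].
    scaled = confidences[:, None] * grads
    return params - lr * scaled.mean(axis=0)

params = np.zeros(3)
grads = np.array([[1.0, 1.0, 1.0],    # example with a trusted weak label
                  [9.0, 9.0, 9.0]])   # example with a noisy weak label
conf = np.array([1.0, 0.1])           # scores from the confidence network
new_params = confidence_weighted_update(params, grads, conf)
print(new_params)  # the noisy example's large gradient is down-weighted
```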