Aliaksei Severyn
Text Generation with Text-Editing Models
Daniil Mirylenka
Jakub Adamek
Yue Dong
Proceedings of NAACL 2022, ACL
Text-editing models have recently become a prominent alternative to seq2seq models for monolingual natural language generation (NLG) tasks such as grammatical error correction, text simplification, and style transfer. These tasks exhibit a large amount of textual overlap between the source and target texts. Text-editing models take advantage of this trait and learn to generate the output by predicting edit operations applied to the source sequence, in contrast to seq2seq models that generate the output from scratch. Text-editing models provide several benefits over seq2seq models, including faster inference speed, higher sample efficiency, and better control and interpretability of the outputs. This tutorial provides a comprehensive overview of text-editing approaches and current state-of-the-art models, analyzing the pros and cons of different methods. We discuss challenges related to productionization and how these models can help mitigate hallucination and bias, both pressing challenges in the field of text generation.
We propose a new model for grammatical error correction (GEC) which builds on a very large multilingual masked language model, covering 101 languages. To adapt our model for the GEC task, we design an unsupervised, language-agnostic pretraining objective that mimics corrections typically contained in labeled data. After finetuning on gold data, we surpass the previous state-of-the-art results on the four evaluated languages (Czech, English, German and Russian). This approach shows the power of large multilingual language models. Due to these models being non-trivial to run on non-cluster infrastructure, we employ our model to clean up the labels in the popular yet noisy Lang-8 dataset. We release this dataset and hope that the community will find it useful for further advancement of GEC.
We propose MASKER, an unsupervised text-editing method for style transfer. To tackle cases when no parallel source–target pairs are available, we train masked language models (MLMs) for both the source and the target domain. Then we find the text spans where the two models disagree the most in terms of likelihood. This allows us to identify the source tokens to delete to transform the source text to match the style of the target domain. The deleted tokens are replaced with the target MLM, and by using a padded MLM variant, we avoid having to predetermine the number of inserted tokens. Our experiments on sentence fusion and sentiment transfer demonstrate that MASKER performs competitively in a fully unsupervised setting. Moreover, in low-resource settings, it improves supervised methods’ accuracy by over 10 percentage points when pre-training them on silver training data generated by MASKER.
We present FELIX, a flexible text-editing approach for generation, designed to derive maximum benefit from the ideas of decoding with bi-directional contexts and self-supervised pre-training. In contrast to conventional sequence-to-sequence (seq2seq) models, FELIX is efficient in low-resource settings and fast at inference time, while being capable of modeling flexible input-output transformations. We achieve this by decomposing the text-editing task into two sub-tasks: tagging, which decides on the subset of input tokens and their order in the output text, and insertion, which in-fills the output tokens that are not present in the input. The tagging model employs a novel Pointer mechanism, while the insertion model is based on a Masked Language Model. Both of these models are chosen to be non-autoregressive to guarantee faster inference. FELIX performs favourably when compared to recent text-editing methods and strong seq2seq baselines when evaluated on four NLG tasks: Sentence Fusion, Machine Translation Automatic Post-Editing, Summarization, and Text Simplification.
Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
Transactions of the Association for Computational Linguistics, vol. 8 (2020), pp. 264-280
Pre-trained neural networks have become widely successful in Natural Language Processing. Training these large models on unsupervised data is costly and often not feasible. We therefore concentrate on publicly available checkpoints. While most of them improve Natural Language Understanding, we investigate initializing Transformer-based sequence-to-sequence models with these pre-trained models for both Natural Language Understanding and Generation. Using these pre-trained models, we achieve new state-of-the-art results on Machine Translation, Summarization, and Sentence Splitting/Fusion.
Using Audio Transformations to Improve Comprehension in Voice Question Answering
Johanne R. Trippas
Hanna Silen
Damiano Spina
Crestani F. et al. (eds) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019, Springer, Cham, pp. 164-170
Many popular form factors of digital assistants (such as Amazon Echo, Apple HomePod, or Google Home) enable the user to hold a conversation with these systems based only on the speech modality. The lack of a screen presents unique challenges. To satisfy the information need of a user, the presentation of the answer needs to be optimized for such voice-only interactions. In this paper, we propose a task of evaluating the usefulness of audio transformations (i.e., prosodic modifications) for voice-only question answering. We introduce a crowdsourcing setup where we evaluate the quality of our proposed modifications along multiple dimensions corresponding to the informativeness, naturalness, and ability of the user to identify key parts of the answer. We offer a set of prosodic modifications that highlight potentially important parts of the answer using various acoustic cues. Our experiments show that some of these modifications lead to better comprehension at the expense of only slightly degraded naturalness of the audio.
The softmax function on top of a final linear layer is the de facto method to output probability distributions in neural networks. In many applications such as language models or text generation, these models have to produce distributions over large output vocabularies. Recently, this approach has been shown to have limited representational capacity due to its connection with the rank bottleneck in matrix factorization. However, little is known about the limitations of linear-softmax for quantities of practical interest such as cross entropy or mode estimation, a direction that is theoretically and empirically explored in this paper. As an efficient and effective solution to alleviate this issue, we propose to learn parametric monotonic functions on top of the logits. Theoretically, we show that such monotonic functions are likely to increase the rank of a matrix to its full rank. Empirically, our method improves over the traditional softmax-linear layer both in synthetic and real language model experiments with negligible time or memory overhead, while being comparable to the more computationally expensive mixture of softmax distributions.
We propose LaserTagger, a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
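The three edit operations named in this abstract can be illustrated with a small sketch. It only shows how a target text might be reconstructed once tags are given; it is not the released implementation, and the tag representation, the function name `apply_edits`, and the example sentence and tags are assumptions made here for illustration. Predicting the tags (the BERT encoder and Transformer decoder) is out of scope.

```python
from typing import List, Optional, Tuple

# Hypothetical tag format: (base_op, phrase).
# base_op is "KEEP" or "DELETE"; phrase, if present, is added before the token.
Tag = Tuple[str, Optional[str]]


def apply_edits(tokens: List[str], tags: List[Tag]) -> List[str]:
    """Rebuild the target token sequence from source tokens and per-token tags."""
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per source token")
    output: List[str] = []
    for token, (op, phrase) in zip(tokens, tags):
        if phrase:
            # "Adding a phrase before the token" happens regardless of KEEP/DELETE.
            output.extend(phrase.split())
        if op == "KEEP":
            output.append(token)
        elif op != "DELETE":
            raise ValueError(f"unknown operation: {op}")
    return output


# Illustrative sentence-fusion example (tags written by hand, not predicted).
source = "Turing was born in 1912 . Turing died in 1954 .".split()
tags: List[Tag] = (
    [("KEEP", None)] * 5          # Turing was born in 1912
    + [("DELETE", None)]          # drop the first full stop
    + [("DELETE", "and")]         # drop the second "Turing", add "and" before it
    + [("KEEP", None)] * 4        # died in 1954 .
)
fused = " ".join(apply_edits(source, tags))
print(fused)
```

Because every source token receives exactly one tag, the output length is bounded by the input length plus the inserted phrases, which is one reason tagging can be decoded quickly.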
In this paper, we study various flavors of variational autoencoders, address methodological issues in current neural text generation research, and close some gaps by answering a few natural questions about previously published studies.
Neural Ranking Models with Weak Supervision
Hamed Zamani
Jaap Kamps
W. Bruce Croft
Proceedings of The 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2017)
Despite the impressive improvements achieved by unsupervised deep neural networks in computer vision, natural language processing, and speech recognition tasks, such improvements have not generally been observed in ranking for information retrieval. The reason might be related to the complexity of the ranking problem, in the sense that it is not obvious how to learn from queries and documents when no supervised signal is available. Hence, in this paper, we propose to train a neural ranking model from a weak supervision signal, which is a training signal that can be obtained automatically without human labeling or any external resources (e.g., click data). To this aim, we use the output of a known unsupervised ranking model, such as BM25, as a weak supervision signal. We further train a set of simple yet effective ranking models based on feed-forward neural networks. We study their effectiveness under various learning scenarios (point-wise and pair-wise models) and using different input representations (i.e., from encoding query-document pairs into dense/sparse vectors to using word embedding representation). We train our network on 5 million unique queries obtained from the publicly available AOL query logs and two standard collections: a homogeneous news collection (Robust) and a heterogeneous large-scale web collection (ClueWeb). Our experiments indicate that feeding raw data to the networks and letting them learn representations for the input data leads to an impressive performance, with over 13% and 35% MAP improvements compared to the BM25 model on the Robust and the ClueWeb collections, respectively. Our findings suggest that neural ranking models can greatly benefit from large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.
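As a rough illustration of using BM25 output as a weak supervision signal, the sketch below scores a tiny hand-made collection with a common BM25 variant and turns the scores into pair-wise training labels. The parameter values `k1=1.2` and `b=0.75`, the hard 0/1 pair labels, the function names, and the toy documents are all assumptions made here; the abstract does not specify how the paper builds its training targets.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query (docs must be non-empty)."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def weak_pairs(query: List[str], docs: List[List[str]]) -> List[Tuple[int, int, float]]:
    """Emit (i, j, label) with label 1.0 if BM25 prefers doc i over doc j, else 0.0."""
    scores = bm25_scores(query, docs)
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        if scores[i] != scores[j]:  # ties carry no preference signal
            pairs.append((i, j, 1.0 if scores[i] > scores[j] else 0.0))
    return pairs


docs = [
    ["neural", "ranking", "model"],
    ["cooking", "pasta", "recipe"],
    ["ranking", "ranking", "web", "pages"],
]
query = ["ranking"]
scores = bm25_scores(query, docs)
pairs = weak_pairs(query, docs)
print(scores, pairs)
```

A pair-wise ranker could then be trained on such `(i, j, label)` triples without any human relevance judgments, which is the sense in which the labels are "weak".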
In this paper we explore the effect of architectural choices on Variational Autoencoder models for text.
In contrast to the previously introduced VAE model for text where both the encoder and decoder are RNNs, we propose a novel hybrid architecture that blends a fully feed-forward convolutional and deconvolutional component with a recurrent language model. This architecture exhibits several attractive properties, such as fast run time and the ability to better handle long sequences. More importantly, we demonstrate that our model helps to avoid some of the major difficulties posed by training VAE models on textual data.
Learning to Learn from Weak Supervision by Full Supervision
Jaap Kamps
NIPS workshop on Meta-Learning (MetaLearn 2017)
In this paper, we propose a method for training neural networks when we have a large set of data with weak labels and a small amount of data with true labels. In our proposed model, we train two neural networks: a target network (the learner) and a confidence network (the meta-learner). The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to control the magnitude of the gradient updates to the target network using the scores provided by the confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model.
Making use of weak or noisy signals, like the output of heuristic methods or user click-through data, for training deep neural networks is increasing, in particular for tasks where an adequate amount of data with true labels is not available. In a semi-supervised setting, we can use a large set of data with weak labels to pretrain a neural network and fine-tune the parameters with a small amount of data with true labels. However, these two independent stages do not leverage the full capacity of clean information from true labels during pretraining.
In this paper, we propose a semi-supervised learning method where we train two neural networks in a multi-task fashion: a target network and a confidence network. The target network is optimized to perform a given task and is trained using a large set of unlabeled data that are weakly annotated. We propose to weight the gradient updates to the target network using the scores provided by the confidence network, which is trained on a small amount of supervised data. This prevents weight updates computed from noisy labels from harming the quality of the target network model. We evaluate our learning strategy on two different tasks: document ranking and sentiment classification. The results demonstrate that our approach not only enhances the performance compared to the baselines but also speeds up the learning process from weak labels.
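The idea of weighting gradient updates by a confidence score can be sketched on a one-parameter model. In this sketch the confidence scores are fixed, hand-picked numbers standing in for the output of a confidence network; the learned confidence network and the multi-task training are not reproduced, and the function name, data, and learning rate are assumptions made for illustration.

```python
from typing import Sequence


def confidence_weighted_step(w: float, xs: Sequence[float], weak_ys: Sequence[float],
                             confidences: Sequence[float], lr: float = 0.1) -> float:
    """One gradient step for the model y = w * x under squared error.

    Each example's gradient is scaled by its confidence score, so weak labels
    that are judged unreliable contribute little or nothing to the update.
    """
    grad = 0.0
    for x, y, c in zip(xs, weak_ys, confidences):
        grad += c * (w * x - y) * x  # d/dw of 0.5 * (w*x - y)^2, scaled by c
    return w - lr * grad / len(xs)


def train(xs, weak_ys, confidences, steps: int = 200) -> float:
    w = 0.0
    for _ in range(steps):
        w = confidence_weighted_step(w, xs, weak_ys, confidences)
    return w


# Toy data: the true relation is y = 2x, but the third weak label is badly wrong.
xs = [1.0, 2.0, 3.0]
weak_ys = [2.0, 4.0, -6.0]

w_uniform = train(xs, weak_ys, confidences=[1.0, 1.0, 1.0])   # trusts every label
w_weighted = train(xs, weak_ys, confidences=[1.0, 1.0, 0.0])  # ignores the noisy one
print(w_uniform, w_weighted)
```

With uniform confidence the noisy label pulls the estimate far from 2, while down-weighting it lets the model recover the underlying relation from the remaining weak labels.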
Globally Normalized Transition-Based Neural Networks
Association for Computational Linguistics (2016)
We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.
This paper presents a novel approach to recurrent neural network (RNN) regularization. Unlike the widely adopted dropout method, which is applied to forward connections of feed-forward architectures or RNNs, we propose to drop neurons directly in recurrent connections in a way that does not cause loss of long-term memory. Our approach is as easy to implement and apply as the regular feed-forward dropout and we demonstrate its effectiveness for the most popular recurrent networks: vanilla RNNs, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Our experiments on three NLP benchmarks show consistent improvements even when combined with conventional feed-forward dropout.
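One way to drop units in the recurrent path without erasing long-term memory is to apply dropout only to the candidate update of an LSTM cell, leaving the carried-over cell state untouched. The sketch below assumes that reading; the abstract itself does not spell out the mechanism, and the helper names, the inverted-dropout scaling, and the example values are assumptions made here. Gate activations are passed in precomputed to keep the example short.

```python
import random
from typing import List


def dropout_mask(size: int, keep_prob: float, rng: random.Random) -> List[float]:
    """Inverted-dropout mask: kept units are scaled by 1/keep_prob (an assumption)."""
    return [1.0 / keep_prob if rng.random() < keep_prob else 0.0 for _ in range(size)]


def lstm_cell_update(c_prev: List[float], f: List[float], i: List[float],
                     g: List[float], mask: List[float]) -> List[float]:
    """Compute c_t = f * c_prev + i * (mask * g), element-wise.

    Only the candidate update g is masked; the term f * c_prev that carries
    long-term memory forward is never dropped.
    """
    return [fk * ck + ik * mk * gk
            for ck, fk, ik, gk, mk in zip(c_prev, f, i, g, mask)]


c_prev = [1.0, -2.0]
forget_gate = [1.0, 1.0]     # fully open forget gate, for illustration
input_gate = [0.5, 0.5]
candidate = [0.3, 0.3]

# Even if every candidate unit is dropped, the previous cell state survives.
all_dropped = lstm_cell_update(c_prev, forget_gate, input_gate, candidate, [0.0, 0.0])

mask = dropout_mask(2, keep_prob=0.5, rng=random.Random(0))
c_next = lstm_cell_update(c_prev, forget_gate, input_gate, candidate, mask)
print(all_dropped, c_next)
```

By contrast, masking the cell state or hidden state itself at every step would repeatedly zero out stored information over long sequences.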
Opinion Mining on YouTube
Olga Uryupina
Barbara Plank
Alessandro Moschitti
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL'14) (2014), pp. 1252-1261