Aurko Roy
Aurko Roy is currently a Research Scientist on the Google Brain team, where he works at the intersection of generative models, structured prediction, and natural language processing. He received his PhD in Algorithms, Combinatorics & Optimization from Georgia Tech in 2017.
Authored Publications
Efficient Content-Based Sparse Attention with Routing Transformers
Ashish Teku Vaswani
David Grangier
Mohammad Taghi Saffar
Transactions of the Association for Computational Linguistics (2021)
Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity have focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains of approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to O(n^1.5 d) from O(n^2 d) for sequence length n and hidden dimension d. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs. 18.3 perplexity), as well as on image generation on ImageNet-64 (3.43 vs. 3.44 bits/dim), while using fewer self-attention layers. Additionally, we set a new state of the art on the newly released PG-19 dataset, obtaining a test perplexity of 33.2 with a 22-layer Routing Transformer model trained on sequences of length 8192.
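As a rough illustration of the routing idea, the sketch below clusters queries and keys against a shared set of centroids and lets each query attend only to keys assigned to the same cluster. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the real model learns the centroids with an online k-means update inside each attention head, while here the centroids, shapes, and function names are all illustrative.

```python
import numpy as np

def routing_attention(q, k, v, centroids):
    """Attend only within clusters induced by nearest-centroid routing."""
    # assign each query and key to its nearest centroid
    q_cluster = np.argmin(((q[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    k_cluster = np.argmin(((k[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    out = np.zeros_like(v)
    for c in range(len(centroids)):
        qi = np.where(q_cluster == c)[0]
        ki = np.where(k_cluster == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue  # queries routed to an empty cluster keep a zero output
        # ordinary scaled dot-product attention, restricted to one cluster
        logits = q[qi] @ k[ki].T / np.sqrt(q.shape[-1])
        weights = np.exp(logits - logits.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 4)) for _ in range(3))
attended = routing_attention(q, k, v, centroids=rng.normal(size=(4, 4)))
```

With roughly sqrt(n) balanced clusters, each query attends to about sqrt(n) keys, which is where the O(n^1.5 d) complexity in the abstract comes from.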
There has been remarkable recent progress in factoid open-domain question answering (QA), where a short phrase or entity is sufficient to answer the question. A lot less work has been done in the more challenging task of long-form QA, where the goal is to generate elaborate, paragraph-long answers to more open-ended questions. In this work, we present a new system based on sparse attention and
contrastive retriever learning, which achieves state-of-the-art performance on ELI5, a popular long-form QA dataset in the KILT benchmark (Petroni et al. 2020).
However, a detailed analysis of our system reveals several concerning trends that are hampering progress in this important area: (1) little to no evidence that our model's generations are actually grounded in the retrieved documents, a desirable property not captured by the metrics in the KILT benchmark; (2) significant train/validation/test overlap in ELI5, with at least 75% of validation questions having a paraphrased counterpart in the training data; (3) significant issues with the popular evaluation metric ROUGE-L, with a very low margin of improvement (2-5 ROUGE-L) from trivial lower-bound baselines (like input copying) to upper-bound reference baselines; (4) the inherent difficulty of human evaluation in this task, due to the long length of generated answers and unfamiliarity with the topics.
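ROUGE-L, the metric criticized in point (3), scores a candidate against a reference by the length of their longest common subsequence (LCS). A minimal sketch of its F-measure form follows; the whitespace tokenization and the balanced weighting (beta = 1) are simplifying assumptions here, since standard toolkits typically weight recall more heavily.

```python
def lcs_len(a, b):
    # classic O(len(a) * len(b)) dynamic program for the longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    # precision/recall over whitespace tokens, combined as a balanced F-measure
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Because a copy of the input question already shares long subsequences with many reference answers, trivial baselines can land within a few ROUGE-L points of real systems, which is exactly the narrow margin the analysis highlights.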
Paraphrasing exemplifies the ability to abstract semantic content from surface forms. Recent work on automatic paraphrasing is dominated by methods leveraging Machine Translation (MT) as an intermediate step. This contrasts with humans, who can paraphrase without being bilingual. This work proposes to learn paraphrasing models from an unlabeled monolingual corpus only. To that end, we propose a residual variant of the vector-quantized variational auto-encoder. We compare with MT-based approaches on paraphrase identification, generation, and training augmentation. Monolingual paraphrasing outperforms unsupervised MT in all settings. Comparisons with supervised MT are more mixed: monolingual paraphrasing is interesting for identification and augmentation, while supervised MT is superior for generation.
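The residual variant of vector quantization mentioned above can be pictured as repeated nearest-neighbour lookup: each stage quantizes the residual left over by the previous stages. The sketch below is a hedged NumPy illustration, not the paper's trained model; `quantize` and `residual_quantize` are illustrative names, and real VQ-VAEs learn the codebooks jointly with an encoder and decoder.

```python
import numpy as np

def quantize(z, codebook):
    # map each row of z to its nearest codebook entry (Euclidean distance)
    idx = np.argmin(((z[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    return codebook[idx]

def residual_quantize(z, codebooks):
    # each stage quantizes whatever the previous stages failed to capture
    approx = np.zeros_like(z)
    residual = z
    for cb in codebooks:
        q = quantize(residual, cb)
        approx += q
        residual = residual - q
    return approx

z = np.array([[1.0, 0.0], [0.0, 1.0]])
exact = residual_quantize(z, [z.copy()])  # codebook containing z reconstructs it
```

Stacking stages lets a small set of codes per stage represent a combinatorially larger set of latent values than a single codebook of the same total size.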
Autoencoders provide a powerful framework for learning compressed representations by encoding all of the information needed to reconstruct a data point in a latent code. In some cases, autoencoders can “interpolate”: by decoding the convex combination of the latent codes for two data points, the autoencoder can produce an output which semantically mixes characteristics from the data points. In this paper, we propose a regularization procedure which encourages interpolated outputs to appear more realistic by fooling a critic network which has been trained to recover the mixing coefficient from interpolated data. We then develop a simple benchmark task where we can quantitatively measure the extent to which various autoencoders can interpolate, and show that our regularizer dramatically improves interpolation in this setting. We also demonstrate empirically that our regularizer produces latent codes which are more effective on downstream tasks, suggesting a possible link between interpolation abilities and learning useful representations.
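The pieces of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's training loop: the critic network is represented only by its predicted mixing coefficient, and the function names are hypothetical.

```python
import numpy as np

def interpolate(z1, z2, alpha):
    # convex combination of two latent codes, decoded to produce the interpolant
    return alpha * z1 + (1.0 - alpha) * z2

def critic_loss(predicted_alpha, alpha):
    # the critic is trained to recover the true mixing coefficient
    return float(np.mean((predicted_alpha - alpha) ** 2))

def interpolation_regularizer(predicted_alpha):
    # the autoencoder is penalized unless interpolants fool the critic into
    # predicting alpha = 0, i.e. "this looks like a real (non-mixed) data point"
    return float(np.mean(predicted_alpha ** 2))

z1, z2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = interpolate(z1, z2, 0.5)
```

The adversarial pressure is the same push-pull as in a GAN: the critic gets better at spotting mixtures, so the decoder must make mixtures look like real data.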
Fast Decoding in Sequence Models Using Discrete Latent Variables
Lukasz Kaiser
Ashish Vaswani
Niki J. Parmar
Samy Bengio
Jakob Uszkoreit
Noam Shazeer
ICML (2018)
Auto-regressive sequence models based on deep neural networks, such as RNNs, WaveNet, and the Transformer, are the state of the art on many tasks. However, they lack parallelism and are thus slow for long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and the Transformer are much more parallel during training but still lack parallelism during decoding. We present a method to extend sequence models using discrete latent variables that makes decoding much more parallel. The main idea behind this approach is to first autoencode the target sequence into a shorter discrete latent sequence, which is generated auto-regressively, and finally decode the full sequence from this shorter latent sequence in a parallel manner. We verify that our method works on the task of neural machine translation, where our models are an order of magnitude faster than comparable auto-regressive models. We also introduce a new method for constructing discrete latent variables that allows us to obtain good BLEU scores.
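The decoding scheme can be caricatured as follows: an autoregressive loop runs only over the m = n/c latent positions, and the final length-n output is produced in a single parallel step. Everything model-specific below is stubbed out with random stand-ins; only the control flow mirrors the approach described above.

```python
import numpy as np

def two_stage_decode(n, c, latent_vocab=8, seed=0):
    """Toy two-stage decode: sequential over m latents, parallel over n outputs."""
    rng = np.random.default_rng(seed)
    m = n // c                      # the shorter discrete latent sequence
    latents = np.empty(m, dtype=int)
    for i in range(m):              # sequential work: only m steps, not n
        # stand-in for sampling from a learned prior p(l_i | l_<i)
        latents[i] = rng.integers(latent_vocab)
    # stand-in for the parallel decoder: all n positions produced at once,
    # each conditioned on the full latent sequence
    return np.repeat(latents, c)

tokens = two_stage_decode(n=32, c=8)
```

With a compression factor c, the sequential bottleneck shrinks from n steps to n/c, which is the source of the order-of-magnitude speedup reported for translation.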
It is well known that for neural networks, it is possible to construct inputs which are misclassified by the network yet indistinguishable from true data points, known as “adversarial examples”. We propose a simple modification to standard neural network architectures, thermometer encoding, which significantly increases the robustness of the network to adversarial examples. We demonstrate this robustness with experiments on the MNIST, CIFAR-10, CIFAR-100, and SVHN datasets, and show that models with thermometer-encoded inputs consistently have higher accuracy on adversarial examples, while also maintaining the same accuracy on non-adversarial examples and training more quickly.
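Thermometer encoding itself is easy to sketch: quantize each input value into one of k levels, then replace it with a cumulative (“thermometer”) binary code rather than a one-hot code, so nearby values share most of their bits. A minimal version for inputs in [0, 1]; the level count and exact bit layout here are illustrative assumptions.

```python
import numpy as np

def thermometer_encode(x, levels=16):
    """Map values in [0, 1] to cumulative binary codes of length `levels`."""
    # quantize each value to an integer level in {0, ..., levels - 1}
    q = np.clip(np.floor(x * levels), 0, levels - 1).astype(int)
    thresholds = np.arange(levels)
    # bit j is on iff the quantized level is at least j (cumulative code)
    return (q[..., None] >= thresholds).astype(np.float32)

codes = thermometer_encode(np.array([0.0, 0.3, 0.99]), levels=4)
```

One intuition for the robustness gain is that the quantization step is piecewise constant, so the small input perturbations that gradient-based attacks rely on often leave the encoded input unchanged.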
We present a method to create universal, robust, targeted adversarial image patches in the real world. The patches are universal because they can be used to attack any scene, robust because they work under a wide variety of transformations, and targeted because they can cause a classifier to output any target class. These adversarial patches can be printed, added to any scene, photographed, and presented to image classifiers; even when the patches are small, they cause the classifiers to ignore the other items in the scene and report a chosen target class.
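A toy version of the patch-application step (not the attack optimization itself) simply composites a patch into an image at a random location. The real attack additionally samples random rotations, scales, and lighting, and optimizes the patch pixels through that sampling; the names and shapes below are illustrative.

```python
import numpy as np

def apply_patch(image, patch, rng):
    """Paste `patch` into `image` at a random location: a crude stand-in for
    the random transformations sampled while training a robust patch."""
    H, W, _ = image.shape
    h, w, _ = patch.shape
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    out = image.copy()
    out[top:top + h, left:left + w] = patch
    return out

rng = np.random.default_rng(0)
scene = np.zeros((32, 32, 3))      # placeholder background scene
sticker = np.ones((8, 8, 3))       # placeholder adversarial patch
patched = apply_patch(scene, sticker, rng)
```

Optimizing the patch over many such random placements is what makes the resulting sticker work from varied positions and viewpoints once printed.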