Yin-Wen Chang
Authored Publications
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Chulhee Yun
Advances in Neural Information Processing Systems (2020)
Transformer networks use pairwise attention to compute contextual embeddings of their inputs, and have achieved state-of-the-art performance in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ when computing attention in each layer. This has prompted recent research into faster attention models, with a predominant approach involving sparsifying the connections in the attention layers. While empirically promising for long sequences, several fundamental questions remain unanswered: Can sparse transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we provide a \emph{unifying framework} that captures existing sparse attention models. Our analysis proposes sufficient conditions under which we show that a sparse attention model can provably \emph{universally approximate} any sequence-to-sequence function. Surprisingly, our results show the existence of attention models with only $O(n)$ connections per attention layer that can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns and levels of sparsity on standard NLP tasks.
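For intuition, the sketch below illustrates one family of sparsity patterns that such a framework covers: a fixed-width sliding-window mask inside single-head scaled dot-product attention, so each of the $n$ query positions attends to at most $2w+1$ keys and the layer has $O(n)$ connections for fixed $w$. The helper names and the specific window pattern are illustrative assumptions, not the construction analyzed in the paper.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask where token i attends only to tokens within w positions
    of i (plus itself). Each row has at most 2*w + 1 True entries, so the
    total number of connections is O(n) for fixed w, versus n**2 for dense."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def sparse_attention(Q, K, V, mask):
    """Single-head scaled dot-product attention restricted to `mask`.
    Disallowed positions receive -inf scores before the softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d, w = 8, 4, 1
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
mask = sliding_window_mask(n, w)
out = sparse_attention(Q, K, V, mask)
print(mask.sum(), "connections instead of", n * n)
```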
Pre-training Tasks for Embedding-based Large-scale Retrieval
International Conference on Learning Representations (ICLR) (2020)
We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm not only desires high recall but also must be highly efficient, returning candidates in time sublinear in the number of documents. Unlike the scoring phase, which has recently seen significant advances from BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and cannot be optimized for different downstream tasks of interest. In this paper, we conduct a comprehensive study of embedding-based retrieval models. We show that the key ingredient in learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we study are the Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three.
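As a structural sketch of the embedding-based (dual-encoder) retrieval setup described above, the toy code below encodes queries and documents into a shared vector space and ranks documents by inner product. The encoder here is a random-embedding placeholder; in the setting of the abstract it would be a Transformer tower pre-trained with tasks such as ICT, BFS, and WLP. All function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 10_000, 128
emb_table = rng.standard_normal((VOCAB, DIM)) / np.sqrt(DIM)

def encode(tokens):
    """Stand-in encoder: average of (random) token embeddings, L2-normalized.
    In an embedding-based retrieval model this would be a learned Transformer
    tower rather than a fixed random table."""
    v = emb_table[[hash(t) % VOCAB for t in tokens]].mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve(query, doc_matrix, k=3):
    """Score every document by inner product with the query embedding and
    return the indices of the top-k candidates (exact search here; in practice
    an approximate nearest-neighbor index makes this sublinear in corpus size)."""
    scores = doc_matrix @ encode(query.split())
    return np.argsort(-scores)[:k]

docs = ["the cat sat on the mat",
        "neural retrieval with transformers",
        "bm25 uses token matching and tf-idf weights"]
doc_matrix = np.stack([encode(d.split()) for d in docs])
print(retrieve("transformer based retrieval", doc_matrix, k=2))
```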
Source-Side Left-to-Right or Target-Side Left-to-Right? An Empirical Comparison of Two Phrase-Based Decoding Algorithms
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1496–1500
This paper describes an empirical study of the phrase-based decoding algorithm proposed by Chang and Collins (2017). The algorithm produces a translation by processing the source-language sentence in strictly left-to-right order, differing from commonly used approaches that build the target-language sentence in left-to-right order. Our results show that the new algorithm is competitive with Moses (Koehn et al., 2007) in terms of both speed and BLEU scores.
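The toy snippet below only illustrates the difference in traversal order for a fixed derivation (a sequence of phrase pairs); it is not the Chang and Collins (2017) decoder or the Moses decoder. The data structure and function names are assumptions for illustration.

```python
# A derivation is listed in target order as (source_start, source_end, target_phrase).
# Reordering means the source spans are not visited left-to-right in this sequence.
derivation = [
    (2, 3, "tomorrow"),           # this source span is translated first in the target
    (0, 1, "I will fly"),
    (4, 5, "to the conference"),
]

def target_side_order(derivation):
    """Order in which a target-side left-to-right decoder grows the hypothesis:
    simply the order of the target phrases."""
    return [t for _, _, t in derivation]

def source_side_order(derivation):
    """Order in which a source-side left-to-right decoder processes phrases:
    by source start position, so the partially built target may contain gaps
    that are filled in later."""
    return [t for _, _, t in sorted(derivation, key=lambda p: p[0])]

print("target-side:", target_side_order(derivation))
print("source-side:", source_side_order(derivation))
```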
A Polynomial-Time Dynamic Programming Algorithm for Phrase-Based Decoding with a Fixed Distortion Limit
Transactions of the Association for Computational Linguistics (TACL), 5 (2017), pp. 59-71
Decoding of phrase-based translation models in the general case is known to be NP-complete, by a reduction from the traveling salesman problem (Knight, 1999). In practice, phrase-based systems often impose a hard distortion limit that restricts the movement of phrases during translation. However, the impact on complexity after imposing such a constraint is not well studied. In this paper, we describe a dynamic programming algorithm for phrase-based decoding with a fixed distortion limit. The runtime of the algorithm is $O(n \cdot d! \cdot l \cdot h^{d+1})$, where $n$ is the sentence length, $d$ is the distortion limit, $l$ is a bound on the number of phrases starting at any position in the sentence, and $h$ is related to the maximum number of target-language translations for any source word. The algorithm makes use of a novel representation that gives a new perspective on decoding of phrase-based models.
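To make the bound concrete, the sketch below shows one common formulation of the hard distortion-limit constraint referred to above and evaluates the stated $O(n \cdot d! \cdot l \cdot h^{d+1})$ runtime for sample parameter values. The function names are illustrative assumptions, and this is not the paper's dynamic programming algorithm.

```python
from math import factorial

def within_distortion_limit(prev_end, next_start, d):
    """One common formulation of a hard distortion limit: the jump from the
    end of the previously translated source phrase to the start of the next
    phrase may not exceed d source positions."""
    return abs(next_start - prev_end - 1) <= d

def runtime_bound(n, d, l, h):
    """Evaluate the O(n * d! * l * h^(d+1)) bound from the abstract for
    concrete parameters; for a fixed distortion limit d it is linear in n."""
    return n * factorial(d) * l * h ** (d + 1)

# A monotone step from a phrase ending at 3 to one starting at 4 is allowed;
# a jump of 6 positions exceeds a limit of d = 4.
print(within_distortion_limit(prev_end=3, next_start=4, d=4))   # True
print(within_distortion_limit(prev_end=3, next_start=10, d=4))  # False
print(runtime_bound(n=25, d=4, l=10, h=5))
```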