Ruiqi Guo
Authored Publications
On Emergence of Activation Sparsity in Trained Transformers
Zonglin Li
Chong You
Daliang Li
Ke Ye
International Conference on Learning Representations (2023) (to appear)
Preview abstract
This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, etc. Moreover, larger Transformers with more layers and higher MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. To probe why sparsity emerges, we design experiments with random labels, random images, and infinite data, and find that sparsity may be due primarily to optimization and has little to do with the properties of the training dataset. We discuss how sparsity immediately implies a means for significantly reducing the FLOP count and improving efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that explicitly enforcing an even sparser activation via Top-K thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of their prediction confidence.
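The two quantities in this abstract, the fraction of nonzero entries after ReLU and Top-K thresholding, can be illustrated with a minimal NumPy sketch. The function names, the random input, and the sizes below are assumptions chosen for illustration; this is not the paper's code, and random pre-activations will not reproduce the sparsity levels reported for trained models.

```python
import numpy as np

def relu(x):
    """Elementwise ReLU."""
    return np.maximum(x, 0.0)

def nonzero_fraction(activations):
    """Fraction of entries that are nonzero, over the whole array."""
    return float(np.count_nonzero(activations)) / activations.size

def top_k_threshold(activations, k):
    """Keep the k largest entries in each row and zero out the rest."""
    if k >= activations.shape[-1]:
        return activations.copy()
    # Indices of the k largest entries per row (in no particular order).
    idx = np.argpartition(activations, -k, axis=-1)[..., -k:]
    out = np.zeros_like(activations)
    kept = np.take_along_axis(activations, idx, axis=-1)
    np.put_along_axis(out, idx, kept, axis=-1)
    return out

# Hypothetical batch: 4 inputs, MLP hidden width 64.
rng = np.random.default_rng(0)
pre_activations = rng.standard_normal((4, 64))
act = relu(pre_activations)
sparse_act = top_k_threshold(act, k=8)

print("nonzero fraction after ReLU: ", nonzero_fraction(act))
print("nonzero fraction after Top-K:", nonzero_fraction(sparse_act))
```

Because only nonzero activations contribute to the second MLP matrix multiply, a lower nonzero fraction is what makes the FLOP reduction mentioned in the abstract possible.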
Preview abstract
The approximate nearest neighbor (ANN) search problem is fundamental to efficiently serving many real-world machine learning applications. A number of techniques have been developed for ANN search that are efficient, accurate, and scalable. However, such techniques typically have a number of parameters that affect the speed-recall tradeoff, and exhibit poor performance when such parameters aren't properly set. Tuning these parameters has traditionally been a manual process, demanding in-depth knowledge of the underlying search algorithm. This is becoming an increasingly unrealistic demand as ANN search grows in popularity. To tackle this obstacle to ANN adoption, this work proposes a constrained optimization-based approach to tuning quantization-based ANN algorithms. Our technique takes just a desired search cost or recall as input, and then generates tunings that, empirically, are very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.
SOAR: Improved Indexing for Approximate Nearest Neighbor Search
David Simcha
Dave Dopson
Neural Information Processing Systems (2023)
Preview abstract
This paper introduces SOAR: Spilling with Orthogonality-Amplified Residuals, a novel data indexing technique for approximate nearest neighbor (ANN) search. SOAR extends previous approaches to ANN search, such as spill trees, that utilize multiple redundant representations while partitioning the data to reduce the probability of missing a nearest neighbor during search. Rather than training and computing these redundant representations independently, however, SOAR uses an orthogonality-amplified residual loss, which optimizes each representation to compensate for cases where other representations perform poorly. This drastically improves the overall index quality, resulting in state-of-the-art ANN benchmark performance while maintaining fast indexing times and low memory consumption.
Preview abstract
Language models can be augmented with a context retriever to incorporate knowledge from large external databases. By leveraging retrieved context, the neural network does not have to memorize the massive amount of world knowledge within its internal parameters, leading to better parameter efficiency, interpretability and modularity. In this paper we examined a simple yet effective architecture for incorporating external context into language models based on a decoupled Encoder-Decoder architecture. We showed that such a simple architecture achieves competitive results on auto-regressive language modeling and open domain question answering tasks. We also analyzed the behavior of the proposed model, which performs grounded context transfer. Finally, we discussed the computational implications of such retrieval augmented models.
Efficient Training of Retrieval Models using Negative Cache
Erik Lindgren
Neural Information Processing Systems (2021)
Preview abstract
Factorized models, such as two-tower neural network models, are widely used for scoring (query, document) pairs in information retrieval tasks. These models are typically trained by optimizing the model parameters to score relevant "positive" pairs higher than the irrelevant "negative" ones. While a large set of negatives typically improves the model performance, limited computation and memory budgets place constraints on the number of negatives used during training. In this paper, we develop a novel negative sampling technique for accelerating training with softmax cross-entropy loss. By using cached (possibly stale) item embeddings, our technique enables training with a large pool of negatives with reduced memory and computation. We also develop a streaming variant of our algorithm geared towards very large datasets. Furthermore, we establish a theoretical basis for our approach by showing that updating a very small fraction of the cache at each iteration can still ensure fast convergence. Finally, we experimentally validate our approach and show that it is efficient and compares favorably with more complex, state-of-the-art approaches.
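As a rough illustration of the general idea (scoring a positive against cached, possibly stale negatives, and refreshing only a small part of the cache per step), here is a simplified NumPy sketch. It is not the paper's sampling algorithm: the function names, the uniform choice of which cache rows to refresh, and the use of the whole cache as negatives are all assumptions made for this example.

```python
import numpy as np

def softmax_loss_with_cache(query, positive, cache):
    """Softmax cross-entropy where the negatives are the cached item embeddings.

    query:    (d,) query embedding
    positive: (d,) embedding of the relevant item
    cache:    (n, d) cached, possibly stale, item embeddings used as negatives
    """
    logits = np.concatenate(([query @ positive], cache @ query))
    logits = logits - logits.max()  # shift for numerical stability
    # Negative log-probability of the positive (index 0).
    return float(-logits[0] + np.log(np.exp(logits).sum()))

def refresh_cache(cache, fresh_embeddings, fraction, rng):
    """Overwrite a random fraction of cache rows with freshly computed embeddings.

    fresh_embeddings[i] is assumed to be the up-to-date embedding of the item
    stored in cache[i]; a real system would recompute only the chosen rows.
    """
    n_refresh = max(1, int(fraction * len(cache)))
    slots = rng.choice(len(cache), size=n_refresh, replace=False)
    updated = cache.copy()
    updated[slots] = fresh_embeddings[slots]
    return updated
```

A training loop under these assumptions would call `softmax_loss_with_cache` for each query and `refresh_cache` once per iteration with a small `fraction`.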
Accelerating Large-Scale Inference with Anisotropic Vector Quantization
Erik Lindgren
Quan Geng
David Simcha
International Conference on Machine Learning (2020)
Preview abstract
Quantization-based techniques are the current state of the art for scaling maximum inner product search to massive databases. Traditional approaches to quantization aim to minimize the reconstruction error of the database points. Based on the observation that for a given query, the database points that have the largest inner products are more relevant, we develop a family of anisotropic quantization loss functions. Under natural statistical assumptions, we show that quantization with these loss functions leads to a new variant of vector quantization that more greatly penalizes the parallel component of a datapoint's residual relative to its orthogonal component. The proposed approach, whose implementation is open-source, achieves state-of-the-art results on the public benchmarks available at ann-benchmarks.com.
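The split of a residual into parallel and orthogonal components can be sketched as follows. This is a minimal illustration, assuming a nonzero datapoint and two caller-supplied weights; the names `parallel_weight` and `orthogonal_weight` are hypothetical, and the abstract does not say how the paper derives the weights.

```python
import numpy as np

def anisotropic_loss(x, x_quantized, parallel_weight, orthogonal_weight):
    """Weighted squared error that treats the residual's two components differently.

    The residual r = x - x_quantized is split into the part parallel to x
    and the part orthogonal to x. Setting parallel_weight > orthogonal_weight
    penalizes the parallel part more heavily. Assumes x is nonzero.
    """
    x = np.asarray(x, dtype=float)
    r = x - np.asarray(x_quantized, dtype=float)
    # Projection of the residual onto the direction of x.
    r_parallel = (np.dot(r, x) / np.dot(x, x)) * x
    r_orthogonal = r - r_parallel
    return (parallel_weight * np.dot(r_parallel, r_parallel)
            + orthogonal_weight * np.dot(r_orthogonal, r_orthogonal))
```

With equal weights this reduces to a scaled version of the ordinary squared reconstruction error, which is the traditional objective the abstract contrasts against.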
Preview abstract
We characterize the minimum noise amplitude and power for noise-adding mechanisms in $(\epsilon, \delta)$-differential privacy for a single real-valued query function. We derive new lower bounds using the duality of linear programming, and new upper bounds by proposing a new class of $(\epsilon, \delta)$-differentially private mechanisms, the truncated Laplacian mechanisms. We show that the multiplicative gap between the lower and upper bounds goes to zero in various high privacy regimes, proving the tightness of the bounds and thus establishing the optimality of the truncated Laplacian mechanism. In particular, our results close the previous constant multiplicative gap in the discrete setting. Numerical experiments show the improvement of the truncated Laplacian mechanism over the optimal Gaussian mechanism in all privacy regimes.
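To make "truncated Laplacian" concrete, the sketch below draws noise from a Laplace density restricted to a bounded interval, using inverse-CDF sampling. The `scale` and `bound` arguments are placeholders: the abstract does not give the values that yield a particular $(\epsilon, \delta)$ guarantee, so this code illustrates only the shape of the distribution and provides no privacy guarantee by itself.

```python
import numpy as np

def sample_truncated_laplacian(scale, bound, size, rng):
    """Draw samples with density proportional to exp(-|x| / scale) on [-bound, bound].

    scale and bound are caller-supplied placeholders; choosing them to meet
    a given privacy target is outside the scope of this sketch.
    """
    u = rng.random(size)  # uniform on [0, 1)
    # Inverse CDF of an exponential distribution truncated to [0, bound].
    magnitude = -scale * np.log1p(-u * (1.0 - np.exp(-bound / scale)))
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * magnitude

rng = np.random.default_rng(0)
noise = sample_truncated_laplacian(scale=1.0, bound=3.0, size=100_000, rng=rng)
```

Unlike the ordinary Laplace or Gaussian distributions, every sample here has magnitude at most `bound`.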
Optimal Noise-Adding Mechanism in Additive Differential Privacy
Quan Geng
Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) (2019)
Preview abstract
We derive the optimal $(0, \delta)$-differentially private query-output independent noise-adding mechanism for a single real-valued query function under a general cost-minimization framework. Under a mild technical condition, we show that the optimal noise probability distribution is a uniform distribution with a probability mass at the origin. We explicitly derive the optimal noise distribution for general $\ell^p$ cost functions, including $\ell^1$ (for noise magnitude) and $\ell^2$ (for noise power) cost functions, and show that the probability concentration on the origin occurs when $\delta > \frac{p}{p+1}$. Our result demonstrates an improvement over the existing Gaussian mechanisms by factors of two and three for $(0,\delta)$-differential privacy in the high privacy regime in the context of minimizing the noise magnitude and noise power, respectively, and the gain is more pronounced in the low privacy regime. Our result is consistent with the existing result for $(0,\delta)$-differential privacy in the discrete setting, and identifies a probability concentration phenomenon in the continuous setting.
Efficient Natural Language Response Suggestion for Smart Reply
Matthew Henderson
Rami Al-Rfou
Brian Strope
László Lukács
Ray Kurzweil
ArXiv e-prints (2017)
Preview abstract
This paper presents a computationally efficient machine-learned method for natural language response suggestion. Feed-forward neural networks using n-gram embedding features encode messages into vectors which are optimized to give message-response pairs a high dot-product value. An optimized search finds response suggestions. The method is evaluated in a large-scale commercial e-mail application, Inbox by Gmail. Compared to a sequence-to-sequence approach, the new system achieves the same quality at a small fraction of the computational requirements and latency.
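The scoring step described here, ranking candidate responses by dot product with the message vector, can be sketched in a few lines. The vectors are assumed to come from already-trained encoders, the function name is hypothetical, and the exhaustive scan below stands in for the paper's optimized search.

```python
import numpy as np

def top_responses(message_vec, response_vecs, n):
    """Return indices and scores of the n responses with the highest dot product.

    message_vec:   (d,) encoded message
    response_vecs: (num_responses, d) encoded candidate responses
    """
    scores = response_vecs @ message_vec
    order = np.argsort(-scores)[:n]  # highest score first
    return order, scores[order]
```

Because the response vectors do not depend on the incoming message, they can be computed once and reused, leaving only the message encoding and the search to run per request.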
Now Playing: Continuous low-power music recognition
Dominik Roblek
James David Lyon
Julian James Odell
Mihajlo Velimirović
NIPS 2017 Workshop: Machine Learning on the Phone
Preview abstract
Existing music recognition applications require both user activation and a connection to a server that performs the actual recognition. In this paper we present a low-power music recognizer that runs entirely on a mobile phone and automatically recognizes music without requiring any user activation. A small music detector runs continuously on the mobile phone's DSP (digital signal processor) chip and only wakes the main processor when it is confident that music is present. Once woken, the detector on the main processor is provided with an 8 s buffer of audio, which is then fingerprinted and compared to the stored fingerprints in the on-device fingerprint database of over 70,000 songs.