Jump to Content
Felix X. Yu

Felix X. Yu

I work on large-scale machine learning at Google. More info can be found at www.felixyu.org.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract This paper reveals a curious observation that modern large-scale machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by ``sparse'' we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, etc. Moreover, larger Transformers with more layers and higher MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. To probe why sparsity emerges, we design experiments with random labels, random images, and infinite data, and find that sparsity may be due primarily to optimization while has little to do with the properties of training dataset. We discuss how sparsity immediately implies a means for significantly reducing the FLOP count and improving efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that explicitly enforcing an even sparser activation via Top-K thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence. View details
    Serving Graph Compression for Graph Neural Networks
    Cho-Jui Hsieh
    International Conference on Learning Representations (2023) (to appear)
    Preview abstract Serving a GNN model in online applications is challenging --- one has to propagate the information from training nodes to testing nodes to achieve the best performance, while storing the whole training set (including training graph and node features) during inference time is prohibitive for most of the real world applications. We tackle this serving space compression problem in the paper, where the goal is to compress the storage requirement for GNN serving. Given a model to be served, the proposed method constructs a small set of virtual representative nodes to replace the original training nodes, so that users just need to replace the original training set by this virtual representative set to reduce the space requirement for serving, without the need of changing the actual GNN model and the forward pass. We carefully analyze the error in the forward pass and derive simple ways to construct the node features and graph of virtual representative nodes to minimize the approximation error. Experimental results demonstrate that the proposed method can significantly reduce the serving space requirement for GNN inference. View details
    Preview abstract Negative sampling is a widely adopted technique to enable efficient training in settings with a large number of classes. Typically, negative sampling approaches aim at approximating the value or gradient of the computationally expensive loss function that takes all the negative labels into account. In this work, we study the connection between negative sampling approaches and loss modification techniques for countering label imbalance. We show that different (bias) correction strategies that accompany negative sampling approaches can have unintended consequences on the model's performance on various data sub-populations. We then propose a unified approach to tackle both sampling bias, arising from working with a subset of all negative classes, and labeling bias, which is inherently present in the data due to label-imbalance. Finally, we verify our analysis and demonstrate the utility of our unified approach through empirical evaluation on standard image classification and retrieval benchmarks. View details
    Preview abstract Knowledge distillation is an approach to improve the performance of a student model by using the knowledge of a complex teacher. Despite its success in several deep learning applications, the study of distillation is mostly confined to classification settings. In particular, the use of distillation in top-k ranking settings, where the goal is to rank k most relevant items correctly, remains largely unexplored. In this paper, we study such ranking problems through the lens of distillation. We present a framework for distillation for top-k ranking and establish connections with the existing ranking methods. The core idea of this framework is to preserve the ranking at the top by matching the k largest scores of student and teacher while penalizing large scores for items ranked low by the teacher. Building on our framework, we develop a novel distillation approach, RankDistil, specifically catered towards ranking problems with a large number of items to rank. Finally, we conduct experiments which demonstrate that RankDistil yields benefits over commonly used baselines for ranking problems. View details
    A Field Guide to Federated Optimization
    Jianyu Wang
    Gauri Joshi
    Maruan Al-Shedivat
    Galen Andrew
    A. Salman Avestimehr
    Katharine Daly
    Deepesh Data
    Suhas Diggavi
    Hubert Eichner
    Advait Gadhikar
    Antonious M. Girgis
    Filip Hanzely
    Chaoyang He
    Samuel Horvath
    Martin Jaggi
    Tara Javidi
    Sai Praneeth Karimireddy
    Jakub Konečný
    Sanmi Koyejo
    Tian Li
    Peter Richtarik
    Virginia Smith
    Mahdi Soltanolkotabi
    Weikang Song
    Sebastian Stich
    Ameet Talwalkar
    Hongyi Wang
    Blake Woodworth
    Honglin Yuan
    Mi Zhang
    Tong Zhang
    Chunxiang (Jake) Zheng
    Chen Zhu
    arxiv (2021)
    Preview abstract Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications. View details
    Preview abstract We consider learning a multi-class classification model in the federated setting, where each user has access to the positive data associated with only a single class. As a result, during each federated learning round, the users need to locally update the classifier without having access to the features and the model parameters for the negative classes. Thus, naively employing conventional decentralized learning such as the distributed SGD or Federated Averaging may lead to trivial or extremely poor classifiers. In particular, for the embedding based classifiers, all the class embeddings might collapse to a single point. To address this problem, we propose a generic framework for training with only positive labels, namely Federated Averaging with Spreadout (FedAwS), where the server imposes a geometric regularizer after each round to encourage classes to be spreadout in the embedding space. We show, both theoretically and empirically, that FedAwS can almost match the performance of conventional learning where users have access to negative labels. We further extend the proposed method to the settings with large output spaces. View details
    Preview abstract We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm not only desires high recall but also requires to be highly efficient, returning candidates in time sublinear to the number of documents. Unlike the scoring phase witnessing significant advances recently due to the BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and can not be optimized for different downstream tasks of interest. In this paper, we conduct a comprehensive study on the embedding-based retrieval models. We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three. View details
    Preview abstract Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transformers forget specific old facts and memorize new ones. In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}. This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models. We benchmarked several approaches that provide natural baseline performances on this task. This leads to the discovery of key components of a Transformer model that are especially effective for knowledge modifications. The work also provides insights into the role that different training phases (such as pretraining and fine-tuning) play towards memorization and knowledge modification. View details
    Preview abstract Label smoothing has been shown to be an effective regularization strategy in classification, that prevents overfitting and helps in label de-noising. However, extending such methods directly to seq2seq settings, such as Machine Translation, has been hindered by the large target output space, making it intractable to apply label smoothing over all possible outputs. Most existing approaches for seq2seq settings either do token level smoothing, or smooth over sequences generated by randomly substituting tokens in the target sequence. Unlike these works, in this paper, we propose a technique that smooths over \emph{well formed} relevant sequences that not only have sufficient n-gram overlap with the target sequence, but are also \emph{semantically similar}. Our method shows a consistent and significant improvement over the state-of-the-art techniques on different datasets. View details
    Preview abstract We consider the problem of retrieving the most relevant labels for a given input when the size of the output space is very large. Retrieval methods are modeled as set-valued classifiers which output a small set of classes for each input, and a mistake is made if the label is not in the output set. Despite its practical importance, a statistically principled, yet practical solution to this problem is largely missing. To this end, we first define a family of surrogate losses and show that they are calibrated and convex under certain conditions on the loss parameters and data distribution, thereby establishing a statistical and analytical basis for using these losses. Furthermore, we identify a particularly intuitive class of loss functions in the aforementioned family and show that they are amenable to practical implementation in the large output space setting (i.e. computation is possible without evaluating scores of all labels) by developing a technique called Stochastic Negative Mining. We also provide generalization error bounds for the losses in the family. Finally, we conduct experiments which demonstrate that Stochastic Negative Mining yields benefits over commonly used negative sampling approaches. View details
    Preview abstract Linear encoding of sparse vectors is widely popular, but is commonly data-independent -- missing any possible extra (but a-priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used ℓ1 decoder. The convex ℓ1 decoder prevents gradient propagation as needed in standard gradient-based training. Our method is based on the insight that unrolling the convex decoder into T projected subgradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing measurement matrix. We compare the empirical performance of 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). Our experiments show that there is indeed additional structure beyond sparsity in the real datasets. Our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods. We illustrate an application of our method in learning label embeddings for extreme multi-label classification. Our experiments show that our method is able to match or outperform the precision scores of SLEEC, which is one of the state-of-the-art embedding-based approaches for extreme multi-label learning. View details
    Sampled softmax with random fourier features
    Jiecao (Jack) Chen
    Advances in Neural Information Processing Systems (NeurIPS) (2019)
    Preview abstract The computational cost of training with softmax cross entropy loss grows linearly with the number of classes. For the settings where a large number of classes are involved, a common method to speed up training is to sample a subset of classes and utilize an estimate of the gradient based on these classes, known as the \emph{sampled softmax} method. However, the sampled softmax provides a biased estimate of the gradient unless the samples are drawn from the exact softmax distribution, which is again expensive to compute. Therefore, a widely employed practical approach (without theoretical justification) involves sampling from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of the role that different sampling distributions play in determining the quality of sampled softmax. Motivated by our analysis and the work on kernel-based sampling, we propose the {\em Random Fourier Softmax} (RF-softmax) method that utilizes the powerful Random Fourier features to enable more efficient and accurate sampling from the (approximate) softmax distribution. We show that RF-softmax leads to low biased estimation in terms of both the full softmax distribution and the full softmax gradient. Furthermore, the cost of RF-softmax scales only logarithmically with the number of classes. View details
    Dual Decomposition for Fast Learning in Large Output Spaces
    Ian Yen
    Dan Holtmann-Rice
    Pradeep Ravikumar
    International Conference on Machine Learning (2018)
    Preview abstract For problems with large output spaces, evaluation of the loss function and its gradient are expensive, typically taking linear time in the size of the output space. Recently, methods have been developed to speed up learning via efficient data structures for Nearest-Neighbor Search (NNS) or Maximum Inner-Product Search (MIPS). However, the performance of such data structures typically degrades in high dimensions. In this work, we propose a novel technique to reduce the intractable high dimensional search problem to several much more tractable lower dimensional ones via dual decomposition of the loss function. At the same time, we demonstrate guaranteed convergence to the original loss via a greedy message passing procedure. In our experiments on multiclass and multilabel classification with hundreds of thousands of classes, as well as training skip-gram word embeddings with a vocabulary size of half a million, our technique consistently improves the accuracy of search-based gradient approximation methods and outperforms sampling-based gradient approximation methods by a large margin. View details
    On binary embedding using circulant matrices
    Aditya Bhaskara
    Yunchao Gong
    Shih-Fu Chang
    JMLR (2018)
    Preview abstract Binary embeddings provide efficient and powerful ways to perform operations on large scale data. However binary embedding typically requires long codes in order to preserve the discriminative power of the input space. Thus binary coding methods traditionally suffer from high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure allows us to use Fast Fourier Transform algorithms to speed up the computation. For obtaining k-bit binary codes from d-dimensional data, our method improves the time complexity from O(dk) to O(dlogd), and the space complexity from O(dk) to O(d). We study two settings, which differ in the way we choose the parameters of the circulant matrix. In the first, the parameters are chosen randomly and in the second, the parameters are learned using the data. For randomized CBE, we give a theoretical analysis comparing it with binary embedding using an unstructured random projection matrix. The challenge here is to show that the dependencies in the entries of the circulant matrix do not lead to a loss in performance. In the second setting, we design a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternatively minimizes the objective in original and Fourier domains. In both the settings, we show by extensive experiments that the CBE approach gives much better performance than the state-of-the-art approaches if we fix a running time, and provides much faster computation with negligible performance degradation if we fix the number of bits in the embedding. View details
    Preview abstract Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For d variables and n \approx d clients, the proposed method uses \cO(\log \log(nd)) bits of communication per client per coordinate and ensures constant privacy. We also improve previous analysis of the \emph{Binomial mechanism} showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest. View details
    Preview abstract Recurrent neural network language models (RNNLM) and Long Short Term Memory (LSTM) LMs, a variant of RNN LMs, have been shown to outperform traditional N-gram LMs on speech recognition tasks. However, these models are computationally more expensive than N-gram LMs for decoding, and thus, challenging to integrate into speech recognizers. Recent research has proposed the use of lattice-rescoring algorithms using RNNLMs and LSTMLMs as an efficient strategy to integrate these models into a speech recognition system. In this paper, we evaluate existing lattice rescoring algorithms along with a few of our own novel variants on a Youtube speech recognition task. Lattice rescoring using LSTMLMs reduces the word error rate (WER) for this task by about 6\% relative to the WER obtained using an N-gram LM. View details
    Preview abstract Robust covariant local feature detectors are important for detecting local features that are (1) discriminative of the image content and (2) can be repeatably detected at consistent locations when the image undergoes diverse transformations. Such detectors are critical for applications such as image search and scene reconstruction. Many learning-based local feature detectors address one of these two problems while overlooking the other. In this work, we propose a novel learning-based method to simultaneously address both issues. Specifically, we extend the previously proposed covariant constraint by defining the concepts of “standard patch” and “canonical feature” and leverage these to train a novel robust covariant detector. We show that the introduction of these concepts greatly simplifies the learning stage of the covariant detector, and also makes the detector much more robust. Extensive experiments show that our method outperforms previous handcrafted and learning-based detectors by large margins in terms of repeatability. View details
    Preview abstract We propose a simple, yet powerful regularization technique that can be used to significantly improve both the pairwise and triplet losses in learning local feature descriptors. The idea is that in order to fully utilize the expressive power of the descriptor space, good local feature descriptors should be sufficiently “spread-out” over the space. In this work, we propose a regularization term to maximize the spread in feature descriptor inspired by the property of uniform distribution. We show that the proposed regularization with triplet loss outperforms existing Euclidean distance based descriptor learning techniques by a large margin. As an extension, the proposed regularization technique can also be used to improve image-level deep feature embedding. View details
    Preview abstract We propose a multiscale quantization approach for fast similarity search on large, high-dimensional datasets. The key insight of the approach is that quantization methods, in particular product quantization, perform poorly when there is large variance in the norms of the data points. This is a common scenario for real-world datasets, especially when doing product quantization of residuals obtained from coarse vector quantization. To address this issue, we propose a multiscale formulation where we learn a separate scalar quantizer of the residual norm scales. All parameters are learned jointly in a stochastic gradient descent framework to minimize the overall quantization error. We provide theoretical motivation for the proposed technique and conduct comprehensive experiments on two large-scale public datasets, demonstrating substantial improvements in recall over existing state-of-the-art methods. View details
    Preview abstract Motivated by the need for distributed optimization algorithms with low communication cost, we study communication efficient algorithms to perform distributed mean estimation. We study the scenarios in which each client sends one bit per dimension. We first show that for d dimensional data with n clients, a naive stochastic rounding approach yields a mean squared error Theta(d/n). We then show by applying a structured random rotation of the data (an O(dlogd) algorithm), the error can be reduced to O(logd/n). The methods we show in this paper do not depend on the distribution of the data. View details
    Preview abstract We introduce a novel, data-driven way for predicting battery consumption of apps. The state-of-the-art models used to blame battery consumption on apps are based on micro-benchmark experiments. These experiments are carried out on controlled setups where one can measure how much battery is consumed by each internal resource (CPU, bluetooth, WiFi...). The battery blame allocated to an app is simply the sum of the blames of the resources consumed by the app. We argue that this type of models do not capture the way phones work "in the wild" and propose instead to train a regression model using data collected from logs. We show that this type of learning is correct in the sense that under some assumptions, we can recover the true battery discharge rate of each component. We present experimental results where we consistently do better predictions than a model trained on micro-benchmarks. View details
    Preview abstract Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model with training data distributed over a large number of clients each with unreliable and relatively slow network connections. We consider learning algorithms for this setting where on each round, each client independently computes an update to the current model based on its local data, and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model. The typical clients in this setting are mobile phones, and communication efficiency is of utmost importance. In this paper, we propose two ways to reduce the uplink communication costs. The proposed methods are evaluated on the application of training a deep neural network to perform image classification. Our best approach reduces the upload communication required to train a reasonable model by two orders of magnitude. View details
    Preview abstract We present an intriguing discovery related to Random Fourier Features: in Gaussian kernel approximation, replacing the random Gaussian matrix by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for this behavior. Motivated by this discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from O(d^2) to O(dlogd), where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of applications. View details
    Spherical Random Features for Polynomial Kernels
    Jeffrey Pennington
    Neural Information Processing Systems (NIPS) (2015)
    Preview abstract Compact explicit feature maps provide a practical framework to scale kernel methods to large-scale learning, but deriving such maps for many types of kernels remains a challenging open problem. Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials. Meanwhile, because polynomial kernels are unbounded, they are frequently applied to data that has been normalized to unit l2 norm. The question we address in this work is: if we know a priori that data is normalized, can we devise a more compact map? We show that a putative affirmative answer to this question based on Random Fourier Features is impossible in this setting, and introduce a new approximation paradigm, Spherical Random Fourier (SRF) features, which circumvents these issues and delivers a compact approximation to polynomial kernels for data on the unit sphere. Compared to prior work, SRF features are less rank-deficient, more compact, and achieve better kernel approximation, especially for higher-order polynomials. The resulting predictions have lower variance and typically yield better classification accuracy. View details
    An Exploration of Parameter Redundancy in Deep Networks with Circulant Projections
    Yu Cheng
    Rogerio Feris
    Shih-Fu Chang
    International Conference on Computer Vision (ICCV) (2015)
    Preview abstract We explore the redundancy of parameters in deep neural networks by replacing the conventional linear projection in fully-connected layers with the circulant projection. The circulant structure substantially reduces memory footprint and enables the use of the Fast Fourier Transform to speed up the computation. Considering a fully-connected neural network layer with d input nodes, and d output nodes, this method improves the time complexity from O(d^2) to O(dlogd) and space complexity from O(d^2) to O(d). The space savings are particularly important for modern deep convolutional neural network architectures, where fully-connected layers typically contain more than 90% of the network parameters. We further show that the gradient computation and optimization of the circulant projections can be performed very efficiently. Our experiments on three standard datasets show that the proposed approach achieves this significant gain in storage and efficiency with minimal increase in error rate compared to neural networks with unstructured projections. View details
    Fast Orthogonal Projection Based on Kronecker Product
    Xu Zhang
    Shengjin Wang
    Shih-Fu Chang
    International Conference on Computer Vision (ICCV) (2015)
    Preview abstract We propose a family of structured matrices to speed up orthogonal projections for high-dimensional data commonly seen in computer vision applications. In this, a structured matrix is formed by the Kronecker product of a series of smaller orthogonal matrices. This achieves O(dlogd) computational complexity and O(logd) space complexity for d-dimensional data, a drastic improvement over the standard unstructured projections whose computational and space complexities are both O(d^2). We also introduce an efficient learning procedure for optimizing such matrices in a data dependent fashion. We demonstrate the significant advantages of the proposed approach in solving the approximate nearest neighbor (ANN) image search problem with both binary embedding and quantization. Comprehensive experiments show that the proposed approach can achieve similar or better accuracy as the existing state-of-the-art but with significantly less time and memory. View details
    Circulant Binary Embedding
    Yunchao Gong
    Shih-Fu Chang
    International Conference on Machine Learning (ICML) (2014)
    Preview abstract Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure enables the use of Fast Fourier Transformation to speed up the computation. Compared to methods that use unstructured matrices, the proposed method improves the time complexity from O(d^2) to O(dlogd), and the space complexity from O(d^2) to O(d) where d is the input dimensionality. We also propose a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternatively minimizes the objective in original and Fourier domains. We show by extensive experiments that the proposed approach gives much better performance than the state-of-the-art approaches for fixed time, and provides much faster computation with no performance degradation for fixed number of bits. View details
    pSVM for Learning with Label Proportions
    Dong Liu
    Tony Jebara
    Shih-Fu Chang
    International Conference on Machine Learning (ICML) (2013)
    Tamp: A Library for Compact Deep Neural Networks with Structured Matrices
    Bingchen Gong
    Shih-Fu Chang
    ACM Multimedia (ACMMM) (2016)