Jump to Content
Zhe Zhao

Zhe Zhao

I am a Research Scientist at Google. I received my PhD in Department of Computer Science and Engineering at University of Michigan, Ann Arbor. My research interests focus on applied machine learning, data mining, and information retrieval. Personal webiste.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains, such as natural language processing and vision. Sparse-MoEs select a subset of the “experts” (thus, only a portion of the overall network) for each input sample using a sparse, trainable gate. Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods. In this paper, we introduce two improvements to current MoE approaches. First, we propose a new sparse gate: COMET, which relies on a novel tree-based mechanism. COMET is differentiable, can exploit sparsity to speed up computation, and outperforms state-of-the-art gates. Second, due to the challenging combinatorial nature of sparse expert selection, first-order methods are typically prone to low-quality solutions. To deal with this challenge, we propose a novel, permutation-based local search method that can complement first-order methods in training any sparse gate, e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can help networks escape bad initializations or solutions. We performed large-scale experiments on various domains, including recommender systems, vision, and natural language processing. On standard vision and recommender systems benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over prior differentiable gates e.g., DSelect-k. When Top-k and Hash gates are combined with local search, we see up to 100× reduction in the budget needed for hyperparameter tuning. Moreover, for language modeling, our approach improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7 GLUE benchmarks as well as SQuAD dataset. View details
    Fast as CHITA: Neural Network Pruning with Combinatorial Optimization
    Riade Benbaki
    Wenyu Chen
    Meng Xiang
    Natalia Ponomareva
    Rahul Mazumder
    ICML 2023 (2023)
    Preview abstract The sheer size of modern neural networks makes model serving a serious computational challenge. A popular class of compression techniques overcomes this challenge by pruning or sparsifying the weights of pretrained networks. While useful, these techniques often face serious tradeoffs between computational requirements and compression quality. In this work, we propose a novel optimization-based pruning framework that considers the combined effect of pruning (and updating) multiple weights subject to a sparsity constraint. Our approach, CHITA, extends the classical Optimal Brain Surgeon framework and results in significant improvements in speed, memory, and performance over existing optimization-based approaches for network pruning. CHITA's main workhorse performs combinatorial optimization updates on a memory-friendly representation of local quadratic approximation(s) of the loss function. On a standard benchmark of pretrained models and datasets, CHITA leads to significantly better sparsity-accuracy tradeoffs than competing methods. For example, for MLPNet with only 2% of the weights retained, our approach improves the accuracy by 63% relative to the state of the art. Furthermore, when used in conjunction with fine-tuning SGD steps, our method achieves significant accuracy gains over the state-of-the-art approaches. View details
    Can Small Heads Help? Understanding and Improving Multi-Task Generalization
    Christopher Fifty
    Dong Lin
    Li Wei
    Lichan Hong
    Yuyan Wang
    the WebConf 2022 (2022)
    Preview abstract A goal for multi-task learning from a multi-objective optimization perspective is to find the Pareto solutions that are not dominated by others. In this paper, we provide some insights on understanding the trade-off between Pareto efficiency and generalization, as a result of parameterization in deep learning: as a multi-objective optimization problem, enough parameterization is needed for handling task conflicts in a constrained solution space; however, from a multi-task generalization perspective, over-parameterization undermines the benefit of learning a shared representation which helps harder tasks or tasks with limited training examples. A delicate balance between multi-task generalization and multi-objective optimization is therefore needed for finding a better trade-off between efficiency and generalization. To this end, we propose a method of under-parameterized self-auxiliaries for multi-task models to achieve the best of both worlds. It is model-agnostic, task-agnostic and works with other multi-task learning algorithms. Empirical results show our method improves Pareto efficiency over existing popular algorithms on several multi-task applications. View details
    Preview abstract In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup. View details
    Preview abstract Prompt-tuning is becoming a new paradigm for finetuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate prompts. We propose a novel architecture of HyperPrompt: prompt-based task-conditioned parameterization of self-attention in Transformers. We show that HyperPrompt is very competitive against strong multi-task learning baselines with only 1% of additional task-conditioning parameters. The prompts are end-to-end learnable via generation by a HyperNetwork. The additional parameters scale sub-linearly with the number of downstream tasks, which makes it very parameter efficient for multi-task learning. Hyper-Prompt allows the network to learn task-specific feature maps where the prompts serve as task global memories. Information sharing is enabled among tasks through the HyperNetwork to alleviate task conflicts during co-training. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performances over strong T5 multi-task learning base-lines and parameter-efficient adapter variants including Prompt-Tuning on Natural Language Understanding benchmarks of GLUE and Super-GLUE across all the model sizes explored. View details
    DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
    Maheswaran Sathiamoorthy
    Yihua Chen
    Rahul Mazumder
    Lichan Hong
    35th Conference on Neural Information Processing Systems (NeurIPS 2021) (2021)
    Preview
    Preview abstract Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose \textsc{HyperGrid}, a new approach for highly effective multi-task learning. The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections, which helps to specialize regions in weight matrices for different tasks. In order to construct the proposed hyper projection, our method learns the interactions and composition between a global state and a local task-specific state. We apply our proposed \textsc{HyperGrid} on the current state-of-the-art T5 model, yielding optimistic and strong gains across GLUE and SuperGLUE benchmarks when trained in a single model multi-tasking setup. Our method helps to bridge the gap between the single-task finetune methods and the single model multi-tasking approaches View details
    Multitask Mixture of Sequential Experts for User Activity Streams
    Yicheng Cheng
    Jingzheng Qin
    26TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2020)
    Preview abstract Multi-task deep learning has been an actively researched topic, and it has been used in many real-world systems for user activities and content recommendation. While most of the multi-task model architectures proposed to date focus on using non-sequential input features (e,g. query and context), input data is often sequential in real-world web application scenarios. For example, user behavior streams, such as user search logs in search systems, are naturally a temporal sequence. Modeling user sequential behaviors as explicit sequential representations can empower the multi-task model to incorporate temporal dependencies, thus predicting future user behavior more accurately. In this work, we study the challenging problem of how to model sequential user behavior in the neural multi-task learning settings. Our major contribution is a novel framework, Mixture of Sequential Experts (MoSE). It explicitly models sequential user behavior using Long Short-Term Memory (LSTM) in the state-of-art Multi-gate Mixture-of-Expert multi-task modeling framework. In experiments, we show the effectiveness of the MoSE architecture over seven alternative architectures on both synthetic and noisy real-world user data in Google Apps. We also demonstrate the effectiveness and flexibility of the MoSE architecture in a real-world decision making engine in GMail, by trading off between search quality and resource costs. View details
    Fairness in Recommendation Ranking through Pairwise Comparisons
    Alex Beutel
    Tulsee Doshi
    Hai Qian
    Li Wei
    Yi Wu
    Lukasz Heldt
    Lichan Hong
    Cristos Goodrow
    KDD (2019)
    Preview abstract Recommender systems are one of the most pervasive applications of machine learning in industry, with many services using them to match users to products or information. As such it is important to ask: what are the possible fairness risks, how can we quantify them, and how should we address them? In this paper we offer a set of novel metrics for evaluating algorithmic fairness concerns in recommender systems. In particular we show how measuring fairness based on pairwise comparisons from randomized experiments provides a tractable means to reason about fairness in rankings from recommender systems. Building on this metric, we offer a new regularizer to encourage improving this metric during model training and thus improve fairness in the resulting rankings. We apply this pairwise regularization to a large-scale, production recommender system and show that we are able to significantly improve the system's pairwise fairness. View details
    Preview abstract Many recommendation systems need to retrieve and score items from a large corpus. A common approach to handle data sparsity and power-law item distribution is to learn item representations from its content features. Apart from many content-aware systems based on matrix factorization, in this paper, we consider a modeling framework with two-tower neural networks where one network called item tower is used to encode a wide variety of item features. Optimizing loss functions calculated from in-batch negatives, which are items sampled in a random batch, is a general recipe of training such two-tower models. However, batch loss is subject to sampling bias which could severely restrict model performance, particularly in the case of power-law distribution. In this work, we present a novel algorithm for estimating item frequency from streaming data. Our main idea is to sketch and estimate item occurrences via gradient descent. Through theoretical analysis and simulations, we show that the proposed algorithm can work without fixed item vocabulary, and is capable of producing unbiased estimation and being adaptive to item distribution change. We then apply the sampling-bias-corrected modeling approach to build a large scale retrieval system called Neural Deep Retrieval (NDR) for YouTube recommendations. The system is deployed to retrieve personalized suggestions from a corpus of tens of millions videos. We demonstrate the effectiveness of sampling bias correction through offline experiments on two real-world datasets. We also conduct live A/B testings to show that the NDR system leads to improved recommendation quality for YouTube. View details
    Recommending What Video to Watch Next: A Multitask Ranking System
    Aditee Ajit Kumthekar
    Aniruddh Nath
    Li Wei
    Lichan Hong
    Mahesh Sathiamoorthy
    Shawn Andrews
    Recsys 2019 (2019)
    Preview abstract In this paper, we introduce a large scale multi-objective ranking system for recommending what video to watch next on an industrial video sharing platform. The system faces many real-world challenges, including the presence of multiple competing ranking objectives, as well as implicit selection biases in user feedback. To tackle these challenges, we explored a variety of soft-parameter sharing techniques such as Multi-gate Mixture-of-Experts so as to efficiently optimize for multiple ranking objectives. Additionally, we mitigated the selection biases by adopting a Wide & Deep frame- work. We demonstrated that our proposed techniques can lead to substantial improvements on recommendation quality on one of the world’s largest video sharing platforms. View details
    Preview abstract Neural-based multi-task learning has been successfully used in many real-world large-scale applications such as recommendation systems. For example, in movie recommendations, beyond providing users movies which they tend to purchase and watch, the system might also optimize for users liking the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task. To validate our approach on data with different levels of task relatedness, we first apply it to a synthetic dataset where we control the task relatedness. We show that the proposed approach performs better than baseline methods when the tasks are less related. We also show that the MMoE structure results in an additional trainability benefit, depending on different levels of randomness in the training data and model initialization. Furthermore, we demonstrate the performance improvements by MMoE on real tasks including a binary classification benchmark, and a large-scale content recommendation system at Google. View details
    Preview abstract How can we learn classifier that is ``fair'' for a protected or sensitive group, when we do not know if the input to the classifier affects the protected group? How can we train such a classifier when data on the protected group is difficult to attain? In many settings, finding out the sensitive input attribute can be prohibitively expensive even during model training, and possibly impossible during model serving. For example, in recommender systems, if we want to predict if a user will click on a given recommendation, we often do not know many attributes of the user, e.g., race or age, and many attributes of the content are hard to determine, e.g., the language or topic. Thus, it is not feasible to use a different classifier calibrated based on knowledge of the sensitive attribute. Here, we use an adversarial training procedure to remove information about the sensitive attribute from the latent representation learned by a neural network. In particular, we study how the choice of data for the adversarial training effects the resulting fairness properties. We find two interesting results: a remarkably small amount of data is needed to train these models, and there is still a gap between the theoretical implications and the empirical results. View details
    Improving User Topic Interest Profiles by Behavior Factorization
    Lichan Hong
    Proceedings of the 24th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2015), pp. 1406-1416
    Preview abstract Many recommenders aim to provide relevant recommendations to users by building personal topic interest profiles and then using these profiles to find interesting contents for the user. In social media, recommender systems build user profiles by directly combining users' topic interest signals from a wide variety of consumption and publishing behaviors, such as social media posts they authored, commented on, +1'd or liked. Here we propose to separately model users' topical interests that come from these various behavioral signals in order to construct better user profiles. Intuitively, since publishing a post requires more effort, the topic interests coming from publishing signals should be more accurate of a user's central interest than, say, a simple gesture such as a +1. By separating a single user's interest profile into several behavioral profiles, we obtain better and cleaner topic interest signals, as well as enabling topic prediction for different types of behavior, such as topics that the user might +1 or comment on, but might never write a post on that topic. To do this at large scales in Google+, we employed matrix factorization techniques to model each user's behaviors as a separate example entry in the input user-by-topic matrix. Using this technique, which we call "behavioral factorization", we implemented and built a topic recommender predicting user's topical interests using their actions within Google+. We experimentally showed that we obtained better and cleaner signals than baseline methods, and are able to more accurately predict topic interests as well as achieve better coverage. View details
    No Results Found