Heng-Tze Cheng

Heng-Tze Cheng is a Technical Lead Manager and Senior Staff Software Engineer on the Google Brain team, part of Google Research & AI. Heng-Tze currently leads a research team focusing on Neural Sequence Modeling research for Task-oriented Dialogues, Personalized Semantic Search, and Recommender Systems productionized across Google, such as Google Duplex Assistant, YouTube, and more. Heng-Tze also founded and led the Wide & Deep Learning project in TensorFlow, and has worked on large-scale machine learning platforms that are widely used for retrieval, ranking, and recommender systems. Prior to joining Google in 2014, Heng-Tze received his Ph.D. from Carnegie Mellon University in 2013 and B.S. from National Taiwan University in 2008. His research interests range across machine learning, information retrieval, user behavior modeling, and human activity recognition.
Authored Publications
    Prompt-tuning is becoming a new paradigm for finetuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate prompts. We propose a novel architecture of HyperPrompt: prompt-based task-conditioned parameterization of self-attention in Transformers. We show that HyperPrompt is very competitive against strong multi-task learning baselines with only 1% of additional task-conditioning parameters. The prompts are end-to-end learnable via generation by a HyperNetwork. The additional parameters scale sub-linearly with the number of downstream tasks, which makes it very parameter efficient for multi-task learning. HyperPrompt allows the network to learn task-specific feature maps where the prompts serve as task global memories. Information sharing is enabled among tasks through the HyperNetwork to alleviate task conflicts during co-training. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performance over strong T5 multi-task learning baselines and parameter-efficient adapter variants including Prompt-Tuning on Natural Language Understanding benchmarks of GLUE and SuperGLUE across all the model sizes explored.
    LaMDA: Language Models for Dialog Applications
    Aaron Daniel Cohen
    Alena Butryna
    Alicia Jin
    Apoorv Kulshreshtha
    Ben Zevenbergen
    Chung-ching Chang
    Cosmo Du
    Daniel De Freitas Adiwardana
    Dehao Chen
    Dmitry (Dima) Lepikhin
    Erin Hoffman-John
    Hongrae Lee
    Igor Krivokon
    James Qin
    Jamie Hall
    Joe Fenton
    Johnny Soraker
    Lora Mois Aroyo
    Maarten Paul Bosma
    Marc Joseph Pickett
    Marcelo Amorim Menegali
    Marian Croak
    Maxim Krikun
    Meredith Ringel Morris
    Noam Shazeer
    Rachel Bernstein
    Ravi Rajakumar
    Ray Kurzweil
    Romal Thoppilan
    Steven Zheng
    Taylor Bos
    Toju Duke
    Tulsee Doshi
    Vincent Y. Zhao
    Will Rusch
    Yuanzhong Xu
    arXiv (2022)
    We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvement on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model’s responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
    Mondegreen: A Post-Processing Solution to Speech Recognition Error Correction for Voice Search Queries
    Ajit Apte
    Ambarish Jash
    Amol H Wankhede
    Ankit Kumar
    Ayooluwakunmi Jeje
    Dima Kuzmin
    Ellie Ka In Chio
    Harry Fung
    Jon Effrat
    Nitin Jindal
    Pei Cao
    Senqiang Zhou
    Sukhdeep S. Sodhi
    Tameen Khan
    Tarush Bali
    KDD (2021)
    As more and more online search queries come from voice, automatic speech recognition becomes a key component to deliver relevant search results. Errors introduced by automatic speech recognition (ASR) lead to irrelevant search results returned to the user, thus causing user dissatisfaction. In this paper, we introduce an approach, Mondegreen, to correct voice queries in text space without depending on audio signals, which may not always be available due to system constraints or to privacy or bandwidth considerations (for example, some ASR systems run on-device). We focus on voice queries transcribed via several proprietary commercial ASR systems. These queries come from users making internet or online-service search queries. We first present an analysis showing how different the language distribution coming from user voice queries is from that in traditional text corpora used to train off-the-shelf ASR systems. We then demonstrate that Mondegreen can achieve significant improvements in increased user interaction by correcting user voice queries in one of the largest search systems in Google. Finally, we see Mondegreen as complementing existing highly-optimized production ASR systems, which may not be frequently retrained and thus lag behind due to vocabulary drifts.
    Zero-Shot Transfer Learning for Query-Item Cold Start in Search Retrieval and Recommendations
    Ankit Kumar
    Cosmo Du
    Dima Kuzmin
    Ellie Chio
    John Roberts Anderson
    Li Zhang
    Nitin Jindal
    Pei Cao
    Ritesh Agarwal
    Steffen Rendle
    Tao Wu
    Wen Li
    CIKM (2020)
    Most search retrieval and recommender systems predict top-K items given a query by learning directly from a large training set of (query, item) pairs, where a query can include natural language (NL), user, and context features. These approaches fall into the traditional supervised learning framework where the algorithm trains on labeled data from the target task. In this paper, we propose a new zero-shot transfer learning framework, which first learns representations of items and their NL features by predicting (item, item) correlation graphs as an auxiliary task, followed by transferring learned representations to solve the target task (query-to-item prediction), without having seen any (query, item) pairs in training. The advantages of applying this new framework include: (1) Cold-starting search and recommenders without abundant query-item data; (2) Generalizing to previously unseen or rare (query, item) pairs and alleviating the "rich get richer" problem; (3) Transferring knowledge of (item, item) correlation from domains outside of search. We show that the framework is effective on a large-scale search and recommender system.
    Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology
    Vihan Jain
    Jing Wang
    Sanmit Narvekar
    Ritesh Agarwal
    Rui Wu
    Morgane Lustman
    Vince Gatto
    Paul Covington
    Jim McFadden
    arXiv (2019)
    Most practical recommender systems focus on estimating immediate user engagement without considering the long-term effects of recommendations on user behavior. Reinforcement learning (RL) methods offer the potential to optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items, which may have interacting effects on user choice, methods are required to deal with the combinatorics of the RL action space. In this work, we address the challenge of making slate-based recommendations to optimize long-term value using RL. Our contributions are three-fold. (i) We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user-choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. (ii) We outline a methodology that leverages existing myopic learning-based recommenders to quickly develop a recommender that handles LTV. (iii) We demonstrate our methods in simulation, and validate the scalability of decomposed TD-learning using SlateQ in live experiments on YouTube.
    SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets
    Vihan Jain
    Jing Wang
    Sanmit Narvekar
    Ritesh Agarwal
    Rui Wu
    Proceedings of the Twenty-eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macau, China (2019), pp. 2592-2599
    Reinforcement learning (RL) methods for recommender systems optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items, which may have interacting effects on user choice, methods are required to deal with the combinatorics of the RL action space. We develop SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. We demonstrate our methods in simulation, and validate the scalability and effectiveness of decomposed TD-learning on YouTube.
    TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks
    Cassandra Xia
    Clemens Mewald
    George Roumpos
    Illia Polosukhin
    Jamie Alexander Smith
    Jianwei Xie
    Lichan Hong
    Mustafa Ispir
    Philip Daniel Tucker
    Yuan Tang
    Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Canada (2017)
    We present a framework for specifying, training, evaluating, and deploying machine learning models. Our focus is to simplify writing cutting edge machine learning models in a way that enables bringing those models into production. Recognizing the fast evolution of the field of deep learning, we make no attempt to capture the design space of all possible model architectures in a DSL or similar configuration. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization, as well as providing a unifying Estimator interface that makes it possible to write downstream infrastructure (distributed training, hyperparameter tuning, …) independent of the model implementation. We balance the competing demands for flexibility and simplicity by offering APIs at different levels of abstraction, making common model architectures available “out of the box”, while providing a library of utilities designed to speed up experimentation with model architectures. To make out of the box models flexible and usable across a wide range of problems, these canned Estimators are parameterized not only over traditional hyperparameters, but also using feature columns, a declarative specification describing how to interpret input data. We discuss our experience in using this framework in research and production environments, and show the impact on code health, maintainability, and development speed.
    TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
    Akshay Naresh Modi
    Chiu Yuen Koo
    Chuan Yu Foo
    Clemens Mewald
    Denis M. Baylor
    Levent Koc
    Lukasz Lew
    Martin A. Zinkevich
    Mustafa Ispir
    Neoklis Polyzotis
    Steven Whang
    Sudip Roy
    Sukriti Ramesh
    Vihan Jain
    Xin Zhang
    KDD 2017
    Creating and maintaining a platform for reliably producing and deploying machine learning models requires careful orchestration of many components: a learner for generating models based on training data, modules for analyzing and validating both data as well as models, and finally infrastructure for serving models in production. This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt. We present TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google. By integrating the aforementioned components into one platform, we were able to standardize the components, simplify the platform configuration, and reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions. We present the case study of one deployment of TFX in the Google Play app store, where the machine learning models are refreshed continuously as new data arrive. Deploying TFX led to reduced custom code, faster experiment cycles, and a 2% increase in app installs resulting from improved data and model analysis.
    Wide & Deep Learning for Recommender Systems
    Levent Koc
    Tal Shaked
    Glen Anderson
    Wei Chai
    Mustafa Ispir
    Rohan Anil
    Lichan Hong
    Vihan Jain
    Xiaobing Liu
    Hemal Shah
    arXiv:1606.07792 (2016)
    Generalized linear models with nonlinear feature transformations are widely used for large-scale regression and classification problems with sparse inputs. Memorization of feature interactions through a wide set of cross-product feature transformations is effective and interpretable, while generalization requires more feature engineering effort. With less feature engineering, deep neural networks can generalize better to unseen feature combinations through low-dimensional dense embeddings learned for the sparse features. However, deep neural networks with embeddings can over-generalize and recommend less relevant items when the user-item interactions are sparse and high-rank. In this paper, we present Wide & Deep learning (jointly trained wide linear models and deep neural networks) to combine the benefits of memorization and generalization for recommender systems. We productionized and evaluated the system on a commercial mobile app store with over one billion active users and over one million apps. Online experiment results show that Wide & Deep significantly increased app acquisitions compared with wide-only and deep-only models.
    Nonparametric Discovery of Human Routine from Sensor Data
    Feng-Tso Sun
    Yi-Ting Yeh
    Cynthia Kuo
    Martin Griss
    IEEE International Conference on Pervasive Computing and Communications (PerCom) (2014)
    People engage in routine behaviors. Automatic routine discovery goes beyond low-level activity recognition such as sitting or standing and analyzes human behaviors at a higher level (e.g., commuting to work). With recent developments in ubiquitous sensor technologies, it becomes easier to acquire a massive amount of sensor data. One main line of research is to mine human routines from sensor data using parametric topic models such as latent Dirichlet allocation. The main shortcoming of parametric models is that they assume a fixed, pre-specified parameter regardless of the data. Choosing an appropriate parameter usually requires an inefficient trial-and-error model selection process. Furthermore, it is even more difficult to find optimal parameter values in advance for personalized applications. In this paper, we present a novel nonparametric framework for human routine discovery that can infer high-level routines without knowing the number of latent topics beforehand. Our approach is evaluated on public datasets in two routine domains: a 34-daily-activity dataset and a transportation mode dataset. Experimental results show that our nonparametric framework can automatically learn the appropriate model parameters from sensor data without any form of model selection procedure and can outperform traditional parametric approaches for human routine discovery tasks.
    Towards zero-shot learning for human activity recognition using semantic attribute sequence model
    Martin Griss
    Paul Davis
    Jianguo Li
    Di You
    UbiComp '13 Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, ACM
    Understanding human activities is important for user-centric and context-aware applications. Previous studies showed promising results using various machine learning algorithms. However, most existing methods can only recognize the activities that were previously seen in the training data. In this paper, we present a new zero-shot learning framework for human activity recognition that can recognize an unseen new activity even when there are no training samples of that activity in the dataset. We propose a semantic attribute sequence model that takes into account both the hierarchical and sequential nature of activity data. Evaluation on datasets in two activity domains shows that the proposed zero-shot learning approach achieves 70-75% precision and recall recognizing unseen new activities, and outperforms supervised learning with limited labeled data for the new classes.
    NuActiv: Recognizing Unseen New Activities Using Semantic Attribute-Based Learning
    Feng-Tso Sun
    Martin Griss
    Paul Davis
    Jianguo Li
    Di You
    Proceeding of the 11th Annual International Conference on Mobile Systems, Applications, and Services, ACM, New York, NY, USA (2013), pp. 361-374
    We study the problem of how to recognize a new human activity when we have never seen any training example of that activity before. Recognizing human activities is an essential element for user-centric and context-aware applications. Previous studies showed promising results using various machine learning algorithms. However, most existing methods can only recognize the activities that were previously seen in the training data. A previously unseen activity class cannot be recognized if there were no training samples in the dataset. Even if all of the activities can be enumerated in advance, labeled samples are often time-consuming and expensive to get, as they require huge effort from human annotators or experts. In this paper, we present NuActiv, an activity recognition system that can recognize a human activity even when there are no training data for that activity class. Firstly, we designed a new representation of activities using semantic attributes, where each attribute is a human readable term that describes a basic element or an inherent characteristic of an activity. Secondly, based on this representation, a two-layer zero-shot learning algorithm is developed for activity recognition. Finally, to reinforce recognition accuracy using minimal user feedback, we developed an active learning algorithm for activity recognition. Our approach is evaluated on two datasets, including a 10-exercise-activity dataset we collected, and a public dataset of 34 daily life activities. Experimental results show that using semantic attribute-based learning, NuActiv can generalize knowledge to recognize unseen new activities. Our approach achieved up to 79% accuracy in unseen activity recognition.
    SensOrchestra: Collaborative Sensing for Symbolic Location Recognition
    Feng-Tso Sun
    Senaka Buthpitiya
    Martin Griss
    International Conference on Mobile Computing, Applications, and Services, 2010
    Symbolic location of a user, like a store name in a mall, is essential for context-based mobile advertising. Existing fingerprint-based localization using only a single phone is susceptible to noise, and has a major limitation in that the phone has to be held in the hand at all times. In this paper, we present SensOrchestra, a collaborative sensing framework for symbolic location recognition that groups nearby phones to recognize ambient sounds and images of a location collaboratively. We investigated audio and image features, and designed a classifier fusion model to integrate estimates from different phones. We also evaluated the energy consumption, bandwidth, and response time of the system. Experimental results show that SensOrchestra achieved 87.7% recognition accuracy, which reduces the error rate of the single-phone approach by 2X, and eliminates the limitations on how users carry their phones. We believe general location or activity recognition systems can all benefit from this collaborative framework.
    Automatic Chord Recognition for Music Classification and Retrieval
    Yi-Hsuan Yang
    Yu-Ching Lin
    I-Bin Liao
    Homer H. Chen
    IEEE International Conference on Multimedia and Expo (ICME), 2008
    As one of the most important mid-level features of music, the chord contains rich information about harmonic structure that is useful for music information retrieval. In this paper, we present a chord recognition system based on the N-gram model. The system is time-efficient, and its accuracy is comparable to existing systems. We further propose a new method to construct chord features for music emotion classification and evaluate its performance on commercial song recordings. Experimental results demonstrate the advantage of using chord features for music classification and retrieval.