Yun-hsuan Sung

Authored Publications
    Abstract: Despite recent progress, it has been difficult to prevent semantic hallucinations in generative Large Language Models. One common solution is to augment LLMs with a retrieval system and ensure that the generated output is attributable to the retrieved information. Given this added constraint, it is plausible to expect the overall quality of the output, for example its fluency, to be affected. Can scaling language models help? Here we examine the relationship between fluency and attribution in LLMs prompted with retrieved evidence in knowledge-heavy dialog settings. Our experiments use a set of automatic metrics aligned with human preferences to evaluate a large set of generations produced under varying LLM parameters and supplied context. We show that larger models tend to do much better in both fluency and attribution, and that (naively) using top-k retrieval versus top-1 retrieval improves attribution but hurts fluency. We then propose a recipe that allows smaller models both to close the gap with larger models and to preserve the benefits of top-k retrieval while avoiding its drawbacks.
    LongT5: Efficient Text-To-Text Transformer for Long Sequences
    Joshua Ainslie
    David Uthus
    Jianmo Ni
    Yinfei Yang
    Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics
    Abstract: Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrate attention ideas from long-input transformers (ETC) and adopt pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism but does not require additional side inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
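    To make the Transient Global idea concrete, below is a minimal single-head NumPy sketch of TGlobal-style attention: each token attends to its local window plus "global" tokens formed by averaging fixed-size blocks of the input itself, so no side inputs are needed. The window radius, block size, and simplified single-head form are illustrative assumptions, not the paper's implementation.
```python
# Illustrative sketch of TGlobal-style attention (not the LongT5 implementation).
# Each token attends to a local window plus "global" tokens formed by averaging
# fixed-size blocks of the input, so no extra side inputs are needed.
import numpy as np

def tglobal_attention(x, radius=2, block=4):
    """x: [seq_len, dim] token representations (single head, no projections)."""
    n, d = x.shape
    # Global tokens: mean of each block of the input sequence.
    n_blocks = int(np.ceil(n / block))
    globals_ = np.stack([x[i * block:(i + 1) * block].mean(axis=0)
                         for i in range(n_blocks)])
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        keys = np.concatenate([x[lo:hi], globals_])   # local window + global tokens
        scores = keys @ x[i] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ keys
    return out

print(tglobal_attention(np.random.randn(10, 8)).shape)  # (10, 8)
```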
    Abstract: Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used to build the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on the Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
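    As a rough illustration of the dual-encoder training objective described above, the sketch below computes a symmetric contrastive (softmax cross-entropy) loss over a batch of image and text embeddings; the temperature value and function names are assumptions for illustration, not taken from the paper.
```python
# Minimal sketch of a symmetric contrastive loss that aligns image and text
# embeddings from a dual encoder (illustrative; not the paper's implementation).
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: [batch, dim]; matching pairs share the same row index."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                 # scaled cosine similarities
    n = len(logits)
    diag = np.arange(n)
    i2t = -log_softmax(logits)[diag, diag].mean()      # image-to-text direction
    t2i = -log_softmax(logits.T)[diag, diag].mean()    # text-to-image direction
    return (i2t + t2i) / 2

print(contrastive_loss(np.random.randn(8, 16), np.random.randn(8, 16)))
```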
    Machine Translation Aided Bilingual Data-to-Text Generation and Semantic Parsing
    Heming Ge
    Mihir Sanjay Kale
    Oshin Agarwal
    Siamak Shakeri
    3rd Workshop on Natural Language Generation from the Semantic Web (2020)
    Abstract: We present a system for bilingual Data-to-Text Generation and Semantic Parsing. We use a text-to-text generator to learn a single model that works for both languages on each of the tasks. The model is aided by machine translation during both pre-training and fine-tuning. We evaluate the system on WebNLG 2020 data, which consists of RDF triples in English and natural language sentences in English and Russian for both tasks. We achieve considerable gains over monolingual models, especially on unseen relations and on Russian.
    Self-supervised Learning for Pairwise Data Refinement
    Bowen Liang
    Wei Wang
    Zarana Parekh
    Yinfei Yang
    Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Association for Computational Linguistics, Suzhou, China (2020), pp. 435-446
    Abstract: We present a self-supervised method to refine pairwise data using the contents of the data itself. Our method calculates cross-lingual similarity scores with a dual-encoder model and uses them to select data for training new dual-encoder models in an iterative way. To illustrate the functionality of our method, we apply it to the task of denoising parallel texts mined from the internet on two language pairs: en-fr and en-de. We train dual-encoder models on the refined data and test them on the BUCC bitext mining tasks. The dual-encoder models show steady performance improvement with every iteration. We also use the refined data to train machine translation models that we integrate in our method to further improve the dual-encoder models. The machine translation models we evaluate are competitive with similar models trained on data filtered with a supervised approach. Our method has the advantage that, being entirely self-supervised, it is well-suited to text data for which there is no prior knowledge about the language or for which labeled clean data is not available.
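    The iterative refinement loop can be summarized as: score every mined pair with the current model, keep the highest-scoring fraction, and retrain on the kept pairs. The toy sketch below shows that loop end to end; the character-count "encoder", the keep fraction, and the example pairs are placeholders so the code runs, whereas the real method trains neural dual encoders at each iteration.
```python
# Toy sketch of the iterative pairwise refinement loop (illustrative only).
import numpy as np

def encode(text):
    """Trivial stand-in for a dual encoder: normalized letter-count vector."""
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[min(ord(ch) - ord('a'), 25)] += 1
    return vec / (np.linalg.norm(vec) + 1e-8)

def refine(pairs, n_iterations=3, keep_fraction=0.7):
    """pairs: list of (source_sentence, target_sentence) mined from the web."""
    for _ in range(n_iterations):
        scores = np.array([encode(s) @ encode(t) for s, t in pairs])
        cutoff = np.quantile(scores, 1.0 - keep_fraction)
        pairs = [p for p, sc in zip(pairs, scores) if sc >= cutoff]  # keep cleanest pairs
        # In the real method, a new dual encoder is trained on `pairs` here.
    return pairs

print(len(refine([("hello world", "bonjour le monde"), ("good day", "xqz")] * 5)))
```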
    Universal Sentence Encoder
    Yinfei Yang
    Sheng-yi Kong
    Nan Hua
    Nicole Lyn Untalan Limtiaco
    Rhomni St. John
    Steve Yuan
    Chris Tar
    Brian Strope
    Ray Kurzweil
    In submission to: EMNLP demonstration, Association for Computational Linguistics, Brussels, Belgium (2018)
    Abstract: We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines that do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.
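    For readers who want to try a released encoder, a minimal usage sketch with TensorFlow Hub is shown below; the module handle and version correspond to a commonly published release and may differ from the exact models described in the paper.
```python
# Minimal usage sketch for a publicly released Universal Sentence Encoder on TF Hub
# (module handle/version are an assumption; requires tensorflow and tensorflow_hub).
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed(["The quick brown fox.", "Sentence embeddings transfer well."])
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence
```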
    Abstract: This paper presents a computationally efficient machine-learned method for natural language response suggestion. Feed-forward neural networks using n-gram embedding features encode messages into vectors which are optimized to give message-response pairs a high dot-product value. An optimized search finds response suggestions. The method is evaluated in a large-scale commercial e-mail application, Inbox by Gmail. Compared to a sequence-to-sequence approach, the new system achieves the same quality at a small fraction of the computational requirements and latency.
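    A simplified sketch of the dot-product ranking step is shown below: a message embedding is scored against precomputed response embeddings and the top candidates are suggested. The random embeddings and candidate responses are placeholders; the production system uses trained feed-forward encoders and an optimized nearest-neighbor search.
```python
# Illustrative sketch of dot-product response ranking (not the production system).
import numpy as np

rng = np.random.default_rng(0)
responses = ["Sounds good!", "Thanks, got it.", "Can we reschedule?"]
response_vecs = rng.standard_normal((len(responses), 64))   # precomputed offline

def suggest(message_vec, top_k=2):
    scores = response_vecs @ message_vec                     # dot-product scoring
    best = np.argsort(-scores)[:top_k]
    return [responses[i] for i in best]

print(suggest(rng.standard_normal(64)))
```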
    Abstract: We investigate the task of modeling open-domain, multi-turn, unstructured, multi-participant, conversational dialogue. We specifically study the effect of incorporating different elements of the conversation. Unlike previous efforts, which focused on modeling messages and responses, we extend the modeling to long context and participants' history. Our system does not rely on handwritten rules or engineered features; instead, we train deep neural networks on a large conversational dataset. In particular, we exploit the structure of Reddit comments and posts to extract 2.1 billion messages and 133 million conversations. We evaluate our models on the task of predicting the next response in a conversation, and we find that modeling both context and participants improves prediction accuracy.
    Recognition of Multilingual Speech in Mobile Applications
    Hui Lin
    Jui-Ting Huang
    Francoise Beaufays
    Brian Strope
    ICASSP (2012)
    Abstract: We evaluate different architectures to recognize multilingual speech for real-time mobile applications. In particular, we show that combining the results of several recognizers greatly outperforms other solutions such as training a single large multilingual system or using an explicit language identification system to select the appropriate recognizer. Experiments are conducted on a trilingual English-French-Mandarin mobile speech task. The data set includes Google searches, Maps queries, as well as more general inputs such as email and short message dictation. Without pre-specifying the input language, the combined system achieves comparable accuracy to that of the monolingual systems when the input language is known. The combined system is also roughly 5% absolute better than an explicit language identification approach, and 10% better than a single large multilingual system.
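    One simple way to realize the combination idea, sketched below under assumed interfaces, is to run one recognizer per language on the same utterance and keep the hypothesis with the highest confidence; the stub recognizers and confidence values are invented for illustration, and the paper's actual combination strategy may be more elaborate.
```python
# Schematic of combining per-language recognizers by confidence (illustrative).
def combine(recognizers, audio):
    """recognizers: mapping language -> function returning (hypothesis, confidence)."""
    best_lang, best_hyp, best_conf = None, None, float("-inf")
    for lang, recognize in recognizers.items():
        hyp, conf = recognize(audio)
        if conf > best_conf:
            best_lang, best_hyp, best_conf = lang, hyp, conf
    return best_lang, best_hyp

# Stub recognizers standing in for real English/French/Mandarin systems.
stubs = {
    "en": lambda a: ("directions to the airport", 0.84),
    "fr": lambda a: ("directions vers l'aeroport", 0.61),
    "zh": lambda a: ("去机场怎么走", 0.47),
}
print(combine(stubs, audio=None))
```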
    Deploying Google Search by Voice in Cantonese
    Martin Jansche
    Pedro Moreno
    12th Annual Conference of the International Speech Communication Association (Interspeech 2011), pp. 2865-2868
    Abstract: We describe our efforts in deploying Google search by voice for Cantonese, a southern Chinese dialect widely spoken in and around Hong Kong and Guangzhou. We collected audio data from local Cantonese speakers in Hong Kong and Guangzhou by using our DataHound smartphone application. This data was used to create appropriate acoustic models. Language models were trained on anonymized query logs from Google Web Search for Hong Kong. Because users in Hong Kong frequently mix English and Cantonese in their queries, we designed our system from the ground up to handle both languages. We report on experiments with different techniques for mapping the phoneme inventories for both languages into a common space. Based on extensive experiments we report word error rates and web scores for both Hong Kong and Guangzhou data. Cantonese Google search by voice was launched in December 2010.
    Recognizing English Queries in Mandarin Voice Search
    Hung-An Chang
    Brian Strope
    Francoise Beaufays
    ICASSP (2011)
    Revisiting Graphemes with Increasing Amounts of Data
    Thad Hughes
    Francoise Beaufays
    Brian Strope
    ICASSP, IEEE (2009)
    Abstract: Letter units, or graphemes, have been reported in the literature as a surprisingly effective substitute for the more traditional phoneme units, at least in languages that enjoy a strong correspondence between pronunciation and orthography. For English, however, where letter symbols have less acoustic consistency, previously reported results fell short of systems using highly-tuned pronunciation lexicons. Grapheme units simplify system design, but since graphemes map to a wider set of acoustic realizations than phonemes, we should expect grapheme-based acoustic models to require more training data to capture these variations. In this paper, we compare the rate of improvement of grapheme and phoneme systems trained with datasets ranging from 450 to 1200 hours of speech. We consider various grapheme unit configurations, including using letter-specific, onset, and coda units. We show that the grapheme systems improve faster and, depending on the lexicon, reach or surpass the phoneme baselines with the largest training set.
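    As a rough illustration of grapheme units with position-dependent variants, the sketch below maps a word's spelling to letter units and marks onset and coda positions; the unit naming scheme is an assumption for illustration and may not match the paper's exact unit inventory.
```python
# Illustrative generation of grapheme "pronunciations" from spelling, with
# optional position-dependent onset/coda marking (unit names are made up).
def grapheme_units(word, mark_position=True):
    letters = [ch for ch in word.lower() if ch.isalpha()]
    units = []
    for i, ch in enumerate(letters):
        if not mark_position or len(letters) == 1:
            units.append(ch)
        elif i == 0:
            units.append(ch + "_onset")       # word-initial letter unit
        elif i == len(letters) - 1:
            units.append(ch + "_coda")        # word-final letter unit
        else:
            units.append(ch)
    return units

print(grapheme_units("google"))  # ['g_onset', 'o', 'o', 'g', 'l', 'e_coda']
```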
    Hidden Conditional Random Fields for Speech Recognition
    Ph.D. Thesis, Stanford University (2010)
    Hidden Conditional Random Fields for Phone Recognition
    Dan Jurafsky
    IEEE Workshop on Automatic Speech Recognition and Understanding (2009)
    Abstract: We apply Hidden Conditional Random Fields (HCRFs) to the task of TIMIT phone recognition. HCRFs are discriminatively trained sequence models that augment conditional random fields with hidden states capable of representing subphones and mixture components. We extend HCRFs, which had previously only been applied to phone classification with known boundaries, to recognize continuous phone sequences. We use an N-best inference algorithm in both learning (to approximate all competitor phone sequences) and decoding (to marginalize over hidden states). Our monophone HCRFs achieve a 28.3% phone error rate, outperforming maximum likelihood trained HMMs by 3.6%, maximum mutual information trained HMMs by 2.5%, and minimum phone error trained HMMs by 2.2%. We show that this improvement is partially due to HCRFs' ability to simultaneously optimize discriminative language models and acoustic models, a powerful property that has important implications for speech recognition.
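    For reference, the conditional likelihood of an HCRF marginalizes over hidden state sequences; in standard notation (not copied verbatim from the paper), with phone sequence y, hidden state sequence h, acoustics x, feature vector f, and weights lambda:
$$
p(y \mid x; \lambda) \;=\; \frac{\sum_{h} \exp\big(\lambda \cdot f(y, h, x)\big)}{\sum_{y'} \sum_{h'} \exp\big(\lambda \cdot f(y', h', x)\big)}
$$
    Decoding under this model requires summing over h for each candidate y, which is why the paper uses an N-best approximation during both learning and decoding.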
    Maximum Conditional Likelihood Linear Regression and Maximum a Posteriori for Hidden Conditional Random Fields Speaker Adaptation
    Constantinos Boulis
    Dan Jurafsky
    ICASSP (2008)
    Abstract: This paper shows how to improve Hidden Conditional Random Fields (HCRFs) for phone classification by applying various speaker adaptation techniques. These include Maximum A Posteriori (MAP) adaptation as well as a new technique we introduce called Maximum Conditional Likelihood Linear Regression (MCLLR), a discriminative variant of the widely used MLLR algorithm. In previous work, we and others have shown that HCRFs outperform even discriminatively trained HMMs. In this paper we show that HCRFs adapted via MCLLR or via MAP adaptation also work better than similarly adapted HMMs. We also compare MCLLR and MAP adaptation performance with different amounts of adaptation data. MCLLR adaptation performs better when the amount of adaptation data is relatively small, while MAP adaptation outperforms MCLLR with larger amounts of adaptation data.
    Regularization, Adaptation, and Non-Independent Features Improve Hidden Conditional Random Fields for Phone Classification
    Constantinos Boulis
    Christopher Manning
    Dan Jurafsky
    IEEE Workshop on Automatic Speech Recognition and Understanding (2007)
    Abstract: We show a number of improvements in the use of Hidden Conditional Random Fields (HCRFs) for phone classification on the TIMIT and Switchboard corpora. We first show that the use of regularization effectively prevents overfitting, improving over other methods such as early stopping. We then show that HCRFs are able to make use of non-independent features in phone classification, at least with small numbers of mixture components, while HMMs degrade due to their strong independence assumptions. Finally, we successfully apply Maximum a Posteriori adaptation to HCRFs, decreasing the phone classification error rate on the Switchboard corpus by around 1%-5% given only small amounts of adaptation data.
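    The regularization mentioned above typically amounts to adding a Gaussian prior (an L2 penalty) to the conditional log-likelihood; in standard notation (an assumption about the exact form used in the paper), the training objective becomes:
$$
\mathcal{L}(\lambda) \;=\; \sum_{i} \log p\big(y^{(i)} \mid x^{(i)}; \lambda\big) \;-\; \frac{\|\lambda\|^{2}}{2\sigma^{2}}
$$
    where the prior variance sigma^2 controls the strength of the penalty and plays the role that early stopping would otherwise play.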
    Detection of Word Fragments in Mandarin Telephone Conversation
    Cheng-Tao Chu
    Yuan Zhao
    Dan Jurafsky
    Interspeech (2006)
    Abstract: We describe preliminary work on the detection of word fragments in Mandarin conversational telephone speech. We extracted prosodic, voice quality, and lexical features, and trained Decision Tree and SVM classifiers. Previous research shows that glottalization features are instrumental in English fragment detection. However, we show that Mandarin fragments are quite different from English ones: 90% of Mandarin fragments are followed immediately by a repetition of the fragmentary word. These repetition fragments are not glottalized, and they have a very specific distribution; the 12 most frequent words ("you", "I", "that", "have", "then", etc.) cover 50% of the tokens of these fragments. Thus, rather than glottalization, we found the most useful feature for Mandarin fragment detection to be the identity of the neighboring character (word or morpheme). In an oracle experiment using the true (reference) neighboring words as well as prosodic and voice quality features, we achieved 80% accuracy in Mandarin fragment detection.
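    To illustrate how the lexical neighboring-word feature can be combined with prosodic features in a classifier, the toy sketch below trains a linear SVM on a handful of invented feature dictionaries; the feature names, values, and labels are made up for illustration, and the paper's actual feature set and training data are far richer.
```python
# Toy sketch of training a fragment detector on prosodic/lexical features
# (all data invented; the paper used Decision Tree and SVM classifiers on
# prosodic, voice-quality, and lexical features extracted from real speech).
from sklearn.svm import SVC
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

X = [
    {"duration": 0.08, "f0_slope": -1.2, "next_word": "我"},   # fragment of 我
    {"duration": 0.31, "f0_slope": 0.4,  "next_word": "是"},   # complete word
    {"duration": 0.07, "f0_slope": -0.9, "next_word": "你"},   # fragment of 你
    {"duration": 0.28, "f0_slope": 0.1,  "next_word": "好"},   # complete word
]
y = [1, 0, 1, 0]  # 1 = word fragment, 0 = complete word

clf = make_pipeline(DictVectorizer(sparse=False), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.predict([{"duration": 0.09, "f0_slope": -1.0, "next_word": "我"}]))
```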