Jump to Content
Yong Cheng

Yong Cheng

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract We present Mu2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on un-labeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition(ASR), Automatic Speech Translation (AST)and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu2SLAM trains ona sequence-to-sequence masked denoising objective similar to T5 on both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoSTAST, Mu2SLAM establishes a new state-of-the-art for models trained on public datasets, improv-ing on xx-en translation over the previous best by 1.9 Bleu points and on en-xx translation by 0.9 Bleu points. On Voxpopuli ASR, our model matches the performance of a mSLAM model finetuned with a RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks. View details
    Preview abstract This paper introduces a Masked Generative Video Transformer, named MAGVIT, for multi-task video generation. We train a single MAGVIT model and apply it to multiple video generation tasks at inference time. To this end, two new designs are proposed: an improved 3D tokenizer model to quantize a video into spatial-temporal visual tokens, and a novel technique to embed conditions inside the mask to facilitate multi-task training. We conduct extensive experiments to demonstrate the compelling quality, efficiency, and flexibility of the proposed model. First, MAGVIT radically improves the previous best fidelity on two video generation tasks. In terms of efficiency, MAGVIT offers leading video generation speed at inference time, which is estimated to be one or two orders-of-magnitudes faster than other models. As for flexibility, we verified that a single trained MAGVIT is able to generically perform 8+ tasks at several video benchmarks from drastically different visual domains. We will open source our framework and models. View details
    VideoPoet: A Large Language Model for Zero-Shot Video Generation
    Lijun Yu
    Xiuye Gu
    Rachel Hornung
    Hassan Akbari
    Ming-Chang Chiu
    Josh Dillon
    Agrim Gupta
    Meera Hahn
    Anja Hauth
    David Hendon
    Alonso Martinez
    Grant Schindler
    Huisheng Wang
    Jimmy Yan
    Xuan Yang
    Lu Jiang
    arxiv Preprint (2023) (to appear)
    Preview abstract We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/ View details
    Preview abstract In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%. View details
    Preview abstract Multilingual neural machine translation (NMT) typically learns to maximize the likelihood of training examples from a combination set of multiple language pairs. However, this mechanical combination only relies on the basic sharing to learn the inductive bias, which undermines the generalization and transferability of multilingual NMT models. In this paper, we introduce a multilingual crossover encoder-decoder (mXEnDec) to fuse language pairs at instance level to exploit cross-lingual signals. For better fusions on multilingual data, we propose several techniques to deal with the language interpolation, dissimilar language fusion and heavy data imbalance. Experimental results on a large-scale WMT multilingual data set show that our approach significantly improves model performance on general multilingual test sets and the model transferability on zero-shot test sets (up to $+5.53$ BLEU). Results on noisy inputs demonstrates the capability of our approach to improve model robustness against the code-switching noise. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level. View details
    Preview abstract Recently, self-supervised pre-training of text representations has been success-fully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve dramatic success on resource-rich NMT. In this paper, we propose a joint training approach, F2-XEnDec, to jointly self-supervised and supervised train NMT models. To this end, a new task called crossover encoder-decoder (XEnDec) is designed to entangle their representations. The key idea is to combine pseudo parallel sentences (also generated byXEnDec)) used in self-supervised training and parallel sentences in supervised training through a second crossover. Experiments on two resource-rich translation benchmarks, WMT’14English-German and English-French, demonstrate our approach achieve substantial improvements over the Transformer. We also show that our approach is capable of improving the model robustness against input perturbations, in particular for code-switched perturbations. View details
    Towards Web-based Etymological Hanzi learning
    Genze Wu
    Jia Xing
    Julia (Wenli) Zhu
    Jun Chen
    Kevin Jing
    Sijia Ma
    Wenhui Guo
    Yaolin Chen
    Yingying Zhao
    (2020)
    Preview abstract Modern-day Chinese characters, or Hanzi, originate from the ancient oracle-bone scripts (甲骨文). Such etymological relationship creates unique opportunities for Chinese literacy learning. This work proposes to use Web-based tools and the latest machine learning techniques to scale-up and enhance etymological Hanzi learning. By sharing our implementation details from launching an interactive sketch-based learning exhibition, we hope education-AI becomes more widely incorporated into today’s commercial Web applications. View details
    Preview abstract In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, in which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. We then discuss our approach, AdvAug, to train NMT models using the embeddings of virtual sentences in sequence-tosequence learning. Experiments on ChineseEnglish, English-French, and English-German translation benchmarks show that AdvAug achieves significant improvements over the Transformer (up to 4.9 BLEU points), and substantially outperforms other data augmentation techniques (e.g. back-translation) without using extra corpora. View details
    Living Jiagu : Enabling Constructive Etymology for Chinese Learning
    Sijia Ma
    Jun Chen
    Wenhui Guo
    Yingying Zhao
    Yaolin Chen
    Kevin Jing
    Julia (Wenli) Zhu
    Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, ACM, pp. 1-4
    Preview abstract Living Jiagu is an interactive, wall-sized exhibition for the engaging learning of Chinese writing. Living Jiagu leverages state-of-the-art machine learning technologies to enable the recognition and recall of Chinese characters via constructive etymology in context. That is, learning the writing and meaning of a pictographic character from image prompts similar to the creators of Oracle Bone Script (OBS) 3000 years ago and experiencing how these characters function and interact in natural scene. An installation of Living Jiagu received positive feedback from over one thousand users. View details
    Preview abstract Neural machine translation (NMT) suffers from the vulnerability to noisy perturbations in the input, which can cause a model trained on the clean data to behave abnormally on the noisy input. We propose an approach to improving the robustness of NMT models, which consists of two parts: (1) attack the translation model with adversarial source examples; (2) defend the translation model with adversarial target input to be robust against adversarial source input. For the generation of adversarial input, we propose to use a gradient-based method to craft adversarial examples that are advised by the translation loss in NMT based on the clean input. Experimental results on Chinese-English and English-German translation tasks demonstrate that our approach achieves significant improvements on the standard clean data and performs robustness on the noisy data. View details
    An End-to-End Generative Architecture for Paraphrase Generation
    Qian Yang
    Zhouyuan Huo
    Dinghan Shen
    Wenlin Wang
    Guoyin Wang
    Lawrence Carin
    EMNLP (2019)
    Preview abstract Generating high-quality paraphrases is a fundamental yet challenging natural language processing task. Despite the effectiveness of previous work based on generative models, there remain problems with exposure bias in recurrent neural networks, and often a failure to generate realistic sentences. To overcome these challenges, we propose the first end-to-end conditional generative architecture for generating paraphrases via adversarial training, which does not depend on extra linguistic information. Extensive experiments on four public datasets demonstrate the proposed method achieves state-of-the-art results, outperforming previous generative architectures on both automatic metrics (BLEU, METEOR, and TER) and human evaluations. View details
    Preview abstract While neural machine translation (NMT) has achieved remarkable success, NMT systems are prone to make word omission errors. In this work, we propose a contrastive learning approach to reducing word omission errors in NMT. The basic idea is to enable the NMT model to assign a higher probability to a ground-truth translation and a lower probability to an erroneous translation, which is automatically constructed from the ground-truth translation by omitting words. We design different types of negative examples depending on the number of omitted words, word frequency, and part of speech. Experiments on Chinese-to-English, German-to-English, and Russian-to-English translation tasks show that our approach is effective in reducing word omission errors and achieves better translation performance than three baseline methods. View details
    No Results Found