Yong Cheng
Research Areas
Authored Publications
Sort By
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Xiuye Gu
Jonathan Huang
Grant Schindler
Rachel Hornung
Vighnesh Birodkar
Jimmy Yan
Ming-Chang Chiu
Hassan Akbari
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Kihyuk Sohn
Xuan Yang
Huisheng Wang
Lu Jiang
ICML (2024)
Preview abstract
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
View details
Towards Conversational Diagnostic AI
Anil Palepu
Khaled Saab
Jan Freyberg
Ryutaro Tanno
Amy Wang
Brenna Li
Nenad Tomašev
Karan Singhal
Le Hou
Albert Webson
Kavita Kulkarni
Sara Mahdavi
Juro Gottweis
Joelle Barral
Kat Chou
Arxiv (2024) (to appear)
Preview abstract
At the heart of medicine lies the physician-patient dialogue, where skillful history-taking paves the way for accurate diagnosis, effective management, and enduring trust. Artificial Intelligence (AI) systems capable of diagnostic dialogue could increase accessibility, consistency, and quality of care. However, approximating clinicians' expertise is an outstanding grand challenge. Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a novel self-play based simulated environment with automated feedback mechanisms for scaling learning across diverse disease conditions, specialties, and contexts. We designed a framework for evaluating clinically-meaningful axes of performance including history-taking, diagnostic accuracy, management reasoning, communication skills, and empathy. We compared AMIE's performance to that of primary care physicians (PCPs) in a randomized, double-blind crossover study of text-based consultations with validated patient actors in the style of an Objective Structured Clinical Examination (OSCE). The study included 149 case scenarios from clinical providers in Canada, the UK, and India, 20 PCPs for comparison with AMIE, and evaluations by specialist physicians and patient actors. AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors. Our research has several limitations and should be interpreted with appropriate caution. Clinicians were limited to unfamiliar synchronous text-chat which permits large-scale LLM-patient interactions but is not representative of usual clinical practice. While further research is required before AMIE could be translated to real-world settings, the results represent a milestone towards conversational diagnostic AI.
View details
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Nitesh Bharadwaj Gundavarapu
Luca Versari
Kihyuk Sohn
Agrim Gupta
Xiuye Gu
Alex Hauptmann
Boqing Gong
Lu Jiang
ICLR (2024)
Preview abstract
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
View details
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Zhiruo Wang
Yonatan Bisk
Alex Hauptmann
Lu Jiang
NeurIPS (2023)
Preview abstract
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
View details
Towards Accurate Differential Diagnosis with Large Language Models
Daniel McDuff
Anil Palepu
Amy Wang
Karan Singhal
Yash Sharma
Kavita Kulkarni
Le Hou
Sara Mahdavi
Sushant Prakash
Anupam Pathak
Shwetak Patel
Ewa Dominowska
Juro Gottweis
Joelle Barral
Kat Chou
Jake Sunshine
Arxiv (2023)
Preview abstract
An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
View details
MAGVIT: Masked Generative Video Transformer
Kihyuk Sohn
Han Zhang
Huiwen Chang
Alex Hauptmann
Lu Jiang
CVPR (2023)
Preview abstract
This paper introduces a Masked Generative Video Transformer, named MAGVIT, for multi-task video generation. We train a single MAGVIT model and apply it to multiple video generation tasks at inference time. To this end, two new designs are proposed: an improved 3D tokenizer model to quantize a video into spatial-temporal visual tokens, and a novel technique to embed conditions inside the mask to facilitate multi-task training.
We conduct extensive experiments to demonstrate the compelling quality, efficiency, and flexibility of the proposed model.
First, MAGVIT radically improves the previous best fidelity on two video generation tasks.
In terms of efficiency, MAGVIT offers leading video generation speed at inference time, which is estimated to be one or two orders-of-magnitudes faster than other models. As for flexibility, we verified that a single trained MAGVIT is able to generically perform 8+ tasks at several video benchmarks from drastically different visual domains. We will open source our framework and models.
View details
Preview abstract
We present Mu2SLAM, a multilingual sequence-to-sequence model pre-trained jointly on un-labeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition(ASR), Automatic Speech Translation (AST)and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu2SLAM trains ona sequence-to-sequence masked denoising objective similar to T5 on both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoSTAST, Mu2SLAM establishes a new state-of-the-art for models trained on public datasets, improv-ing on xx-en translation over the previous best by 1.9 Bleu points and on en-xx translation by 0.9 Bleu points. On Voxpopuli ASR, our model matches the performance of a mSLAM model finetuned with a RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.
View details
Preview abstract
Multilingual neural machine translation (NMT) typically learns to maximize the likelihood of training examples from a combination set of multiple language pairs. However, this mechanical combination only relies on the basic sharing to learn the inductive bias, which undermines the generalization and transferability of multilingual NMT models. In this paper, we introduce a multilingual crossover encoder-decoder (mXEnDec) to fuse language pairs at instance level to exploit cross-lingual signals. For better fusions on multilingual data, we propose several techniques to deal with the language interpolation, dissimilar language fusion and heavy data imbalance. Experimental results on a large-scale WMT multilingual data set show that our approach significantly improves model performance on general multilingual test sets and the model transferability on zero-shot test sets (up to $+5.53$ BLEU).
Results on noisy inputs demonstrates the capability of our approach to improve model robustness against the code-switching noise. We also conduct qualitative and quantitative representation comparisons to analyze the advantages of our approach at the representation level.
View details
Preview abstract
Recently, self-supervised pre-training of text representations has been success-fully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve dramatic success on resource-rich NMT. In this paper, we propose a joint training approach, F2-XEnDec, to jointly self-supervised and supervised train NMT models. To this end, a new task called crossover encoder-decoder (XEnDec) is designed to entangle their representations. The key idea is to combine pseudo parallel sentences (also generated byXEnDec)) used in self-supervised training and parallel sentences in supervised training through a second crossover. Experiments on two resource-rich translation benchmarks, WMT’14English-German and English-French, demonstrate our approach achieve substantial improvements over the Transformer. We also show that our approach is capable of improving the model robustness against input perturbations, in particular for code-switched perturbations.
View details
Towards Web-based Etymological Hanzi learning
Genze Wu
Jia Xing
Julia (Wenli) Zhu
Jun Chen
Kevin Jing
Sijia Ma
Wenhui Guo
Yaolin Chen
Yingying Zhao
(2020)
Preview abstract
Modern-day Chinese characters, or Hanzi, originate from the ancient oracle-bone scripts (甲骨文). Such etymological relationship creates unique opportunities for Chinese literacy learning. This work proposes to use Web-based tools and the latest machine learning techniques to scale-up and enhance etymological Hanzi learning. By sharing our implementation details from launching an interactive sketch-based learning exhibition, we hope education-AI becomes more widely incorporated into today’s commercial Web applications.
View details