Jump to Content
David J Fleet

David J Fleet

Research Areas

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompt, while consistent with the input image. We present Imagen Editor, a cascaded diffusion model, built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by incorporating object detectors for proposing inpainting masks during training. In addition, text-guided image inpainting captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes. View details
    Towards Generalist Biomedical AI
    Andrew Carroll
    Basil Mustafa
    Bradley Green
    Chuck Lau
    Danny Driess
    Ewa Dominowska
    Ira Ktena
    Joelle Barral
    Philip Mansfield
    Renee Wong
    Ryutaro Tanno
    Sara Mahdavi
    Simon Kornblith
    Sunny Virmani
    Sushant Prakash
    arxiv (2023)
    Preview abstract Medicine is inherently multimodal, with data spanning rich modalities including text, imaging, medical records, genomics, and more. Generalist biomedical AI systems that can flexibly encode, interpret, and integrate such data at scale can potentially enable impactful applications spanning care delivery and fundamental scientific discovery. We first curate MultiMedBench, a new multimodal biomedical benchmark to enable the development of such generalist models. MultiMedBench spans 14 diverse tasks including medical question answering, mammography and dermatology image interpretation, chest x-ray report generation and summarization and genomics variant calling. We then introduce Med-PaLM M (multimodal) - our proof of concept for such a generalist biomedical AI system. Med-PaLM M, built on top of PaLM-E, a large multimodal language model, flexibly encodes and interprets biomedical data spanning clinical language, medical imaging and genomics and can competently perform a diverse array of tasks with the same set of model weights. Med-PaLM M reaches performance near or exceeding the state-of-the-art (SOTA) on all tasks in MultiMedBench often beating task-specific narrow models by a wide margin. In addition, we also show qualitative examples of zero-shot learning, cross-task transfer learning and generalization with Med-PaLM M. In order to further probe the capabilities and limitations of Med-PaLM M, we propose an expert radiologist evaluation framework to characterize the quality of the chest x-ray radiology reports generated by the models. Under this framework, we observe encouraging quality of reports across model scales although remaining inferior to expert clinicians. Overall, while there are still several key limitations, we believe these results represent an important milestone towards the development of generalist biomedical AI systems. View details
    Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging
    Laura Anne Culp
    Jan Freyberg
    Basil Mustafa
    Sebastien Baur
    Simon Kornblith
    Ting Chen
    Patricia MacWilliams
    Sara Mahdavi
    Megan Zoë Walker
    Aaron Loh
    Cameron Chen
    Scott Mayer McKinney
    Zach William Beaver
    Fiona Keleher Ryan
    Mozziyar Etemadi
    Umesh Telang
    Lily Hao Yi Peng
    Geoffrey Everest Hinton
    Mohammad Norouzi
    Nature Biomedical Engineering (2023)
    Preview abstract Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such ‘out of distribution’ performance problem and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for ‘Robust and Efficient Medical Imaging with Self-supervision’), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging. View details
    A Generalist Framework for Panoptic Segmentation of Images and Videos
    Ting Chen
    Geoffrey Hinton
    International Conference on Computer Vision (ICCV) (2023)
    Preview abstract Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning of high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive bias of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively to state-of-the-art specialist methods in similar settings. View details
    Kubric: A scalable dataset generator
    Anissa Yuenming Mak
    Austin Stone
    Carl Doersch
    Cengiz Oztireli
    Charles Herrmann
    Daniel Rebain
    Derek Nowrouzezahrai
    Dmitry Lagun
    Fangcheng Zhong
    Florian Golemo
    Francois Belletti
    Henning Meyer
    Hsueh-Ti (Derek) Liu
    Issam Laradji
    Klaus Greff
    Kwang Moo Yi
    Matan Sela
    Noha Radwan
    Thomas Kipf
    Tianhao Wu
    Vincent Sitzmann
    Yilun Du
    Yishu Miao
    Preview abstract Data is the driving force of machine learning. The amount and quality of training data is often more important for the performance of a system than the details of its architecture. Data is also an important tool for testing specific hypothesis, and for empirically evaluating the behaviour of complex systems. Synthetic data generation represents a powerful tool that can address all these shortcomings: 1) it is cheap 2) supports rich ground-truth annotations 3) offers full control over data and 4) can circumvent privacy and legal concerns. Unfortunately the toolchain for generating data is less well developed than that for building models. We aim to improve this situation by introducing Kubric: a scalable open-source pipeline for generating realistic image and video data with rich ground truth annotations. We also publish a collection of generated datasets and baseline results on several vision tasks. View details
    Palette: Image-to-Image Diffusion Models
    Chitwan Saharia
    Chris A. Lee
    Huiwen Chang
    Jonathan Ho
    Mohammad Norouzi
    William Chan
    SIGGRAPH 2022 (2022)
    Preview abstract This paper develops a unified framework for image-to-image translation based on conditional diffusion models and evaluates this framework on four challenging image-to-image translation tasks, namely colorization, inpainting, uncropping, and JPEG restoration. Our simple implementation of image-to-image diffusion models outperforms strong GAN and regression baselines on all tasks, without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss or sophisticated new techniques needed. We uncover the impact of an L2 vs. L1 loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention in the neural architecture through empirical studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, with human evaluation and sample quality scores (FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against original images). We expect this standardized evaluation protocol to play a role in advancing image-to-image translation research. Finally, we show that a generalist, multi-task diffusion model performs as well or better than task-specific specialist counterparts. Check out https://diffusion-palette.github.io/ for an overview of the results. View details
    A Unified Sequence Interface for Vision Tasks
    Ting Chen
    Tsung-Yi Lin
    Geoffrey Hinton
    Advances in Neural Information Processing Systems (NeurIPS) (2022)
    Preview abstract While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models. View details
    Pix2seq: A Language Modeling Framework for Object Detection
    Ting Chen
    Geoffrey Everest Hinton
    International Conference on Learning Representations (2022)
    Preview abstract We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural network knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms. View details
    Image Super-Resolution via Iterative Refinement
    Chitwan Saharia
    Jonathan Ho
    Mohammad Norouzi
    William Chan
    Submission to ICCV 2021
    Preview abstract We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models to conditional image generation and performs super-resolution through a stochastic denoising process. Inference starts with pure Gaussian noise and iteratively refines the noisy output using a U-Net model trained on denoising at various noise levels. SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct human evaluation on a standard 8× face super-resolution task on CelebA-HQ, comparing with SOTA GAN methods. SR3 achieves a fool rate close to 50%, suggesting photo-realistic outputs, while GAN baselines do not exceed a fool rate of 34%. We further show the effectiveness of SR3 in cascaded image generation, where generative models are chained with super-resolution models, yielding competitive FID scores on ImageNet. View details
    Cascaded Diffusion Models for High Fidelity Image Generation
    Jonathan Ho
    Chitwan Saharia
    William Chan
    Mohammad Norouzi
    https://cascaded-diffusion.github.io/ (2021)
    Preview abstract We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation challenge, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2. View details
    No Results Found