Nan Ding
Research Areas
Authored Publications
Sort By
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Piotr Padlewski
Daniel Salz
Sebastian Alexander Goodman
Basil Mustafa
Keran Rong
Hassan Akbari
Linting Xue
James Bradbury
Chao Jia
Carlos Riquelme
Neil Houlsby
International Conference on Learning Representations (ICLR) (2023)
Preview abstract
Effective scaling and a flexible task interface enable large-capacity language models to excel at many tasks. PaLI (Pathways Language and Image model) extends these ideas to the joint modeling of language and vision. PaLI is a model that generates text based on visual and textual inputs. Using this API, PaLI is able to perform many vision, language, and multimodal tasks, across many languages. We train PaLI with two main principles: reuse of pretrained unimodal components, and joint scaling of modalities. Using large-capacity pretrained language models and vision models allows us to capitalize on their existing capabilities, while leveraging the substantial cost of training them. We scale PaLI models across three axes:the language component, the vision component, and the training data that fuses them. For the vision component, we train the largest and best-performing VisionTransformer (ViT) to date. For the data, we build an image-text training set over10B images and covering over 100 languages.
PaLI inherits and enhances language-understanding capabilities, and achieves state-of-the-art in multiple vision and language tasks (image classification, image captioning, visual question-answering, scene-text understanding, etc.), based on a simple, modular, and reuse-friendly platform for modeling and scaling.
View details
Preview abstract
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high-quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that lacks in the same model trained on human-annotated VQA data.
View details
Preview abstract
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements, inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset, as well as benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. The quantitative and qualitative results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
View details
Bridging the Gap Between Practice and PAC-Bayes Theory in Few-shot Meta-learning
Sebastian Alexander Goodman
Advances in Neural Information Processing Systems 2021
Preview abstract
Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing meta-learning theorems to explain the performance improvements in the few-shot learning setting, where the number of samples in the target tasks is severely limited.
This gap originates from an assumption in the existing theories which supposes that the number of samples in the observed tasks and the number of samples in the target tasks follow the same distribution, an assumption that rarely holds in practice.
By relaxing this assumption we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theorems.
Furthermore, we derive a new computationally efficient PAC-Bayesian algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.
View details
Preview abstract
This paper introduces TeaForN, an extension of the teacher-forcing method to N-grams.
Sequence generation models trained with teacher-forcing suffer from problems such as exposure bias and lack of differentiability across timesteps.
TeaForN addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model-parameter updates based on N prediction steps.
Unlike other approaches, TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup.
Empirically, we show that TeaForN boosts model quality and beam-efficiency against several sequence generation benchmarks.
View details
Preview abstract
Supervised training of abstractive language generation models results in learning conditional probabilities over language sequences based on the supervised training signal. When the training signal contains a variety of writing styles, such models may end up learning an 'average' style that is directly influenced by the training data make-up and cannot be controlled by the needs of an application. We describe a family of model architectures capable of capturing both generic language characteristics via shared model parameters, as well as particular style characteristics via private model parameters. Such models are able to generate language according to a specific learned style, while still taking advantage of their power to model generic language phenomena. Furthermore, we describe an extension that uses a mixture of output distributions from all learned styles to perform on-the-fly style adaptation based on the textual input alone. Experimentally, we find that the proposed models consistently outperform models that encapsulate single-style or average-style language generation capabilities.
View details
Preview abstract
We present a new dataset of image caption annotations, CHIA, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both image and image
caption styles. We achieve this by extracting and filtering image caption annotations from billions of Internet webpages. We also present quantitative evaluations of a number of image captioning models
and show that a model architecture based on Inception-ResNet-v2 CNN for image-feature extraction and Transformer for sequence modeling achieves best performance when trained on the CHIA dataset.
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 for image-feature extraction and Transformer for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.
View details
Characterizing Quantum Supremacy in Near-Term Devices
Sergei Isakov
Vadim Smelyanskiy
Michael J. Bremner
John Martinis
Nature Physics, 14 (2018), 595–600
Preview abstract
A critical question for quantum computing in the near future is whether quantum devices without error correction can perform a well-defined computational task beyond the capabilities of supercomputers. Such a demonstration of what is referred to as quantum supremacy requires a reliable evaluation of the resources required to solve tasks with classical approaches. Here, we propose the task of sampling from the output distribution of random quantum circuits as a demonstration of quantum supremacy. We extend previous results in computational complexity to argue that this sampling task must take exponential time in a classical computer. We introduce cross-entropy benchmarking to obtain the experimental fidelity of complex multiqubit dynamics. This can be estimated and extrapolated to give a success metric for a quantum supremacy demonstration. We study the computational cost of relevant classical algorithms and conclude that quantum supremacy can be achieved with circuits in a two-dimensional lattice of 7 × 7 qubits and around 40 clock cycles. This requires an error rate of around 0.5% for two-qubit gates (0.05% for one-qubit gates), and it would demonstrate the basic building blocks for a fault-tolerant quantum computer
View details
Preview abstract
Policy-gradient approaches to reinforcement learning have two common and
undesirable overhead procedures, namely warm-start training and sample variance
reduction. In this paper, we describe a reinforcement learning method based on
a softmax policy that requires neither of these procedures. Our method combines
the advantages of policy-gradient methods with the efficiency and simplicity of
maximum-likelihood approaches. We apply this new cold-start reinforcement
learning method in training sequence generation models for structured output
prediction problems. Empirical evidence validates this method on automatic
summarization and image captioning tasks.
View details
Preview abstract
We present a dual contribution to the task of machine reading-comprehension: a technique for creating large-sized machine-comprehension (MC) datasets using paragraph-vector models; and a novel, hybrid neural-network architecture that combines the representation power of recursive neural networks with the discriminative power of fully-connected multi-layered networks. We use the MC-dataset generation technique to build a dataset of around 2 million examples, for which we empirically determine the high-ceiling of human performance (around 91\% accuracy), as well as the performance of a variety of computer models. Among all the models we have experimented with, our hybrid neural-network architecture achieves the highest performance (83.2\% accuracy). The remaining gap to the human-performance ceiling provides enough room for future model improvements.
View details