Nan Ding
Authored Publications
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Piotr Padlewski
Daniel Salz
Sebastian Alexander Goodman
Basil Mustafa
Keran Rong
Hassan Akbari
Linting Xue
James Bradbury
Carlos Riquelme
International Conference on Learning Representations (ICLR) (2023)
Preview abstract
Effective scaling and a flexible task interface enable large-capacity language models to excel at many tasks. PaLI (Pathways Language and Image model) extends these ideas to the joint modeling of language and vision. PaLI is a model that generates text based on visual and textual inputs. Using this API, PaLI is able to perform many vision, language, and multimodal tasks, across many languages. We train PaLI with two main principles: reuse of pretrained unimodal components, and joint scaling of modalities. Using large-capacity pretrained language models and vision models allows us to capitalize on their existing capabilities, while leveraging the substantial cost of training them. We scale PaLI models across three axes: the language component, the vision component, and the training data that fuses them. For the vision component, we train the largest and best-performing Vision Transformer (ViT) to date. For the data, we build an image-text training set of over 10B images covering over 100 languages.
PaLI inherits and enhances language-understanding capabilities, and achieves state-of-the-art results in multiple vision and language tasks (image classification, image captioning, visual question-answering, scene-text understanding, etc.), based on a simple, modular, and reuse-friendly platform for modeling and scaling.
Preview abstract
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.
Bridging the Gap Between Practice and PAC-Bayes Theory in Few-shot Meta-learning
Sebastian Alexander Goodman
Advances in Neural Information Processing Systems 2021
Preview abstract
Despite recent advances in the theoretical understanding of meta-learning, there still remains a significant gap in the ability of existing meta-learning theorems to explain the performance improvements in the few-shot learning setting, where the number of samples in the target tasks is severely limited.
This gap originates from an assumption in the existing theories which supposes that the number of samples in the observed tasks and the number of samples in the target tasks follow the same distribution, an assumption that rarely holds in practice.
By relaxing this assumption we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theorems.
Furthermore, we derive a new computationally efficient PAC-Bayesian algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.
Preview abstract
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements, inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset, and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. The quantitative and qualitative results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
Preview abstract
This paper introduces TeaForN, an extension of the teacher-forcing method to N-grams.
Sequence generation models trained with teacher-forcing suffer from problems such as exposure bias and lack of differentiability across timesteps.
TeaForN addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model-parameter updates based on N prediction steps.
Unlike other approaches, TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup.
Empirically, we show that TeaForN boosts model quality and beam-efficiency against several sequence generation benchmarks.
Preview abstract
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 for image-feature extraction and Transformer for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.
Preview abstract
Supervised training of abstractive language generation models results in learning conditional probabilities over language sequences based on the supervised training signal. When the training signal contains a variety of writing styles, such models may end up learning an 'average' style that is directly influenced by the training data make-up and cannot be controlled by the needs of an application. We describe a family of model architectures capable of capturing both generic language characteristics via shared model parameters, as well as particular style characteristics via private model parameters. Such models are able to generate language according to a specific learned style, while still taking advantage of their power to model generic language phenomena. Furthermore, we describe an extension that uses a mixture of output distributions from all learned styles to perform on-the-fly style adaptation based on the textual input alone. Experimentally, we find that the proposed models consistently outperform models that encapsulate single-style or average-style language generation capabilities.
Characterizing Quantum Supremacy in Near-Term Devices
Sergei Isakov
Vadim Smelyanskiy
Michael J. Bremner
John Martinis
Nature Physics, vol. 14 (2018), 595–600
Preview abstract
A critical question for quantum computing in the near future is whether quantum devices without error correction can perform a well-defined computational task beyond the capabilities of supercomputers. Such a demonstration of what is referred to as quantum supremacy requires a reliable evaluation of the resources required to solve tasks with classical approaches. Here, we propose the task of sampling from the output distribution of random quantum circuits as a demonstration of quantum supremacy. We extend previous results in computational complexity to argue that this sampling task must take exponential time on a classical computer. We introduce cross-entropy benchmarking to obtain the experimental fidelity of complex multiqubit dynamics. This can be estimated and extrapolated to give a success metric for a quantum supremacy demonstration. We study the computational cost of relevant classical algorithms and conclude that quantum supremacy can be achieved with circuits in a two-dimensional lattice of 7 × 7 qubits and around 40 clock cycles. This requires an error rate of around 0.5% for two-qubit gates (0.05% for one-qubit gates), and it would demonstrate the basic building blocks for a fault-tolerant quantum computer.
Preview abstract
Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax policy that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks.
What is the Computational Value of Finite Range Tunneling?
Sergei Isakov
Vadim Smelyanskiy
John Martinis
Physical Review X, vol. 6 (2016), pp. 031015
Preview abstract
Quantum annealing (QA) has been proposed as a quantum-enhanced optimization heuristic exploiting tunneling. Here, we demonstrate how finite-range tunneling can provide considerable computational advantage. For a crafted problem designed to have tall and narrow energy barriers separating local minima, the D-Wave 2X quantum annealer achieves significant runtime advantages relative to simulated annealing (SA). For instances with 945 variables, this results in a time-to-99%-success-probability that is ~1e8 times faster than SA running on a single processor core. We also compare physical QA with the quantum Monte Carlo algorithm, an algorithm that emulates quantum tunneling on classical processors. We observe a substantial constant overhead against physical QA: D-Wave 2X again runs up to ~1e8 times faster than an optimized implementation of the quantum Monte Carlo algorithm on a single core. We note that there exist heuristic classical algorithms that can solve most instances of Chimera-structured problems in a time scale comparable to the D-Wave 2X. However, it is well known that such solvers will become ineffective for sufficiently dense connectivity graphs. To investigate whether finite-range tunneling will also confer an advantage for problems of practical interest, we conduct numerical studies on binary optimization problems that cannot yet be represented on quantum hardware. For random instances of the number partitioning problem, we find numerically that algorithms designed to simulate QA scale better than SA. We discuss the implications of these findings for the design of next-generation quantum annealers.
Scalable Quantum Simulation of Molecular Energies
Ian Kivlichan
Jonathan Romero
Rami Barends
Andrew Tranter
Brooks Campbell
Yu Chen
Zijun Chen
Ben Chiaro
Andrew Dunsworth
Anthony Megrant
Josh Mutus
Charles Neil
Jim Wenner
Amit Vainsencher
Peter Coveney
Peter Love
Alán Aspuru-Guzik
John Martinis
Physical Review X, vol. 6 (2016), pp. 031007
Preview abstract
We report the first electronic structure calculation performed on a quantum computer without exponentially costly precompilation. We use a programmable array of superconducting qubits to compute the energy surface of molecular hydrogen using two distinct quantum algorithms. First, we experimentally execute the unitary coupled cluster method using the variational quantum eigensolver. Our efficient implementation predicts the correct dissociation energy to within chemical accuracy of the numerically exact result. Second, we experimentally demonstrate the canonical quantum algorithm for chemistry, which consists of Trotterization and quantum phase estimation. We compare the experimental performance of these approaches to show clear evidence that the variational quantum eigensolver is robust to certain errors. This error tolerance inspires hope that variational quantum simulations of classically intractable molecules may be viable in the near future.
Preview abstract
We present a dual contribution to the task of machine reading-comprehension: a technique for creating large-sized machine-comprehension (MC) datasets using paragraph-vector models; and a novel, hybrid neural-network architecture that combines the representation power of recursive neural networks with the discriminative power of fully-connected multi-layered networks. We use the MC-dataset generation technique to build a dataset of around 2 million examples, for which we empirically determine the high-ceiling of human performance (around 91% accuracy), as well as the performance of a variety of computer models. Among all the models we have experimented with, our hybrid neural-network architecture achieves the highest performance (83.2% accuracy). The remaining gap to the human-performance ceiling provides enough room for future model improvements.
Preview abstract
We describe a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identify the most suitable text describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing "keywords" (or key-phrases) and their corresponding visual concepts, and instead requires an alignment between the representations of the two modalities that achieves a visually-grounded "understanding" of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-understood metric: the accuracy in detecting the true target among the decoys.
The paper makes several contributions: a generic mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; results on a human evaluation on this dataset, thus providing a performance ceiling; and several baseline and competitive learning approaches that illustrate the utility of the proposed framework in advancing both image and language machine comprehension. In particular, there is a large gap between human performance and state-of-the-art learning methods, suggesting a fruitful direction for future research.
Embedding Inference for Structured Multilabel Prediction
Farzaneh Mirzazadeh
Siamak Ravanbakhsh
Dale Schuurmans
Advances in Neural Information Processing Systems (2015)
Preview abstract
We present a family of neural-network–inspired models for computing continuous word representations, specifically designed to exploit monolingual and multilingual text, with or without annotations (syntactic dependencies, word alignments, etc.).
We find that this framework allows us to train embeddings with significantly higher accuracy on syntactic and semantic compositionality, as well as multilingual semantic similarity, compared to previous models. We also show that some of these embeddings can be used to improve the performance of a state-of-the-art machine translation system for words outside the vocabulary of the parallel training data.
Preview abstract
Quantum annealing is a heuristic quantum algorithm which exploits quantum resources to minimize an objective function embedded as the energy levels of a programmable physical system. To exploit a potential quantum advantage, one needs to be able to map the problem of interest to the native hardware with reasonably low overhead. Because experimental considerations constrain our objective function to take the form of a low-degree PUBO (polynomial unconstrained binary optimization), we employ non-convex loss functions which are polynomial functions of the margin. We show that these loss functions are robust to label noise and provide a clear advantage over convex methods. These loss functions may also be useful for classical approaches as they compile to regularized risk expressions which can be evaluated in constant time with respect to the number of training examples.
Bayesian Sampling using Stochastic Gradient Thermostats
Youhan Fang
Changyou Chen
Robert Skeel
Advances in Neural Information Processing Systems (2014), pp. 3203-3211
Preview abstract
Dynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and Langevin dynamics (LD), are commonly used to sample target distributions. Recently, such approaches have been combined with stochastic gradient techniques to increase sampling efficiency when dealing with large datasets. An outstanding problem with this approach is that the stochastic gradient introduces an unknown amount of noise which can prevent proper sampling after discretization. To remedy this problem, we show that one can leverage a small number of additional variables to stabilize momentum fluctuations induced by the unknown noise. Our method is inspired by the idea of a thermostat in statistical physics and is justified by a general theory.
Large-Scale Object Classification Using Label Relation Graphs
Jia Deng
Yangqing Jia
Andrea Frome
Samy Bengio
Yuan Li
European Conference on Computer Vision (2014)
Preview abstract
In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.
Differential Topic Models
Changyou Chen
Wray Buntine
Lexing Xie
Lan Du
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37 (2015), pp. 230-242