Jump to Content
Nan Ding

Nan Ding

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Effective scaling and a flexible task interface enable large-capacity language models to excel at many tasks. PaLI (Pathways Language and Image model) extends these ideas to the joint modeling of language and vision. PaLI is a model that generates text based on visual and textual inputs. Using this API, PaLI is able to perform many vision, language, and multimodal tasks, across many languages. We train PaLI with two main principles: reuse of pretrained unimodal components, and joint scaling of modalities. Using large-capacity pretrained language models and vision models allows us to capitalize on their existing capabilities, while leveraging the substantial cost of training them. We scale PaLI models across three axes:the language component, the vision component, and the training data that fuses them. For the vision component, we train the largest and best-performing VisionTransformer (ViT) to date. For the data, we build an image-text training set over10B images and covering over 100 languages. PaLI inherits and enhances language-understanding capabilities, and achieves state-of-the-art in multiple vision and language tasks (image classification, image captioning, visual question-answering, scene-text understanding, etc.), based on a simple, modular, and reuse-friendly platform for modeling and scaling. View details
    Preview abstract Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high-quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that lacks in the same model trained on human-annotated VQA data. View details
    Preview abstract The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements, inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset, as well as benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. The quantitative and qualitative results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks. View details
    Preview abstract Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing meta-learning theorems to explain the performance improvements in the few-shot learning setting, where the number of samples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of samples in the observed tasks and the number of samples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theorems. Furthermore, we derive a new computationally efficient PAC-Bayesian algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets. View details
    Preview abstract This paper introduces TeaForN, an extension of the teacher-forcing method to N-grams. Sequence generation models trained with teacher-forcing suffer from problems such as exposure bias and lack of differentiability across timesteps. TeaForN addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model-parameter updates based on N prediction steps. Unlike other approaches, TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts model quality and beam-efficiency against several sequence generation benchmarks. View details
    Characterizing Quantum Supremacy in Near-Term Devices
    Sergei Isakov
    Vadim Smelyanskiy
    Michael J. Bremner
    John Martinis
    Nature Physics, vol. 14 (2018), 595–600
    Preview abstract A critical question for quantum computing in the near future is whether quantum devices without error correction can perform a well-defined computational task beyond the capabilities of supercomputers. Such a demonstration of what is referred to as quantum supremacy requires a reliable evaluation of the resources required to solve tasks with classical approaches. Here, we propose the task of sampling from the output distribution of random quantum circuits as a demonstration of quantum supremacy. We extend previous results in computational complexity to argue that this sampling task must take exponential time in a classical computer. We introduce cross-entropy benchmarking to obtain the experimental fidelity of complex multiqubit dynamics. This can be estimated and extrapolated to give a success metric for a quantum supremacy demonstration. We study the computational cost of relevant classical algorithms and conclude that quantum supremacy can be achieved with circuits in a two-dimensional lattice of 7 × 7 qubits and around 40 clock cycles. This requires an error rate of around 0.5% for two-qubit gates (0.05% for one-qubit gates), and it would demonstrate the basic building blocks for a fault-tolerant quantum computer View details
    Preview abstract We present a new dataset of image caption annotations, CHIA, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both image and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of Internet webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 CNN for image-feature extraction and Transformer for sequence modeling achieves best performance when trained on the CHIA dataset. We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 for image-feature extraction and Transformer for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset. View details
    Preview abstract Supervised training of abstractive language generation models results in learning conditional probabilities over language sequences based on the supervised training signal. When the training signal contains a variety of writing styles, such models may end up learning an 'average' style that is directly influenced by the training data make-up and cannot be controlled by the needs of an application. We describe a family of model architectures capable of capturing both generic language characteristics via shared model parameters, as well as particular style characteristics via private model parameters. Such models are able to generate language according to a specific learned style, while still taking advantage of their power to model generic language phenomena. Furthermore, we describe an extension that uses a mixture of output distributions from all learned styles to perform on-the-fly style adaptation based on the textual input alone. Experimentally, we find that the proposed models consistently outperform models that encapsulate single-style or average-style language generation capabilities. View details
    Preview abstract Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax policy that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks. View details
    Preview abstract We describe a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identify the most suitable \emph{text} describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing ``keywords'' (or key-phrases) and their corresponding visual concepts, and instead requires an alignment between the representations of the two modalities that achieves a visually-grounded ``understanding'' of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-understood metric: the accuracy in detecting the true target among the decoys. The paper makes several contributions: a generic mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; results on a human evaluation on this dataset, thus providing a performance ceiling; and several baseline and competitive learning approaches that illustrate the utility of the proposed framework in advancing both image and language machine comprehension. In particular, there is a large gap between human performance and state-of-the-art learning methods, suggesting a fruitful direction for future research. View details
    What is the Computational Value of Finite Range Tunneling?
    Sergei Isakov
    Vadim Smelyanskiy
    John Martinis
    Physical Review X, vol. 6 (2016), pp. 031015
    Preview abstract Quantum annealing (QA) has been proposed as a quantum enhanced optimization heuristic exploiting tunneling. Here, we demonstrate how finite-range tunneling can provide considerable computational advantage. For a crafted problem designed to have tall and narrow energy barriers separating local minima, the D-Wave 2X quantum annealer achieves significant runtime advantages relative to simulated annealing (SA). For instances with 945 variables, this results in a time-to-99%-success-probability that is ~1e8 times faster than SA running on a single processor core. We also compare physical QA with the quantum Monte Carlo algorithm, an algorithm that emulates quantum tunneling on classical processors. We observe a substantial constant overhead against physical QA: D-Wave 2X again runs up to ~ 1e8 times faster than an optimized implementation of the quantum Monte Carlo algorithm on a single core. We note that there exist heuristic classical algorithms that can solve most instances of Chimera structured problems in a time scale comparable to the D-Wave 2X. However, it is well known that such solvers will become ineffective for sufficiently dense connectivity graphs. To investigate whether finite-range tunneling will also confer an advantage for problems of practical interest, we conduct numerical studies on binary optimization problems that cannot yet be represented on quantum hardware. For random instances of the number partitioning problem, we find numerically that algorithms designed to simulate QA scale better than SA. We discuss the implications of these findings for the design of next-generation quantum annealers. View details
    Scalable Quantum Simulation of Molecular Energies
    Ian Kivlichan
    Jonathan Romero
    Rami Barends
    Andrew Tranter
    Brooks Campbell
    Yu Chen
    Zijun Chen
    Ben Chiaro
    Andrew Dunsworth
    Anthony Megrant
    Josh Mutus
    Charles Neil
    Jim Wenner
    Amit Vainsencher
    Peter Coveney
    Peter Love
    Alán Aspuru-Guzik
    John Martinis
    Physical Review X, vol. 6 (2016), pp. 031007
    Preview abstract We report the first electronic structure calculation performed on a quantum computer without exponentially costly precompilation. We use a programmable array of superconducting qubits to compute the energy surface of molecular hydrogen using two distinct quantum algorithms. First, we experimentally execute the unitary coupled cluster method using the variational quantum eigensolver. Our efficient implementation predicts the correct dissociation energy to within chemical accuracy of the numerically exact result. Second, we experimentally demonstrate the canonical quantum algorithm for chemistry, which consists of Trotterization and quantum phase estimation. We compare the experimental performance of these approaches to show clear evidence that the variational quantum eigensolver is robust to certain errors. This error tolerance inspires hope that variational quantum simulations of classically intractable molecules may be viable in the near future. View details
    Preview abstract We present a dual contribution to the task of machine reading-comprehension: a technique for creating large-sized machine-comprehension (MC) datasets using paragraph-vector models; and a novel, hybrid neural-network architecture that combines the representation power of recursive neural networks with the discriminative power of fully-connected multi-layered networks. We use the MC-dataset generation technique to build a dataset of around 2 million examples, for which we empirically determine the high-ceiling of human performance (around 91\% accuracy), as well as the performance of a variety of computer models. Among all the models we have experimented with, our hybrid neural-network architecture achieves the highest performance (83.2\% accuracy). The remaining gap to the human-performance ceiling provides enough room for future model improvements. View details
    Preview abstract We present a family of neural-network–inspired models for computing continuous word representation, specifically designed to exploit monolingual and multilingual text, without and with annotations (syntactic dependencies, word alignments, etc.). We find that this framework allows us to train embeddings with significantly higher accuracy on syntactic and semantic compositionality, as well as multilingual semantic similarity, compared to previous models. We also show that some of these embeddings can be used to improve the performance of a state-of-the-art machine translation system for words outside the vocabulary of the parallel training data. View details
    Embedding Inference for Structured Multilabel Prediction
    Farzaneh Mirzazadeh
    Siamak Ravanbakhsh
    Dale Schuurmans
    Advances in Neural Information Processing Systems (2015)
    On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators
    Changyou Chen
    Lawrence Carin
    Advances in Neural Information Processing Systems (2015)
    Bayesian Sampling using Stochastic Gradient Thermostats
    Youhan Fang
    Changyou Chen
    Robert Skeel
    Advances in Neural Information Processing Systems (2014), pp. 3203-3211
    Preview abstract Dynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and Langevin dynamics (LD), are commonly used to sample target distributions. Recently, such approaches have been combined with stochastic gradient techniques to increase sampling efficiency when dealing with large datasets. An outstanding problem with this approach is that the stochastic gradient introduces an unknown amount of noise which can prevent proper sampling after discretization. To remedy this problem, we show that one can leverage a small number of additional variables to stabilize momentum fluctuations induced by the unknown noise. Our method is inspired by the idea of a thermostat in statistical physics and is justified by a general theory. View details
    Preview abstract Quantum annealing is a heuristic quantum algorithm which exploits quantum resources to minimize an objective function embedded as the energy levels of a programmable physical system. To take advantage of a potential quantum advantage, one needs to be able to map the problem of interest to the native hardware with reasonably low overhead. Because experimental considerations constrain our objective function to take the form of a low degree PUBO (polynomial unconstrained binary optimization), we employ non-convex loss functions which are polynomial functions of the margin. We show that these loss functions are robust to label noise and provide a clear advantage over convex methods. These loss functions may also be useful for classical approaches as they compile to regularized risk expressions which can be evaluated in constant time with respect to the number of training examples. View details
    Large-Scale Object Classification Using Label Relation Graphs
    Jia Deng
    Yangqing Jia
    Andrea Frome
    Samy Bengio
    Yuan Li
    European Conference on Computer Vision (2014)
    Preview abstract . In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can signifi- cantly improve object classification by exploiting the label relations. View details
    Differential Topic Models
    Changyou Chen
    Wray Buntine
    Lexing Xie
    Lan Du
    IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37 (2015), pp. 230-242