Quoc V. Le

Quoc V. Le

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Solving olympiad geometry without human demonstrations
    Trieu Trinh
    Yuhuai Tony Wu
    He He
    Nature, 625(2024), pp. 476-482
    Preview abstract Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004. View details
    Noise2Music: Text-conditioned Music Generation with Diffusion Models
    Qingqing Huang
    Daniel S. Park
    Tao Wang
    Nanxin Chen
    Zhengdong Zhang
    Zhishuai Zhang
    Jiahui Yu
    Christian Frank
    William Chan
    Wei Han
    (2023)
    Preview abstract We introduce Noise2Music, where a series of diffusion models are trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one in which it is a spectrogram and the other in which it is audio with lower fidelity. We find that the generated audio is able to faithfully reflect key elements of the text prompt such as genre, mood, tempo and instruments. Language models play a key role in this story---they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. View details
    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
    Shayne Longpre
    Le Hou
    Albert Webson
    Hyung Won Chung
    Yi Tay
    Barret Zoph
    Jason Wei
    Proceedings of the 40th International Conference on Machine Learning, PMLR(2023), pp. 22631-22648
    Preview abstract We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2. View details
    Preview abstract Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%). View details
    Finetuned Language Models are Zero-Shot Learners
    Jason Wei
    Maarten Paul Bosma
    Vincent Zhao
    Adams Wei Yu
    Nan Du
    International Conference on Learning Representations(2022)
    Preview abstract This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning---finetuning language models on a collection of tasks described via instructions---substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of tasks and model scale are key components to the success of instruction tuning. View details
    Sparsely Activated Language Models are Efficient In-Context Learners
    Adams Yu
    Barret Richard Zoph
    Dmitry (Dima) Lepikhin
    Emma Wang
    Kun Zhang
    Liam B. Fedus
    Maarten Paul Bosma
    Marie Pellat
    Maxim Krikun
    Nan Du
    Simon Tong
    Tao Wang
    Toju Duke
    Yuanzhong Xu
    Zongwei Zhou
    (2022)
    Preview abstract Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong performance on few-shot learning. However, training these large dense models require significant amounts of computing resources. In this paper, we develop a family of sparsely activated mixture-of-expert language models named \glam (\textbf{G}eneralist \textbf{La}nguage \textbf{M}odel), which can have many more parameters but require significant less training cost than dense models. The largest \glam has 1.2 trillion parameters, which is approximately 7x larger than GPT-3 but can be trained more efficiently. With only 1/3 of energy consumption to train GPT-3, \glam achieves better overall performance on 29 zero-shot and one-shot NLP tasks. For example, \glam gets 75.0\% one-shot exact match accuracy on the TriviaQA test server, a significant improvement over 68.0\% obtained by GPT-3. View details
    LaMDA: Language Models for Dialog Applications
    Aaron Daniel Cohen
    Alena Butryna
    Alicia Jin
    Apoorv Kulshreshtha
    Ben Zevenbergen
    Chung-ching Chang
    Cosmo Du
    Daniel De Freitas Adiwardana
    Dehao Chen
    Dmitry (Dima) Lepikhin
    Erin Hoffman-John
    Hongrae Lee
    Igor Krivokon
    James Qin
    Jamie Hall
    Joe Fenton
    Johnny Soraker
    Lora Mois Aroyo
    Maarten Paul Bosma
    Marc Joseph Pickett
    Marcelo Amorim Menegali
    Marian Croak
    Maxim Krikun
    Meredith Ringel Morris
    Noam Shazeer
    Rachel Bernstein
    Ravi Rajakumar
    Ray Kurzweil
    Romal Thoppilan
    Steven Zheng
    Taylor Bos
    Toju Duke
    Tulsee Doshi
    Vincent Y. Zhao
    Will Rusch
    Yuanzhong Xu
    arXiv(2022)
    Preview abstract We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and arepre-trained on 1.56T words of public dialog data and web text. While model scaling alone canimprove quality, it shows less improvements on safety and factual grounding. We demonstrate thatfine-tuning with annotated data and enabling the model to consult external knowledge sources canlead to significant improvements towards the two key challenges of safety and factual grounding.The first challenge, safety, involves ensuring that the model’s responses are consistent with a set ofhuman values, such as preventing harmful suggestions and unfair bias. We quantify safety using ametric based on an illustrative set of values, and we find that filtering candidate responses using aLaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promisingapproach to improving model safety. The second challenge, factual grounding, involves enabling themodel to consult external knowledge sources, such as an information retrieval system, a languagetranslator, and a calculator. We quantify factuality using a groundedness metric, and we find that ourapproach enables the model to generate responses grounded in known sources, rather than responsesthat merely sound plausible. Finally, we explore the use of LaMDA in the domains of education andcontent recommendations, and analyze their helpfulness and role consistency. View details
    Preview abstract Many recent papers highlight the importance of thinking about carbon emissions (CO2e) in machine learning (ML) workloads. While elevating the discussion, some early work was also based on incomplete information. (Unfortunately, the most widely cited quantitative estimate that was the basis for many of these papers was off by 88X.) Inspired by these concerns, we looked for approaches that would make ML training considerably less carbon intensive. We identified four best practices that dramatically reduce carbon emissions, and demonstrate two concrete examples of reducing CO2e by 650X over four years and 40X over one year by following them. Provided ML stakeholders follow best practices, we predict that the field will bend the curve of carbon footprint increases from ML training runs to first flatten and then reduce it by 2030 without sacrificing the current rate of rapid advances in ML, contrary to prior dire warnings that ML CO2e will soar. View details
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets
    Gabriel M. Bender
    Hanxiao Liu
    Madeleine Udell
    Yifeng Lu
    Da Huang
    Neural Information Processing Systems(2022)
    Preview abstract The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints. View details
    Evolving Reinforcement Learning Algorithms
    JD Co-Reyes
    Yingjie Miao
    Daiyi Peng
    Sergey Levine
    Honglak Lee
    International Conference on Learning Representations (ICLR)(2021) (to appear)
    Preview abstract We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods. View details
    Preview abstract Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new state-of-the-art results on Flickr30K and MSCOCO benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries. View details
    Searching for Fast Models on Datacenter Accelerators
    Sheng Li
    Ruoming Pang
    Andrew Li
    Norm Jouppi
    Conference on Computer Vision and Pattern Recognition(2021)
    Preview abstract Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model families for efficient inference on DC accelerators. We first analyze DC accelerators and find that existing CNNs suffer from insufficient operational intensity, parallelism, and execution efficiency and exhibit FLOPs-latency nonproportionality. These insights let us create a DC-accelerator-optimized search space, with space-to-depth, space-to-batch, hybrid fused convolution structures with vanilla and depthwise convolutions, and block-wise activation functions. We further propose a latency-aware compound scaling (LACS), the first multi-objective compound scaling method optimizing both accuracy and latency. Our LACS discovers that network depth should grow much faster than image size and network width, which is quite different from the observations from previous compound scaling. With the new search space and LACS, our search and scaling on datacenter accelerators results in a new model series named EfficientNet-X. EfficientNet-X is up to more than 2X faster than EfficientNet (a model series with state-of-the-art trade-off on FLOPs and accuracy) on TPUv3 and GPUv100, with comparable accuracy. EfficientNet-X is also up to 7X faster than recent RegNet and ResNeSt on TPUv3 and GPUv100. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/tpu View details
    EfficientNetV2: Smaller Models and Faster Training
    International Conference on Machine Learning (ICML)(2021)
    Preview abstract This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop these models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose an improved method of progressive learning, which adaptively adjusts regularization (e.g. data augmentation) along with image size. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code is available at https://github.com/google/automl/tree/master/efficientnetv2. View details
    Primer: Searching for Efficient Transformers for Language Modeling
    David Richard So
    Wojciech Andrzej Mańke
    Hanxiao Liu
    Zihang Dai
    Noam Shazeer
    Conference on Neural Information Processing Systems(2021)
    Preview abstract Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility. View details
    Program Synthesis with Large Language Models
    Augustus Odena
    David Martin Dohan
    Ellen Jiang
    Henryk Michalewski
    Maarten Paul Bosma
    Maxwell Nye
    n/a, n/a, n/a(2021), n/a
    Preview abstract Program synthesis is one of the grand challenges of artificial intelligence, but to date practical successes have focused on narrow settings and restricted domains. Large language models trained on massive corpora of web texts which include open-source code, programming websites, and tutorials have the potential to break through this barrier.This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate the performance of the language model LaMDA PT [Freitas et al.,2021] on several program synthesis tasks, at a variety of scales ranging from 244M to 137B parameters. First, we introduce a new benchmark, Mostly Basic Programming Problems (MBPP), to measure the ability of these models to synthesize short Python programs from natural language descriptions. The benchmark consists of around 1000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and automated test-cases. We also introduce a Python version of the MathQA benchmark, which evaluates the ability of the models to synthesize code from more complex text. On both datasets, we evaluate synthesis performance and find that synthesis performance scales log-linearly with model size. In contrast to some previous work, we find that LaMDAPT achieves non-negligible preformance in a few-shot setting, although fine-tuning still performs much better. Thel argest models we consider can synthesize solutions to 58% of the problems from MBPP using few-shot learning with a well-designed prompt; across model sizes, fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points. Finally, we conduct a thorough error analysis, shedding light on where these models fall short as program synthesizers, what types of programs are most difficult to generate, and how the models might be improved. As part of that analysis, we explore the semantic grounding of these models, finding that even our largest models are generally unable to predict the output of a program given a specific input. View details
    Meta Pseudo Labels
    Hieu Pham
    Zihang Dai
    Qizhe Xie
    IEEE Conference on Computer Vision and Pattern Recognition(2021)
    Preview abstract We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher networ View details
    AutoHAS: Efficient Hyperparameter and Architecture Search
    Xuanyi Dong
    Adams Wei Yu
    Daiyi Peng
    Bogdan Gabrys
    Workshop on Neural Architecture Search at International Conference on Learning Representations (NAS@ICLR)(2021)
    Preview abstract Efficient hyperparameter or architecture search methods have shown remarkable results, but each of them is only applicable to searching for either hyperparameters (HPs) or architectures. In this work, we propose a unified pipeline, AutoHAS, to efficiently search for both architectures and hyperparameters. AutoHAS learns to alternately update the shared network weights and a reinforcement learning (RL) controller, which learns the probability distribution for the architecture candidates and HP candidates. A temporary weight is introduced to store the updated weight from the selected HPs (by the controller), and a validation accuracy based on this temporary weight serves as a reward to update the controller. In experiments, we show AutoHAS is efficient and generalizable to different search spaces, baselines and datasets. In particular, AutoHAS can improve the accuracy over popular network architectures, such as ResNet and EfficientNet, on CIFAR-10/100, ImageNet, and four more other datasets. View details
    Preview abstract Despite recent success, most contrastive self-supervised learning methods are domain-specific, relying heavily on data augmentation techniques that require knowledge about a particular domain, such as image cropping and rotation. To overcome such limitation, we propose a novel domain-agnostic approach to contrastive learning, named DACL, that is applicable to domains where invariances, and thus, data augmentation techniques, are not readily available. Key to our approach is the use of Mixup noise to create similar and dissimilar examples by mixing data samples differently either at the input or hidden-state levels.To demonstrate the effectiveness of DACL, we conduct experiments across various domains such as tabular data, images, and graphs. Our results show that DACL not only outperforms other domain-agnostic noising methods, such as Gaussian-noise, but also combines well with domain-specific methods, such as SimCLR, to improve self-supervised visual representation learning. Finally, we theoretically analyze our method and show advantages over the Gaussian-noise based contrastive learning approach. View details
    Preview abstract Neural networks are often over-parameterized and hence benefit from aggressive regularization. Conventional regularization methods, such as Dropout or weight decay, do not leverage the structures of the network's inputs and hidden states. As a result, these conventional methods are less effective than methods that leverage the structures, such as SpatialDropout and DropBlock, which randomly drop the values at certain contiguous areas in the hidden states and setting them to zero. Although the locations of dropout areas random, the patterns of SpatialDropout and DropBlock are manually designed and fixed. Here we propose to learn the dropout patterns. In our method, a controller learns to generate a dropout pattern at every channel and layer of a target network, such as a ConvNet or a Transformer. The target network is then trained with the dropout pattern, and its resulting validation performance is used as a signal for the controller to learn from. We show that this method works well for both image recognition on CIFAR-10 and ImageNet, as well as language modeling on Penn Treebank and WikiText-2. The learned dropout patterns also transfers to different tasks and datasets, such as from language model on Penn Treebank to Engligh-French translation on WMT 2014. Our code will be available. View details
    STraTA: Self-Training with Task Augmentation for Better Few-shot Learning
    Grady Simon
    Mohit Iyyer
    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics
    Preview abstract Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA uses task augmentation, a novel technique that synthesizes a large amount of data for auxiliary-task fine-tuning from target-task unlabeled texts. Second, STraTA performs self-training by further fine-tuning the strong base model created by task augmentation on a broad distribution of pseudo-labeled data. Our experiments demonstrate that STraTA can substantially improve sample efficiency across 12 few-shot benchmarks. Remarkably, on the SST-2 sentiment dataset, STraTA, with only 8 training examples per class, achieves comparable results to standard fine-tuning with 67K training examples. Our analyses reveal that task augmentation and self-training are both complementary and independently effective. View details
    EfficientDet: Scalable and Efficient Object Detection
    Ruoming Pang
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(2020)
    Preview abstract Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and better backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single-model and single-scale, our EfficientDet-D7 achieves state-of-the-art 53.7 AP on COCO test-dev with 52M parameters and 325B FLOPs, being 4x - 9x smaller and using 13x - 42x fewer FLOPs than previous detectors. Code is available at https://github.com/google/automl/tree/master/efficientdet. View details
    Towards a Human-like Open-Domain Chatbot
    Apoorv Kulshreshtha
    Daniel De Freitas Adiwardana
    David Richard So
    Gaurav Nemade
    Jamie Hall
    Romal Thoppilan
    Yifeng Lu
    Zi Yang
    arXiv(2020)
    Preview abstract We present Meena, a multi-turn end-to-end open-domain chatbot trained on data mined from public social media and filtered. The model was trained to minimize perplexity of the next token, but we have found evidence that this metric correlates with human judgement of quality. We propose a human judgement metric called Sensibleness and Specificity Average (SSA) which captures key elements of good conversation. Extensive experiments show strong correlation between perplexity and SSA. The fact that Meena scores high on SSA, 72%, on multi-turn evaluation suggests that a human-like chatbot with SSA score of 82% is potentially within reach if we manage to optimize perplexity better. View details
    Preview abstract Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). The encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue encoder-decoder architecture is ineffective in generating strong multi-scale features because of the scale-decreased backbone. We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search. Using similar building blocks, SpineNet models outperform ResNet-FPN models by ~3% AP at various scales while using 10-20% fewer FLOPs. In particular, SpineNet-190 achieves 52.5% AP with a MaskR-CNN detector and achieves 52.1% AP with a RetinaNet detector on COCO for a single model without test-time augmentation, significantly outperforms prior art of detectors. SpineNet can transfer to classification tasks, achieving 5% top-1 accuracy improvement on a challenging iNaturalist fine-grained dataset. Code is at: https://github.com/tensorflow/tpu/tree/master/models/official/detection. View details
    Adversarial Examples Improve Image Recognition
    Cihang Xie
    Boqing Gong
    Jiang Wang
    Alan Yuille
    CVPR(2020)
    Preview abstract Adversarial examples are commonly viewed as a threat to ConvNets. Here we present an opposite perspective: adversarial examples can be used to improve image recognition models if harnessed in the right manner. We propose AdvProp, an enhanced adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to our method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions to normal examples. We show that AdvProp improves a wide range of models on various image recognition tasks and performs better when the models are bigger. For instance, by applying AdvProp to the latest EfficientNet-B7 [28] on ImageNet, we achieve significant improvements on ImageNet (+0.7%), ImageNet-C (+6.5%), ImageNet-A (+7.0%), Stylized-ImageNet (+4.8%). With an enhanced EfficientNet-B8, our method achieves the state-of-the-art 85.5% ImageNet top-1 accuracy without extra data. This result even surpasses the best model in [20] which is trained with 3.5B Instagram images (~3000X more than ImageNet) and ~9.4X more parameters View details
    BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models
    Jiahui Yu
    Hanxiao Liu
    Gabriel M. Bender
    Thomas Huang
    Xiaodan Song
    Ruoming Pang
    European Conference on Computer Vision (ECCV)(2020)
    Preview abstract Neural architecture search (NAS) has shown promising results discovering models that are both accurate and fast. For NAS, training a one-shot model has become a popular strategy to rank the relative quality of different architectures (child models) using a single set of shared weights. However, while one-shot model weights can effectively rank different network architectures, the absolute accuracies from these shared weights are typically far below those obtained from stand-alone training. To compensate, existing methods assume that the weights must be retrained, finetuned, or otherwise post-processed after the search is completed. These steps significantly increase the compute requirements and complexity of the architecture search and model deployment. In this work, we propose BigNAS, an approach that challenges the conventional wisdom that post-processing of the weights is necessary to get good prediction accuracies. Without extra retraining or post-processing steps, we are able to train a single set of shared weights on ImageNet and use these weights to obtain child models whose sizes range from 200 to 1000 MFLOPs. Our discovered model family, BigNASModels, achieve top-1 accuracies ranging from 76.5% to 80.9%, surpassing state-of-the-art models in this range including EfficientNets and Once-for-All networks without extra retraining or post-processing. We present ablative study and analysis to further understand the proposed BigNASModels. View details
    Preview abstract Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute. View details
    PyGlove: Symbolic Programming for Automated Machine Learning
    Daiyi Peng
    Xuanyi Dong
    Yifeng Lu
    Hanxiao Liu
    Gabriel Bender
    Adam Kraft
    Chen Liang
    Neural Information Processing Systems (NeurIPS)(2020)
    Preview abstract Neural networks are sensitive to hyper-parameter and architecture choices. Automated Machine Learning (AutoML) is a promising paradigm for automating these choices. Current ML software libraries, however, are quite limited in handling the dynamic interactions among the components of AutoML. For example, efficientNAS algorithms, such as ENAS and DARTS, typically require an implementation coupling between the search space and search algorithm, the two key components in AutoML. Furthermore, implementing a complex search flow, such as searching architectures within a loop of searching hardware configurations, is difficult. To summarize, changing the search space, search algorithm, or search flow in current ML libraries usually requires a significant change in the program logic. In this paper, we introduce a new way of programming AutoML based on symbolic programming. Under this paradigm, ML programs are mutable, thus can be manipulated easily by another program. As a result, AutoML can be reformulated as an automated process of symbolic manipulation. With this formulation, we decouple the triangle of the search algorithm, the search space and the child program. This decoupling makes it easy to change the search space and search algorithm (without and with weight sharing), as well as to add search capabilities to existing code and implement complex search flows. We then introduce PyGlove, a new Python library that implements this paradigm. Through case studies on ImageNet and NAS-Bench-101, we show that with PyGlove users can easily convert a static program into a search space, quickly iterate on the search spaces and search algorithms, and craft complex search flows to achieve better results. View details
    Preview abstract Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 5.43 with only 250 examples. Our method also combines well with transfer learning, e.g., when finetuning from BERT, and yields improvements in high-data regime, such as ImageNet, whether when there is only 10% labeled data or when a full labeled set with 1.3M extra unlabeled examples is used. Code is available at https://github.com/google-research/uda. View details
    Improving 3D Object Detection through Progressive Population Based Augmentation
    Shuyang Cheng
    Zhaoqi Leng
    Barret Richard Zoph
    Chunyan Bai
    Jiquan Ngiam
    Yang Song
    Vijay Vasudevan
    Jon Shlens
    Drago Anguelov
    ECCV'2020
    Preview abstract Data augmentation has been widely adopted for object detection in 3-D point clouds. All efforts have focused on manually designing specific data augmentation methods for individual architectures, however no work has attempted to automate the design of data augmentation in 3-D detection problems -- as is common in 2-D camera-based computer vision. In this work, we present a first attempt to automate the design of data augmentation policies for 3-D object detection. We describe an algorithm termed Progressive Population Based Augmentation (PPBA). PPBA learns to optimize augmentation strategies by narrowing down the search space, and adopting the best parameters discovered in previous iterations. On the KITTI test set, PPBA improves the StarNet by substantial margins on the moderate difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset, a 20x larger dataset compared to KITTI, indicate that PPBA continues to effectively improve 3D object detection. The magnitude of the improvements may be comparable to advances in 3-D perception architectures, yet data augmentation incurs no cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient on baseline 3D detection models without augmentation, highlighting that 3D detection models may achieve competitive accuracy with far fewer labeled examples. View details
    Preview abstract We introduce Electric, an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context. We train Electric using an algorithm based on noise-contrastive estimation and elucidate how this learning objective is closely related to the recently proposed ELECTRA pre-training method. Electric performs well when transferred to downstream tasks and is particularly effective at producing likelihood scores for text: it reranks speech recognition n-best lists better than language models and much faster than masked language models. Furthermore, it offers a clearer and more principled view of what ELECTRA learns during pre-training. View details
    Preview abstract We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2. To achieve this result, we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the generation of the pseudo labels, the teacher is not noised so that the pseudo labels are as accurate as possible. However, during the learning of the student, we inject noise such as dropout, stochastic depth and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. View details
    Natural Questions: a Benchmark for Question Answering Research
    Olivia Redfield
    Danielle Epstein
    Illia Polosukhin
    Matthew Kelcey
    Jacob Devlin
    Llion Jones
    Ming-Wei Chang
    Jakob Uszkoreit
    Slav Petrov
    Transactions of the Association of Computational Linguistics(2019) (to appear)
    Preview abstract We present the Natural Questions corpus, a question answering dataset. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 examples 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature. View details
    Preview abstract In this paper, we take a closer look at data augmentation for images, and describe a simple procedure called AutoAugment to search for improved data augmentation policies. Our key insight is to create a search space of data augmentation policies, evaluating the quality of a particular policy directly on the dataset of interest. In our implementation, we have designed a search space where a policy consists of many sub-policies, one of which is randomly chosen for each image in each mini-batch. A sub-policy consists of two operations, each operation being an image processing function such as translation, rotation, or shearing, and the probabilities and magnitudes with which the functions are applied. We use a search algorithm to find the best policy such that the neural network yields the highest validation accuracy on a target dataset. Our method achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data). On ImageNet, we attain a Top-1 accuracy of 83.54%. On CIFAR-10, we achieve an error rate of 1.48%, which is 0.65% better than the previous state-of-the-art. Finally, policies learned from one dataset can be transferred to work well on other similar datasets. For example, the policy learned on ImageNet allows us to achieve state-of-the-art accuracy on the fine grained visual classification dataset Stanford Cars, without fine-tuning weights pre-trained on additional data. Code to train Wide-ResNet, Shake-Shake and ShakeDrop models with AutoAugment policies can be found at https://github.com/tensorflow/models/tree/master/research/autoaugment View details
    Preview abstract Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models. View details
    Preview abstract Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex network architectures and highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: maximizing the capacity of a standard convolutional neural network. We perform the first large-scale empirical study of the effect of capacity on video prediction models. In our experiments, we demonstrate our results on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling first-person car driving. View details
    The Evolved Transformer
    David Richard So
    Chen Liang
    ICML(2019) (to appear)
    Preview abstract Recent works have highlighted the strength of the Transformer architecture on sequence tasks while, at the same time, neural architecture search (NAS) has begun to outperform human-designed models. Our goal is to apply NAS to search for a better alternative to the Transformer. We first construct a large search space inspired by the recent advances in feed-forward sequence models and then run evolutionary architecture search with warm starting by seeding our initial population with the Transformer. To directly search on the computationally expensive WMT 2014 English-German translation task, we develop the Progressive Dynamic Hurdles method, which allows us to dynamically allocate more resources to more promising candidate models. The architecture found in our experiments -- the Evolved Transformer -- demonstrates consistent improvement over the Transformer on four well-established language tasks: WMT 2014 English-German, WMT 2014 English-French, WMT 2014 English-Czech and LM1B. At a big model size, the Evolved Transformer establishes a new state-of-the-art BLEU score of 29.8 on WMT'14 English-German; at smaller sizes, it achieves the same quality as the original "big" Transformer with 37.6% less parameters and outperforms the Transformer by 0.7 BLEU at a mobile-friendly model size of 7M parameters. View details
    Saccader: Accurate, interpretable image classification with hard attention
    Simon Kornblith
    2019 Conference on Neural Information Processing Systems (NeurIPS)(2019)
    Preview abstract Deep convolutional networks have achieved high accuracy on image classification tasks. Due to the complexity of these models, they are considered black boxes as the decisions made by these models are hard to interpret. This lack of interpretation have plagued the wide use of these models in critical application. One class of models that offers interpretations by design are those that use hard attention mechanisms. The training of these models without attention supervision is often challenging, resulting in low accuracy and poor attention locations. The difficulty stems from the fact that it is hard to quantify what is salient places in an image. Thus, these models are often trained by RL losses such as REINFORCE. In large scale images such as ImageNet, the action space is high dimensional and the reward is sparse which lead to the optimization to fail. Here we propose a novel model (Saccader) with hard attention mechanism that make discrete attention actions. We also propose a self supervised pretraining procedure that initializes the model to a state with more frequent rewards. We show that our model achieves high accuracy on ImageNet while providing interpretable decisions. View details
    Attention Augmented Convolutional Networks
    Irwan Bello
    Barret Zoph
    Ashish Vaswani
    Jon Shlens
    ICCV(2019)
    Preview abstract Convolutional networks have been the paradigm of choice in many computer vision applications. The convolution operation however has a significant weakness in that it only operates on a local neighborhood, thus missing global information. Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention. We therefore propose to augment convolutional operators with this self-attention mechanism by concatenating convolutional feature maps with a set of feature maps produced via self-attention. Extensive experiments show that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the art mobile constrained network, while keeping the number of parameters similar. In particular, our method achieves a 1.3% top-1 accuracy improvement on ImageNet classification over a ResNet50 baseline and outperforms other attention mechanisms for images such as Squeeze-and-Excitation. It also achieves an improvement of 1.4 mAP in COCO Object Detection on top of a RetinaNet baseline. View details
    Preview abstract Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. View details
    SpecAugment: A Simple Augmentation Method for Automatic Speech Recognition
    Daniel S. Park
    William Chan
    Yu Zhang
    Chung-Cheng Chiu
    Barret Zoph
    INTERSPEECH(2019) (to appear)
    Preview abstract We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filterbanks). The augmentation policy consists of warping the features, masking blocks of frequencies, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the Librispeech 960h and Swichboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with language model rescoring. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/15.4% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER. View details
    Mixtape: Breaking the Softmax Bottleneck Efficiently
    Zhilin Yang
    Ruslan Salakhutdinov
    Advances in Neural Information Processing Systems(2019)
    Preview abstract The softmax bottleneck has been shown to limit the expressiveness of neural language models. Mixture of Softmaxes (MoS) is an effective approach to address such a theoretical limitation, but are expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently with three novel techniques—logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks including language modeling and machine translation, the Mixtape layer substantially improves the efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a softmax-based network with 10-30K vocabulary sizes, and outperforms softmax in perplexity and translation quality. View details
    Preview abstract Depthwise convolution is becoming increasingly popular in modern efficient ConvNets, but its kernel size is often overlooked. In this paper, we systematically study the impact of different kernel sizes, and observe that combining the benefits of multiple kernel sizes can lead to better accuracy and efficiency. Based on this observation, we propose a new mixed depthwise convolution (MixConv), which naturally mixes up multiple kernel sizes in a single convolution. As a simple drop-in replacement of vanilla depthwise convolution, our MixConv improves the accuracy and efficiency for existing MobileNets on both ImageNet classification and COCO object detection. To demonstrate the effectiveness of MixConv, we integrate it into AutoML search space and develop a new family of models, named as MixNets, which outperform previous mobile models including MobileNetV2 [20] (ImageNet top-1 accuracy +4.2%), ShuffleNetV2 [16] (+3.5%), MnasNet [26] (+1.3%), ProxylessNAS [2] (+2.2%), and FBNet [27] (+2.0%). In particular, our MixNet-L achieves a new state-of-the-art 78.9% ImageNet top-1 accuracy under typical mobile settings (<600M FLOPS). View details
    Preview abstract It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training. View details
    Preview abstract Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8× faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3× faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. View details
    Preview abstract Transfer learning has become a cornerstone of computer vision with the advent of ImageNet features, yet little work has been done to evaluate the performance of ImageNet architectures across different datasets. An implicit hypothesis in modern computer vision research is that models that perform better on ImageNet necessarily perform better on other vision tasks. However, this hypothesis has never been systematically tested. Here, we compare the performance of 13 classification models on 12 image classification tasks in three settings: as fixed feature extractors, fine-tuned, and trained from random initialization. We find that, when networks are used as fixed feature extractors, ImageNet accuracy is only weakly predictive of accuracy on other tasks (r2=0.24). In this setting, ResNets consistently outperform networks that achieve higher accuracy on ImageNet. When networks are fine-tuned, we observe a substantially stronger correlation (r2=0.86). We achieve state-of-the-art performance on eight image classification tasks simply by fine-tuning state-of-the-art ImageNet architectures, outperforming previous results based on specialized methods for transfer learning. Finally, we observe that, on three small fine-grained image classification datasets, networks trained from random initialization perform similarly to ImageNet-pretrained networks. Together, our results show that ImageNet architectures generalize well across datasets, with small improvements in ImageNet accuracy producing improvements across other tasks, but ImageNet features are less general than previously suggested. View details
    Preview abstract We investigate how the behavior of stochastic gradient descent is influenced by model size. By studying families of models obtained by increasing the number of channels in a base network, we examine how the optimal hyperparameters---the batch size and learning rate at which the test error is minimized---correlate with the network width. We find that the optimal "normalized noise scale," which we define to be a function of the batch size, learning rate and the initialization conditions, is proportional to the number of channels (in the absence of batch normalization). This conclusion holds for MLPs, ConvNets and ResNets. A surprising consequence is that if we wish to maintain optimal performance as the network width increases, we must use increasingly small batch sizes. Based on our experiments, we also conjecture that there may be a critical width, beyond which the optimal performance of networks trained with constant SGD ceases to improve unless additional regularization is introduced. View details
    Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games?
    Maithra Raghu
    Jacob Andreas
    Robert Kleinberg
    Jon Kleinberg
    ICML(2018)
    Preview abstract Deep reinforcement learning has seen many remarkable successes over the past few years. However, progress is hindered by challenges faced in reinforcement learning, such as large variability in performance, catastrophic forgetting, and overfitting to particular states. We propose Erdos-Selfridge-Spencer games as a reinforcement learning testbed. We focus in particular on one of the best-known games in this genre, Spencer’s attacker-defender game, also known as the “tenure game”. This game has several nice properties: it is (i) a low-dimensional, simply parametrized environment where (ii) there is a linear closed form solution for optimal behavior from any state, and (iii) the difficulty of the game can be tuned by changing environment parameters in an interpretable way. We compare several RL methods to the tenure game, examining their performance given varying environment difficulty and their generalization to environments outside the training set. View details
    Preview abstract We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. ENAS constructs a large computational graph, where each subgraph represents a neural network architecture, hence forcing all architectures to share their parameters. A controller is trained with policy gradient to search for a subgraph that maximizes the expected reward on a validation set. Meanwhile a model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss. Sharing parameters among child models allows ENAS to deliver strong empirical performances, whilst using much fewer GPU-hours than existing automatic model design approaches, and notably, 1000x less expensive than standard Neural Architecture Search. On Penn Treebank, ENAS discovers a novel architecture that achieves a test perplexity of 56.3, on par with the existing state-of-the-art among all methods without post-training processing. On CIFAR-10, ENAS finds a novel architecture that achieves 2.89% test error, which is on par with the 2.65% test error of NASNet (Zoph et al., 2018). View details
    Preview abstract It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train Inception-ResNet-V2 on ImageNet to $77\%$ validation accuracy in under 2500 parameter updates, efficiently utilizing training batches of 65536 images. View details
    Preview abstract Current end-to-end Q&A models are primarily based on recurrent neural networks with attention. Despite their success, these models are often slow for both training and inference. We propose a novel Q&A model that does not require recurrent networks yet achieves equivalent or better performance than existing models. Our model is simple in that it consists exclusively of attention and convolutions. We present a thorough study of architectural choices that improve the accuracy of this simple model. We also propose a novel data augmentation technique that not only enhances the training examples but also diversifies the phrasing of the sentences. It results in immediate improvement in the accuracy. This technique is of independent interest that it can be readily applied to other natural language processing tasks. On the SQuAD dataset, our model is 3x faster in training and 10x faster in inference. The model achieves 82.2 F1 score on the development set, which is on par with best documented result of 81.8. View details
    Preview abstract There is growing interest in automating neural network architecture design. Existing architecture search methods can be computationally expensive, requiring thousands of different architectures to be trained from scratch. Recent work has explored weight sharing across models to amortize the cost of training. Although previous methods reduced the cost of architecture search by orders of magnitude, they remain complex, requiring hypernetworks or reinforcement learning controllers. We aim to understand weight sharing for one-shot architecture search. With careful experimental analysis, we show that it is possible to efficiently identify promising architectures from a complex search space without either hypernetworks or RL. View details
    Preview abstract It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the fact that neural networks operate on high dimensional data, they overfit, or they are too linear. Here we argue that the origin of adversarial examples is primarily due to an inherent uncertainty that neural networks have about their predictions. We show that the functional form of this uncertainty is independent of architecture, dataset, and training protocol; and depends only on the statistics of the logit differences of the network, which do not change significantly during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation. We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-theart deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD). Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white and black box attacks compared to previous attempts. View details
    Memory Augmented Policy Optimization for Program Synthesis with Generalization
    Chen Liang
    Mohammad Norouzi
    Jonathan Berant
    Ni Lao
    NeurIPS (Spotlight)(2018)
    Preview abstract We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at https://goo.gl/TXBp4e View details
    Preview abstract Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective. This auxiliary loss forces RNNs to either reconstruct previous events or predict next events in a sequence, making truncated backpropagation feasible for long sequences and also improving full BPTT. We evaluate our method on a variety of settings, including pixel-by-pixel image classification with sequence lengths up to 16\,000, and a real document classification benchmark. Our results highlight good performance and resource efficiency of this approach over competitive baselines, including other recurrent models and a comparable sized Transformer. Further analyses reveal beneficial effects of the auxiliary loss on optimization and regularization, as well as extreme cases where there is little to no backpropagation. View details
    AirDialogue: An Environment for Goal-Oriented Dialogue Research
    Wei Wei
    Jia Li
    Empirical Methods in Natural Language Processing (EMNLP)(2018)
    Preview abstract Recent progress in dialogue generation has inspired a number of studies on dialogue systems that are capable of accomplishing tasks through natural language interactions. A promising direction among these studies is the use of reinforcement learning techniques, such as self-play, for training dialogue agents. However, current datasets are limited in size, and the environment for training agents and evaluating process is relatively unsophisticated. We present AirDialogue, a large dataset that contains 402,038 goal-oriented conversations. To collect this dataset, we create a contextgenerator which provides travel and flight restrictions. We then ask human annotators to play the role of a customer or an agent and interact with the goal of successfully booking a trip given the restrictions. Key to our environment is the ease of evaluating the success of the dialogue, which is achieved by using ground-truth states (eg, the flight being booked) generated by the restrictions. Any dialogue agent that does not generate the correct states is considered to fail. Our experimental results indicate that state-of-the-art dialogue models on the test dataset can only achieve a scaled score of 0.22 and an exact match score of 0.1 while humans can reach a score of 0.94 and 0.93 respectively, which suggests significant opportunities for future improvement. View details
    Learning Transferable Architectures for Scalable Image Recognition
    Barret Zoph
    Vijay Vasudevan
    Jonathon Shlens
    CVPR (Spotlight Presentation)(2018)
    Preview abstract Developing neural network image classification models often requires significant architecture engineering. In this paper, we study a method to learn the model architectures directly on the dataset of interest. As this approach is expensive when the dataset is large, we propose to search for an architectural building block on a small dataset and then transfer the block to a larger dataset. The key contribution of this work is the design of a new search space (the "NASNet search space") which enables transferability. In our experiments, we search for the best convolutional layer (or "cell") on the CIFAR-10 dataset and then apply this cell to the ImageNet dataset by stacking together more copies of this cell, each with their own parameters to design a convolutional architecture, named "NASNet architecture". We also introduce a new regularization technique called ScheduledDropPath that significantly improves generalization in the NASNet models. On CIFAR-10 itself, NASNet achieves 2.4% error rate, which is state-of-the-art. On ImageNet, NASNet achieves, among the published works, state-of-the-art accuracy of 82.7% top-1 and 96.2% top-5 on ImageNet. Our model is 1.2% better in top-1 accuracy than the best human-invented architectures while having 9 billion fewer FLOPS - a reduction of 28% in computational demand from the previous state-of-the-art model. When evaluated at different levels of computational cost, accuracies of NASNets exceed those of the state-of-the-art human-designed models. For instance, a small version of NASNet also achieves 74% top-1 accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms. Finally, the learned features by NASNet used with the Faster-RCNN framework surpass state-of-the-art by 4.0% achieving 43.1% mAP on the COCO dataset. View details
    Hierarchical Planning for Device Placement
    Azalia Mirhoseini
    Anna Goldie
    Hieu Pham
    Benoit Steiner
    ICLR(2018)
    Preview abstract We introduce a hierarchical model for efficient placement of computational graphs onto hardware devices, especially in heterogeneous environments with a mixture of CPUs, GPUs, and other computational devices. Our method learns to assign graph operations to groups and to allocate those groups to available devices. The grouping and device allocations are learned jointly. The proposed method is trained with policy gradient and requires no human intervention. Experiments with widely-used computer vision and natural language models show that our algorithm can find optimized, non-trivial placements for TensorFlow computational graphs with over 80,000 operations. In addition, our approach outperforms placements by human experts as well as a previous state-of-the-art placement method based on deep reinforcement learning. Our method achieves runtime reductions of up to 60.6% per training step when applied to models such as Neural Machine Translation. View details
    Preview abstract Recent work has argued that stochastic gradient descent can approximate the Bayesian uncertainty in model parameters near local minima. In this work we develop a similar correspondence for minibatch natural gradient descent (NGD). We prove that for sufficiently small learning rates, if the model predictions on the training set approach the true conditional distribution of labels given inputs, the stationary distribution of minibatch NGD approaches a Bayesian posterior near local minima. The temperature T = N/(2B) is controlled by the learning rate , training set size N and batch size B. However minibatch NGD is not parameterisation invariant and it does not sample a valid posterior away from local minima. We therefore propose a novel optimiser, “stochastic NGD”, which introduces the additional correction terms required to preserve both properties. View details
    Preview abstract Deep neural networks often work well when they are over-parameterized and trained with a massive amount of noise and regularization, such as weight decay and dropout. Although dropout is widely used as a regularization technique for fully connected layers, it is often less effective for convolutional layers. This lack of success of dropout for convolutional layers is perhaps due to the fact that activation units in convolutional layers are spatially correlated so information can still flow through convolutional networks despite dropout. Thus a structured form of dropout is needed to regularize convolutional networks. In this paper, we introduce DropBlock, a form of structured dropout, where units in a contiguous region of a feature map are dropped together. We found that applying DropbBlock in skip connections in addition to the convolution layers increases the accuracy. Also, gradually increasing number of dropped units during training leads to better accuracy and more robust to hyperparameter choices. Extensive experiments show that DropBlock works better than dropout in regularizing convolutional networks. On ImageNet classification, ResNet-50 architecture with DropBlock achieves 78.13% accuracy, which is more than 1.6% improvement on the baseline. On COCO detection, DropBlock improves Average Precision of RetinaNet from 36.8% to 38.4%. View details
    Scalable and accurate deep learning for electronic health records
    Alvin Rishi Rajkomar
    Eyal Oren
    Kai Chen
    Nissan Hajaj
    Mila Hardt
    Peter J. Liu
    Xiaobing Liu
    Jake Marcus
    Patrik Per Sundberg
    Kun Zhang
    Yi Zhang
    Gerardo Flores
    Gavin Duggan
    Jamie Irvine
    Kurt Litsch
    Alex Mossin
    Justin Jesada Tansuwan
    De Wang
    Dana Ludwig
    Samuel Volchenboum
    Kat Chou
    Michael Pearson
    Srinivasan Madabushi
    Nigam Shah
    Atul Butte
    npj Digital Medicine(2018)
    Preview abstract Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting: in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient’s chart. View details
    Preview abstract Zhang et al. (2016) argued that understanding deep learning requires rethinking generalization. To justify this claim, they showed that deep networks can easily memorize randomly labeled training data, despite generalizing well when shown real labels of the same inputs. We show here that the same phenomenon occurs in small linear models with fewer than a thousand parameters; however there is no need to rethink anything, since our observations are explained by evaluating the Bayesian evidence in favor of each model. This Bayesian evidence penalizes sharp minima. We also explore the “generalization gap” observed between small and large batch training, identifying an optimum batch size which scales linearly with both the learning rate and the size of the training set. Surprisingly, in our experiments the generalization gap was closed by regularizing the model. View details
    Searching for Activation Functions
    Prajit Ramachandran
    Barret Zoph
    ICLR(2018)
    Preview abstract The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, f(x)=x⋅sigmoid(βx), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9\% for Mobile NASNet-A and 0.6\% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network. View details
    Preview abstract The effort devoted to hand-crafting image classifiers has motivated the use of architecture search to discover them automatically. Although evolutionary algorithms have been repeatedly applied to architecture search, the architectures thus discovered have remained inferior to human-crafted ones. Here we show for the first time that artificially-evolved architectures can match or surpass human-crafted and RL-designed image classifiers. In particular, our models---named AmoebaNets---achieved a state-of-the-art accuracy of 97.87% on CIFAR-10 and top-1 accuracy of 83.1% on ImageNet. Among mobile-size models, an AmoebaNet with only 5.1M parameters also achieved a state-of-the-art top-1 accuracy of 75.1% on ImageNet. We also compared this method against strong baselines. Finally, we performed platform-aware architecture search with evolution to find a model that trains quickly on Google Cloud TPUs. This method produced an AmoebaNet that won the Stanford DAWNBench competition for lowest ImageNet training cost. View details
    Preview abstract We present a simple but effective technique for deep semi-supervised learning. On labeled examples, the model is trained with standard cross-entropy loss. On an unlabeled example, the model first performs inference (acting as a “teacher”) and then learns from the resulting output distribution (acting as a “student”). We deviate from prior work by adding multiple auxiliary student softmax layers to the model. The input to each student layer is a sub-network of the full model that has a restricted view of the input (e.g., only seeing one region of an image). The students can learn from the teacher because the teacher sees more of each example. Concurrently, the students improve the representations used by the teacher as they learn to make predictions with limited data. We propose variants of our method for CNN image classifiers and BiLSTM sequence taggers. When combined with Virtual Adversarial Training, it improves upon the current state-of-the-art on semi-supervised CIFAR-10 and semi-supervised SVHN. We also apply it to train semi-supervised sequence taggers for four Natural Language Processing tasks using hundreds of millions of sentences of unlabeled data. The resulting models improve upon or are competitive with the current state-of-the-art on every task. View details
    Latent Sequence Decompositions
    William Chan
    Yu Zhang
    Navdeep Jaitly
    ICLR(2017)
    Preview abstract We present the Latent Sequence Decompositions (LSD) framework. LSD decomposes sequences with variable lengthed output units as a function of both the input sequence and the output sequence. We present a training algorithm which samples valid extensions and an approximate decoding algorithm. We experiment with the Wall Street Journal speech recognition task. Our LSD model achieves 12.9% WER compared to a character baseline of 14.8% WER. When combined with a convolutional network on the encoder, we achieve 9.6% WER. View details
    Neural Combinatorial Optimization with Reinforcement Learning
    Irwan Bello
    Hieu Pham
    Mohammad Norouzi
    Samy Bengio
    ICLR Workshop(2017)
    Preview abstract This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Despite the computational expense, without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items. View details
    Neural Optimizer Search with Reinforcement Learning
    Irwan Bello
    Barret Zoph
    Vijay Vasudevan
    ICML(2017)
    Preview abstract We present an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. We train a Recurrent Neural Network controller to generate a string in a domain specific language that describes a mathematical update equation based on a list of primitive functions, such as the gradient, running average of the gradient, etc. The controller is trained with Reinforcement Learning to maximize the performance of a model after a few epochs. On CIFAR-10, our method discovers several update rules that are better than many commonly used optimizers, such as Adam, RMSProp, or SGD with and without Momentum on a ConvNet model. We introduce two new optimizers, named PowerSign and AddSign, which we show transfer well and improve training on a variety of different tasks and architectures, including ImageNet classification and Google's neural machine translation system. View details
    Preview abstract This work explores hypernetworks: an approach of using a one network, also known as a hypernetwork, to generate the weights for another network. Hypernetworks provide an abstraction that is similar to what is found in nature: the relationship between a genotype - the hypernetwork - and a phenotype - the main network. Though they are also reminiscent of HyperNEAT in evolution, our hypernetworks are trained end-to-end with backpropagation and thus are usually faster. The focus of this work is to make hypernetworks useful for deep convolutional networks and long recurrent networks, where hypernetworks can be viewed as relaxed form of weight-sharing across layers. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve near state-of-the-art results on a variety of sequence modelling tasks including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks. Our results also show that hypernetworks applied to convolutional networks still achieve respectable results for image recognition tasks compared to state-of-the-art baseline models while requiring fewer learnable parameters. View details
    Preview abstract The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost. View details
    Learning to Skim Text
    Adams Wei Yu
    Hongrae Lee
    ACL(2017)
    Preview abstract Recurrent Neural Networks are showing much promise in many sub-areas of natural language processing, ranging from document classification to machine translation to automatic question answering. Despite their promise, many recurrent models have to read the whole text word by word, making it slow to handle long documents. For example, it is difficult to use a recurrent network to read a book and answer questions about it. In this paper, we present an approach of reading text while skipping irrelevant information if needed. The underlying model is a recurrent network that learns how far to jump after reading a few words of the input text. We employ a standard policy gradient method to train the model to make discrete jumping decisions. In our benchmarks on four different tasks, including number prediction, sentiment analysis, news article classification and automatic Q\&A, our proposed model, a modified LSTM with jumping, is up to 6 times faster than the standard sequential LSTM, while maintaining the same or even better accuracy. View details
    Preview abstract Neural Machine Translation (NMT) has shown remarkable progress over the past few years with production systems now being deployed to end-users. One major drawback of current architectures is that they are expensive to train, typically requiring days to weeks of GPU time to converge. This makes exhaustive hyperparameter search, as is commonly done with other neural network architectures, prohibitively expensive. In this work, we present the first large-scale analysis of NMT architecture hyperparameters. We report empirical results and variance numbers for several hundred experimental runs, corresponding to over 250,000 GPU hours on the standard WMT English to German translation task. Our experiments lead to novel insights and practical advice for building and extending NMT architectures. As part of this contribution, we release an open-source NMT framework that enables researchers to easily experiment with novel techniques and reproduce state of the art results. View details
    Preview abstract Neural networks are powerful and flexible models that work well for many difficult learning tasks in image, speech and natural language understanding. Despite their success, neural networks are still hard to design. In this paper, we use a recurrent network to generate the model descriptions of neural networks and train this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, our method, starting from scratch, can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy. Our CIFAR-10 model achieves a test error rate of 3.84, which is only 0.1 percent worse and 1.2x faster than the current state-of-the-art model. On the Penn Treebank dataset, our model can compose a novel recurrent cell that outperforms the widely-used LSTM cell, and other state-of-the-art baselines. Our cell achieves a test set perplexity of 62.4 on the Penn Treebank, which is 3.6 perplexity better than the previous state-of-the-art. View details
    Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision
    Chen Liang {+crazydonkey}
    Jonathan Berant {+joberant}
    Ken Forbus
    Ni Lao
    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada(2017), pp. 23-33
    Preview abstract Modern semantic parsers, which map natural language utterances to executable logical forms, have been successfully trained over large knowledge bases from weak supervision, but require hand-crafted rules and substantial feature engineering. Recent attempts to train an end-to-end neural network for semantic parsing have either used strong supervision (full logical forms), or have employed synthetic datasets and differentiable operations. In this work, we propose the Boss-Programmer-Computer framework to integrate neural network models with symbolic operations. Within this framework, we introduce Neural Symbolic Machines, in which a sequence-to-sequence neural network "programmer" controls a non-differentiable "computer" that executes Lisp programs (equivalent to logical forms) and provides code assistance. The interaction between the "programmer" and "computer" dramatically reduces the search space and effectively learns the semantic parser from weak supervision over a large knowledge base, such as Freebase. Our model obtained new state-of-the-art performance on \textsc{WebQuestionsSP}, a challenging semantic parsing dataset. View details
    Preview abstract Learning a natural language interface for database tables is a challenging task that involves deep language understanding and multi-step reasoning. The task is often approached by mapping natural language queries to logical forms or programs that provide the desired response when executed on the database. To our knowledge, this paper presents the first weakly supervised, end-to-end neural network model to induce such programs on a real-world dataset. We enhance the objective function of Neural Programmer, a neural network with built-in discrete operations, and apply it on WikiTableQuestions, a natural language question-answering dataset. The model is trained end-to-end with weak supervision of question-answer pairs, and does not require domain-specific grammars, rules, or annotations that are key elements in previous approaches to program induction. The main experimental result in this paper is that a single Neural Programmer model achieves 34.2% accuracy using only 10,000 examples with weak supervision. An ensemble of 15 models, with a trivial combination technique, achieves 37.2% accuracy, which is competitive to the current state-of-the-art accuracy of 37.1% obtained by a traditional natural language semantic parser. View details
    Device Placement Optimization with Reinforcement Learning
    Azalia Mirhoseini
    Hieu Pham
    Mohammad Norouzi
    Samy Bengio
    Benoit Steiner
    Yuefeng Zhou
    Naveen Kumar
    ICML(2017)
    Preview abstract The past few years have seen much success in applying neural networks to many practical problems. Together with this success is the growth in size and computational requirements for training and inference with neural networks. A common approach to address these requirements is to use a heterogeneous distributed environment with a mix of hardware devices such as CPUs, and GPUs. Importantly, the decision of placing parts of the neural models on devices is most often made by a human expert relying on heuristic approaches. In this paper, we propose a method which learns to optimize device placement. Key to our method is the employment of a recurrent neural network to predict a set of device placements for a target neural computation graph. The execution time according to the predicted placements is then used as the reward function to optimize the parameters of the recurrent neural network. Our main result is that on Inception for ImageNet classification, and on LSTM, for language modeling and neural translation, our model finds non-trivial device placements that significantly outperform handcrafted heuristics and traditional algorithmic methods. View details
    Preview abstract Neural networks have proven effective at solving difficult problems but designing their architectures can be challenging, even for image classification problems alone. Evolutionary algorithms provide a technique to discover such networks automatically. Despite significant computational requirements, we show that evolving models that rival large, hand-designed architectures is possible today. We employ simple evolutionary techniques at unprecedented scales to discover models for the CIFAR-10 and CIFAR-100 datasets, starting from trivial initial conditions. To do this, we use novel and intuitive mutation operators that navigate large search spaces. We stress that no human participation is required once evolution starts and that the output is a fully-trained model. Throughout this work, we place special emphasis on the repeatability of results, the variability in the outcomes and the computational requirements. View details
    Preview abstract A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given (text, audio) pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods. View details
    Multi-task Sequence to Sequence Learning
    Ilya Sutskever
    Lukasz Kaiser
    International Conference on Learning Representations(2016)
    Preview abstract Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the oneto-many setting - where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting - useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the many-to-many setting - where multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation. Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought. View details
    Preview abstract We propose a simple, elegant solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English->French and surpasses state-of-the-art results for English->German. Similarly, a single multilingual model surpasses state-of-the-art results for French->English and German->English on WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation is possible for neural translation. Finally, we show analyses that hints at a universal interlingua representation in our models and show some interesting examples when mixing languages. View details
    A Neural Transducer
    Navdeep Jaitly
    David Sussillo
    Ilya Sutskever
    Samy Bengio
    NIPS 2016(2016)
    Preview abstract Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used. View details
    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
    Mike Schuster
    Mohammad Norouzi
    Maxim Krikun
    Yuan Cao
    Qin Gao
    Apurva Shah
    Xiaobing Liu
    Łukasz Kaiser
    Stephan Gouws
    Taku Kudo
    Keith Stevens
    George Kurian
    Nishant Patil
    Wei Wang
    Cliff Young
    Jason Smith
    Alex Rudnick
    Macduff Hughes
    CoRR, abs/1609.08144(2016)
    Preview abstract Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system. View details
    Preview abstract We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%. View details
    Neural Combinatorial Optimization with Reinforcement Learning
    Irwan Bello
    Hieu Pham
    Mohammad Norouzi
    Samy Bengio
    ICLR(2016)
    Preview abstract This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Despite the computational expense, without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items. View details
    Preview abstract This work explores hypernetworks: an approach of using one network, also known as a hypernetwork, to generate the weights for another network. We apply hypernetworks to generate adaptive weights for recurrent networks. In this case, hypernetworks can be viewed as a relaxed form of weight-sharing across layers. In our implementation, hypernetworks are are trained jointly with the main network in an end-to-end fashion. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve state-of-the-art results on a variety of sequence modelling tasks including character-level language modelling, handwriting generation and neural machine translation, challenging the weight-sharing paradigm for recurrent networks. View details
    Preview abstract Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a new model that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, our method computes the next-step distribution conditioned on the partial input sequence observed and the partial sequence generated. It accomplishes this goal using an encoder recurrent neural network (RNN) that computes features at the same frame rate as the input, and a transducer RNN that operates over blocks of input steps. The transducer RNN extends the sequence produced so far using a local sequence-to-sequence model. During training, our method uses alignment information to generate supervised targets for each block. Approximate alignment is easily available for tasks such as speech recognition, action recognition in videos, etc. During inference (decoding), beam search is used to find the most likely output sequence for an input sequence. This decoding is performed online - at the end of each block, the best candidates from the previous block are extended through the local sequence-to-sequence model. On TIMIT, our online method achieves 19.8% phone error rate (PER). For comparison with published sequence-to-sequence methods, we used a bidirectional encoder and achieved 18.7% PER compared to 17.6% from the best reported sequence-to-sequence model. Importantly, unlike sequence-to-sequence our model is minimally impacted by the length of the input. On artificially created longer utterances, it achieves 20.9% with a unidirectional model, compared to 20% from the best bidirectional sequence-to-sequence models. View details
    Preview abstract Sequence to sequence models are successful tools for supervised sequence learning tasks, such as machine translation. Despite their success, these models still require much labeled data and it is unclear how to improve them using unlabeled data, which is much less expensive to obtain. In this paper, we present simple changes that lead to a significant improvement in the accuracy of seq2seq models when the labeled set is small. Our method intializes the encoder and decoder of the seq2seq model with the trained weights of two language models, and then all weights are jointly fine-tuned with labeled data. An additional language modeling loss can be used to regularize the model during fine-tuning. We apply this method to low-resource tasks in machine translation and abstractive summarization and find that it significantly improves the subsequent supervised models. Our main finding is that the pretraining accelerates training and improves generalization of seq2seq models, achieving state-of-the-art results on the WMT English→German task. Our model obtains an improvement of 1.3 BLEU from the previous best models on both WMT'14 and WMT'15 English→German. Our ablation study shows that pretraining helps seq2seq models in different ways depending on the nature of the task: translation benefits from the improved generalization whereas summarization benefits from the improved optimization. View details
    Neural Programmer: Inducing Latent Programs with Gradient Descent
    Arvind Neelakantan
    Ilya Sutskever
    International Conference on Learning Representations(2016)
    Preview abstract Deep neural networks have achieved impressive supervised classification performance in many tasks including image recognition, speech recognition, and sequence to sequence learning. However, this success has not been translated to applications like question answering that may involve complex arithmetic and logic reasoning. A major limitation of these models is in their inability to learn even simple arithmetic and logic operations. For example, it has been shown that neural networks fail to learn to add two binary numbers reliably. In this work, we propose Neural Programmer, an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations. Neural Programmer can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations. The model learns from a weak supervision signal which is the result of execution of the correct program, hence it does not require expensive annotation of the correct program itself. The decisions of what operations to call, and what data segments to apply to are inferred by Neural Programmer. Such decisions, during training, are done in a differentiable fashion so that the entire network can be trained jointly by gradient descent. We find that training the model is difficult, but it can be greatly improved by adding random noise to the gradient. On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy. View details
    Listen, Attend and Spell
    William Chan
    Navdeep Jaitly
    CoRR, abs/1508.01211(2015)
    Preview abstract We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%. View details
    A Simple Way to Initialize Recurrent Networks of Rectified Linear Units
    Navdeep Jaitly
    Geoffrey E. Hinton
    CoRR, abs/1504.00941(2015)
    Preview abstract Learning long term dependencies in recurrent networks is difficult due to vanishing and exploding gradients. To overcome this difficulty, researchers have developed sophisticated optimization techniques and network architectures. In this paper, we propose a simpler solution that use recurrent neural networks composed of rectified linear units. Key to our solution is the use of the identity matrix or its scaled version to initialize the recurrent weight matrix. We find that our solution is comparable to a standard implementation of LSTMs on our four benchmarks: two toy problems involving long-range temporal structures, a large language modeling problem and a benchmark speech recognition problem. View details
    Preview abstract Conversational modeling is an important task in natural language understanding and machine intelligence. Although previous approaches exist, they are often restricted to specific domains (e.g., booking an airline ticket) and require hand-crafted rules. In this paper, we present a simple approach for this task which uses the recently proposed sequence to sequence framework. Our model converses by predicting the next sentence given the previous sentence or sentences in a conversation. The strength of our model is that it can be trained end-to-end and thus requires much fewer hand-crafted rules. We find that this straightforward model can generate simple conversations given a large conversational training dataset. Our preliminary results suggest that, despite optimizing the wrong objective function, the model is able to converse well. It is able extract knowledge from both a domain specific dataset, and from a large, noisy, and general domain dataset of movie subtitles. On a domain-specific IT helpdesk dataset, the model can find a solution to a technical problem via conversations. On a noisy open-domain movie transcript dataset, the model can perform simple forms of common sense reasoning. As expected, we also find that the lack of consistency is a common failure mode of our model. View details
    Adding Gradient Noise Improves Learning for Very Deep Networks
    Arvind Neelakantan
    Luke Vilnis
    Ilya Sutskever
    Lukasz Kaiser
    Karol Kurach
    James Martens
    CoRR, abs/1511.06807(2015)
    Preview abstract Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than more basic architectures. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we discuss a low-overhead and easy-to-implement technique of adding gradient noise which we find to be surprisingly effective when training these very deep architectures. The technique not only helps to avoid overfitting, but also can result in lower training loss. This method alone allows a fully-connected 20-layer deep network to be trained with standard gradient descent, even starting from a poor initialization. We see consistent improvements for many complex models, including a 72% relative reduction in error rate over a carefully-tuned baseline on a challenging question-answering task, and a doubling of the number of accurate binary multiplication models learned across 7,000 random restarts. We encourage further application of this technique to additional complex modern architectures. View details
    Semi-supervised sequence learning
    Advances in Neural Information Processing Systems, NIPS(2015)
    Preview abstract We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a “pretraining” step for a later supervised sequence learning algorithm. In other words, the parameters obtained from the unsupervised step can be used as a starting point for other supervised training models. In our experiments, we find that long short term memory recurrent networks after being pretrained with the two approaches are more stable and generalize better. With pretraining, we are able to train long short term recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups. View details
    Preview abstract Neural Machine Translation (NMT) is a new approach to machine translation that has shown promising results that are comparable to traditional approaches. A significant weakness in conventional NMT systems is their inability to correctly translate very rare words: end-to-end NMTs tend to have relatively small vocabularies with a single unk symbol that represents every possible out-of-vocabulary (OOV) word. In this paper, we propose and implement an effective technique to address this problem. We train an NMT system on data that is augmented by the output of a word alignment algorithm, allowing the NMT system to emit, for each OOV word in the target sentence, the position of its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates every OOV word using a dictionary. Our experiments on the WMT’14 English to French translation task show that this method provides a substantial improvement of up to 2.8 BLEU points over an equivalent NMT system that does not use this technique. With 37.5 BLEU points, our NMT system is the first to surpass the best result achieved on a WMT’14 contest task. View details
    Preview abstract Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier. View details
    Document embedding with paragraph vectors
    Christopher Olah
    NIPS Deep Learning Workshop(2014)
    Preview abstract Paragraph Vectors has been recently proposed as an unsupervised method for learning distributed representations for pieces of texts. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide a more thorough comparison of Paragraph Vectors to other document modelling algorithms such as Latent Dirichlet Allocation, and evaluate performance of the method as we vary the dimensionality of the learned representation. We benchmarked the models on two document similarity data sets, one from Wikipedia, one from arXiv. We observe that the Paragraph Vector method performs significantly better than other methods, and propose a simple improvement to enhance embedding quality. Somewhat surprisingly, we also show that much like word embeddings, vector operations on Paragraph Vectors can perform useful semantic results. View details
    Distributed Representations of Sentences and Documents
    Tomas Mikolov
    International Conference on Machine Learning(2014)
    Preview abstract Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-ofwords models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks. View details
    On Rectified Linear Units For Speech Processing
    M.D. Zeiler
    M. Ranzato
    R. Monga
    M. Mao
    K. Yang
    P. Nguyen
    G.E. Hinton
    38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver(2013)
    Preview abstract Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology. Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data. View details
    Preview abstract Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs. View details
    Preview abstract Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural to expect not to see an elephant to appear in the middle of an ocean. We consider a simple approach to encapsulate such common sense knowledge using co-occurrence statistics from web documents. By merely counting the number of times nouns (such as elephants, sharks, oceans, etc.) co-occur in web documents, we obtain a good estimate of expected co-occurrences in visual data. We then cast the problem of combining textual co-occurrence statistics with the predictions of image-based classifiers as an optimization problem. The resulting optimization problem serves as a surrogate for our inference procedure. Albeit the simplicity of the resulting optimization problem, it is effective in improving both recognition and localization accuracy. Concretely, we observe significant improvements in recognition and localization rates for both ImageNet Detection 2012 and Sun 2012 datasets. View details
    Fastfood-computing hilbert space expansions in loglinear time
    Alex Smola
    International Conference on Machine Learning(2013) (to appear)
    Preview abstract Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that computing the decision function is typically expensive, especially at prediction time. In this paper, we overcome this difficulty by proposing Fastfood, an approximation that accelerates such computation significantly. Key to Fastfood is the observation that Hadamard matrices when combined with diagonal Gaussian matrices exhibit properties similar to dense Gaussian random matrices. Yet unlike the latter, Hadamard and diagonal matrices are inexpensive to multiply and store. These two matrices can be used in lieu of Gaussian matrices in Random Kitchen Sinks (Rahimi & Recht, 2007) and thereby speeding up the computation for a large range of kernel functions. Specifically, Fastfood requires O(n log d) time and O(n) storage to compute n non-linear basis functions in d dimensions, a significant improvement from O(nd) computation and storage, without sacrificing accuracy. We prove that the approximation is unbiased and has low variance. Extensive experiments show that we achieve similar accuracy to full kernel expansions and Random Kitchen Sinks while being 100x faster and using 1000x less memory. These improvements, especially in terms of memory usage, make kernel methods more practical for applications that have large training sets and/or require real-time prediction. View details
    Fastfood - Approximating Kernel Expansions in Loglinear Time
    Alex Smola
    30th International Conference on Machine Learning (ICML), Omnipress(2013)
    Preview abstract Fast nonlinear function classes are crucial for nonparametric estimation, such as in kernel methods. This paper proposes an improvement to random kitchen sinks that offers significantly faster computation in log-linear time without sacrificing accuracy. Furthermore, we show how one may adjust the regularization properties of the kernel simply by changing the spectral distribution of the projection matrix. We provide experimental results which show that even for for moderately small problems we already achieve two orders of magnitude faster computation and three orders of magnitude lower memory footprint. View details
    Grounded compositional semantics for finding and describing images with sentences
    Richard Socher
    Andrej Karpathy
    Chris D. Manning
    Andrew Y. Ng
    Transactions of the Association for Computational Linguistics(2013) (to appear)
    Preview abstract Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DTRNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use constituency trees, DT-RNNs naturally focus on the action and agents in a sentence. They are better able to abstract from the details of word order and syntactic expression. DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa. They also give more similar representations to sentences that describe the same image. View details
    Large Scale Distributed Deep Networks
    Rajat Monga
    Kai Chen
    Matthieu Devin
    Mark Z. Mao
    Marc’Aurelio Ranzato
    Paul Tucker
    Ke Yang
    Andrew Y. Ng
    NIPS(2012)
    Preview abstract Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. View details
    Building high-level features using large scale unsupervised learning
    Marc'Aurelio Ranzato
    Rajat Monga
    Matthieu Devin
    Kai Chen
    Andrew Ng
    International Conference in Machine Learning(2012)
    Preview abstract We consider the problem of building highlevel, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art. View details
    Recurrent Neural Networks for Noise Reduction in Robust ASR
    Andrew Maas
    Tyler M. O’Neil
    Patrick Nguyen
    Andrew Y. Ng
    INTERSPEECH(2012)
    Preview abstract Recent work on deep neural networks as acoustic models for automatic speech recognition (ASR) have demonstrated substantial performance improvements. We introduce a model which uses a deep recurrent auto encoder neural network to denoise input features for robust ASR. The model is trained on stereo (noisy and clean) audio features to predict clean features given noisy input. The model makes no assumptions about how noise affects the signal, nor the existence of distinct noise environments. Instead, the model can learn to model any type of distortion or additive noise given sufficient training data. We demonstrate the model is competitive with existing feature denoising approaches on the Aurora2 task, and outperforms a tandem approach where deep networks are used to predict phoneme posteriors directly. View details
    Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis
    Will Zou
    Serena Yeung
    Andrew Y. Ng
    Conference on Computer Vision and Pattern Recognition(2011)
    Preview abstract Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations. By replacing hand-designed features with our learned features, we achieve classification results superior to all previous published results on the Hollywood2, UCF, KTH and YouTube action recognition datasets. On the challenging Hollywood2 and YouTube action datasets we obtain 53.3% and 75.8% respectively, which are approximately 5% better than the current best published results. Further benefits of this method, such as the ease of training and the efficiency of training and prediction, will also be discussed. You can download our code and learned spatio-temporal features here: http://ai.stanford.edu/∼wzou/ View details
    Learning invariant features of tumor signature
    Ju Han
    Paul Spellman
    Alexander Borowsky
    Bahram Parvin
    ISBI(2012)
    Preview abstract We present a novel method for automated learning of features from unlabeled image patches for classification of tumor architecture. In contrast to manually designed feature detectors (e.g., Gabor basis function), the proposed method utilizes independent subspace analysis to reconstruct a natural representation. Learning is described as a two-layer network with non-linear responses, where the second layer represents subspace structures. The technique is applied to tissue sections for characterizing necrosis, apoptotic, and viable regions of Glioblastoma Multifrome (GBM) from TCGA dataset. We show that the performance of this method is better than expert designed representation, therefore, promising a wider application of self-learning strategies for tissue characterization. View details
    On Optimization Methods for Deep Learning
    Jiquan Ngiam
    Adam Coates
    Abhik Lahiri
    Bobby Prochnow
    Andrew Y. Ng
    ICML(2011)
    Preview abstract The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are diffi- cult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between LBFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69% on the standard MNIST dataset. This is a state-of-theart result on MNIST among algorithms that do not use distortions or pretraining. View details
    ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning
    Alex Karpenko
    Jiquan Ngiam
    Andrew Y. Ng
    NIPS(2011)
    Preview abstract Independent Components Analysis (ICA) and its variants have been successfully used for unsupervised feature learning. However, standard ICA requires an orthonoramlity constraint to be enforced, which makes it difficult to learn overcomplete features. In addition, ICA is sensitive to whitening. These properties make it challenging to scale ICA to high dimensional data. In this paper, we propose a robust soft reconstruction cost for ICA that allows us to learn highly overcomplete sparse features even on unwhitened data. Our formulation reveals formal connections between ICA and sparse autoencoders, which have previously been observed only empirically. Our algorithm can be used in conjunction with off-the-shelf fast unconstrained optimizers. We show that the soft reconstruction cost can also be used to prevent replicated features in tiled convolutional neural networks. Using our method to learn highly overcomplete sparse features and tiled convolutional neural networks, we obtain competitive performances on a wide variety of object recognition tasks. We achieve state-of-the-art test accuracies on the STL-10 and Hollywood2 datasets. View details
    Bundle Methods for Regularized Risk Minimization
    Choon Hui Teo
    S. V. N. Vishwanathan
    Alex J. Smola
    Journal of Machine Learning Research, 11(2010), pp. 311-365
    Tiled Convolutional Neural Networks
    Jiquan Ngiam
    Zhenghao Chen
    Pang Wei Koh
    Andrew Y. Ng
    NIPS(2010)
    Preview abstract Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other to have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets. View details
    Low-cost Accelerometers for Robotic Manipulator Perception
    Morgan Quigley
    Reuben Brewer
    Sai P. Soundararaj
    Vijay Pradeep
    Andrew Y. Ng
    IROS(2010)
    Preview abstract We present a series of experiments which explore the use of consumer-grade accelerometers as joint position sensors for robotic manipulators. We show that 6- and 7-dof joint angle estimation is possible by using one 3-d accelerometer for each pair of joints. We demonstrate two calibration approaches and experimental results using accelerometer-based control in both position-control and torque-control regimes. We present a manipulator design combining accelerometer-based sensing with low-cost actuation, and conclude by demonstrating the utility of consumer-grade accelerometers even on high-precision manipulators. View details
    Learning to grasp objects with multiple contact points
    David Kamm
    Arda Kara
    Andrew Y. Ng
    ICRA(2010)
    Preview abstract We consider the problem of grasping novel objects and its application to cleaning a desk. A recent successful approach applies machine learning to learn one grasp point in an image and a point cloud. Although those methods are able to generalize to novel objects, they yield suboptimal results because they rely on motion planner for finger placements. In this paper, we extend their method to accommodate grasps with multiple contacts. This approach works well for many humanmade objects because it models the way we grasp objects. To further improve the grasping, we also use a method that learns the ranking between candidates. The experiments show that our method is highly effective compared to a state-of-the-art competitor. View details
    Grasping Novel Objects with Depth Segmentation
    Deepak Rao
    Thanathorn Phoka
    Morgan Quigley
    Attawith Sudsang
    Andrew Y. Ng
    IROS(2010)
    Preview abstract We consider the task of grasping novel objects and cleaning fairly cluttered tables with many novel objects. Recent successful approaches employ machine learning algorithms to identify points on the scene that the robot should grasp. In this paper, we show that the task can be significantly simplified by using segmentation, especially with depth information. A supervised localization method is employed to select graspable segments. We also propose a shape completion and grasp planner method which takes partial 3D information and plans the most stable grasping strategy. Extensive experiments on our robot demonstrate the effectiveness of our approach. View details
    High-Accuracy 3D Sensing for Mobile Manipulation: Improving Object Detection and Door Opening
    Morgan Quigley
    Siddarth Batra
    Stephen Gould
    Ellen Klingbeil
    Ashley Wellman
    Andrew Y. Ng
    ICRA(2009)
    Preview abstract High-resolution 3D scanning can improve the performance of object detection and door opening, two tasks critical to the operation of mobile manipulators in cluttered homes and workplaces. We discuss how high-resolution depth information can be combined with visual imagery to improve the performance of object detection beyond what is (currently) achievable with 2D images alone, and we present door-opening and inventory-taking experiments. View details
    Scalable Learning for Object Detection with GPU Hardware
    Adam Coates
    Paul Baumstarck
    Andrew Y. Ng
    IROS(2009)
    Preview abstract We consider the problem of robotic object detection of such objects as mugs, cups, and staplers in indoor environments. While object detection has made significant progress in recent years, many current approaches involve extremely complex algorithms, and are prohibitively slow when applied to large scale robotic settings. In this paper, we describe an object detection system that is designed to scale gracefully to large data sets and leverages upward trends in computational power (as exemplified by Graphics Processing Unit (GPU) technology) and memory. We show that our GPU-based detector is up to 90 times faster than a well-optimized software version and can be easily trained on millions of examples. Using inexpensive off-the-shelf hardware, it can recognize multiple object types reliably in just a few seconds per frame. View details
    Proximal regularization for online and batch learning
    Chuong B. Do
    Chuan-Sheng Foo
    ICML(2009)
    Preview abstract Many learning algorithms rely on the curvature (in particular, strong convexity) of regularized objective functions to provide good theoretical performance guarantees. In practice, the choice of regularization penalty that gives the best testing set performance may result in objective functions with little or even no curvature. In these cases, algorithms designed specifically for regularized objectives often either fail completely or require some modification that involves a substantial compromise in performance. We present new online and batch algorithms for training a variety of supervised learning models (such as SVMs, logistic regression, structured prediction models, and CRFs) under conditions where the optimal choice of regularization parameter results in functions with low curvature. We employ a technique called proximal regularization, in which we solve the original learning problem via a sequence of modified optimization tasks whose objectives are chosen to have greater curvature than the original problem. Theoretically, our algorithms achieve low regret bounds in the online setting and fast convergence in the batch setting. Experimentally, our algorithms improve upon state-of-the-art techniques, including Pegasos and bundle methods, on medium and large-scale SVM and structured learning tasks. View details
    Estimating Labels from Label Proportions
    Novi Quadrianto
    Alex J. Smola
    Tibério S. Caetano
    Journal of Machine Learning Research, 10(2009), pp. 2349-2374
    Measuring Invariances in Deep Networks
    Ian Goodfellow
    Andrew Saxe
    Honglak Lee
    Andrew Y. Ng
    NIPS(2009)
    Preview abstract For many pattern recognition tasks, the ideal input feature would be invariant to multiple confounding properties (such as illumination and viewing angle, in computer vision applications). Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, it is difficult to evaluate the learned features by any means other than using them in a classifier. In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different input transformations. We find that stacked autoencoders learn modestly increasingly invariant features with depth when trained on natural images. We find that convolutional deep belief networks learn substantially more invariant features in each layer. These results further justify the use of “deep” vs. “shallower” representations, but suggest that mechanisms beyond merely stacking one autoencoder on top of another may be important for achieving invariance. Our evaluation metrics can also be used to evaluate future work in deep learning, and thus help the development of future algorithms. View details
    Estimating labels from label proportions
    Novi Quadrianto
    Alex J. Smola
    Tibério S. Caetano
    ICML(2008), pp. 776-783
    Learning Graph Matching
    Tibério S. Caetano
    Julian John McAuley
    Li Cheng
    Alex J. Smola
    CoRR, abs/0806.2890(2008)
    Tighter Bounds for Structured Estimation
    Olivier Chapelle
    Chuong B. Do
    Alexander J. Smola
    Choon Hui Teo
    NIPS(2008), pp. 281-288
    COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking
    Markus Weimer
    Alexandros Karatzoglou
    Alex J. Smola
    NIPS(2007)
    Learning Graph Matching
    Tibério S. Caetano
    Li Cheng
    Alex J. Smola
    ICCV(2007), pp. 1-8
    Bundle Methods for Machine Learning
    Alex J. Smola
    S. V. N. Vishwanathan
    NIPS(2007)
    Direct Optimization of Ranking Measures
    Alexander J. Smola
    CoRR, abs/0704.3359(2007)
    A scalable modular convex solver for regularized risk minimization
    Choon Hui Teo
    Alex J. Smola
    S. V. N. Vishwanathan
    KDD(2007), pp. 727-736
    Transductive Gaussian Process Regression with Automatic Model Selection
    Alexander J. Smola
    Thomas Gärtner
    ECML(2006), pp. 306-317
    Nonparametric Quantile Estimation
    Ichiro Takeuchi
    Tim D. Sears
    Alexander J. Smola
    Journal of Machine Learning Research, 7(2006), pp. 1231-1264
    Simpler knowledge-based support vector machines
    Alex J. Smola
    Thomas Gärtner
    ICML(2006), pp. 521-528
    Learning to Rank with Nonsmooth Cost Functions
    Christopher Burges
    Robert Ragno
    NIPS(2006)
    Preview abstract The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions. View details
    Heteroscedastic Gaussian process regression
    Alexander J. Smola
    Stéphane Canu
    ICML(2005), pp. 489-496
    Large-Scale Multiclass Transduction
    Thomas Gärtner
    Simon Burton
    Alex J. Smola
    S. V. N. Vishwanathan
    NIPS(2005)