Kevin Swersky

Authored Publications
    Data-Driven Offline Optimization for Architecting Hardware Accelerators
    Aviral Kumar
    Sergey Levine
    International Conference on Learning Representations 2022 (to appear)
    With the goal of achieving higher efficiency, the semiconductor industry has gradually reformed towards application-specific hardware accelerators. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the target applications or constraints change. An alternative paradigm is a "data-driven", offline approach that utilizes logged simulation data to architect hardware accelerators without needing any form of simulation. Such an approach not only alleviates the need to run time-consuming simulations, but also enables data reuse and applies even when target applications change. In this paper, we develop PRIME, a data-driven offline optimization method for designing hardware accelerators that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization.
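    A minimal sketch of the two-stage idea described above, assuming a small PyTorch setup with stand-in data (the model sizes, penalty weight, and tensors are illustrative, not PRIME's actual configuration): first fit a surrogate cost model conservatively, then optimize a design against it with no further simulator queries.
```python
import torch
import torch.nn as nn

# Surrogate mapping a 16-dim accelerator configuration to a scalar cost.
cost_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(cost_model.parameters(), lr=1e-3)

logged_x = torch.randn(512, 16)        # logged configurations (stand-in data)
logged_y = torch.randn(512, 1)         # their simulated costs (lower is better)
infeasible_x = torch.randn(128, 16)    # configurations whose simulation failed

for _ in range(1000):
    opt.zero_grad()
    mse = ((cost_model(logged_x) - logged_y) ** 2).mean()
    # Conservatism: push predicted cost *up* on infeasible designs so the
    # downstream optimizer is not lured toward poorly-supported regions.
    loss = mse - 0.1 * cost_model(infeasible_x).mean()
    loss.backward()
    opt.step()

# Optimize a design against the frozen surrogate: zero simulator queries.
x = torch.randn(1, 16, requires_grad=True)
design_opt = torch.optim.Adam([x], lr=1e-2)
for _ in range(200):
    design_opt.zero_grad()
    cost_model(x).sum().backward()     # descend the predicted cost
    design_opt.step()
```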
    A Hierarchical Neural Model of Data Prefetching
    Zhan Shi
    Akanksha Jain
    Calvin Lin
    Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2021)
    This paper presents Voyager, a novel neural network for data prefetching. Unlike previous neural models for prefetching, which are limited to learning delta correlations, our model can also learn address correlations, which are important for prefetching irregular sequences of memory accesses. The key to our solution is its hierarchical structure that separates addresses into pages and offsets and that introduces a mechanism for learning important relations among pages and offsets. Voyager provides significant prediction benefits over current data prefetchers. For a set of irregular programs from the SPEC 2006 and GAP benchmark suites, Voyager sees an average IPC improvement of 41.6% over a system with no prefetcher, compared with 21.7% and 28.2%, respectively, for idealized Domino and ISB prefetchers. We also find that for two commercial workloads for which current data prefetchers see very little benefit, Voyager dramatically improves both accuracy and coverage. At present, slow training and prediction preclude neural models from being practically used in hardware, but Voyager's overheads are significantly lower, in every dimension, than those of previous neural models. For example, computation cost is reduced by 15-20×, and storage overhead is reduced by 110-200×. Thus, Voyager represents a significant step towards a practical neural prefetcher.
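    The hierarchical decomposition is the part that translates most directly into code. Below is a minimal sketch assuming 4KB pages and 64-byte cache lines; `TinyVoyager` and all sizes are hypothetical stand-ins, not the paper's architecture.
```python
import torch
import torch.nn as nn

OFFSET_BITS = 6            # 64B cache lines within 4KB pages -> 64 offsets

def split(addr):
    """Split cache-line addresses into (page, offset) components."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)

class TinyVoyager(nn.Module):
    def __init__(self, n_pages=4096, d=64):
        super().__init__()
        # Real prefetchers map the huge page space into a bounded vocabulary;
        # here we simply assume page indices < n_pages.
        self.page_emb = nn.Embedding(n_pages, d)
        self.off_emb = nn.Embedding(1 << OFFSET_BITS, d)
        self.rnn = nn.LSTM(2 * d, d, batch_first=True)
        self.page_head = nn.Linear(d, n_pages)          # next-page prediction
        self.off_head = nn.Linear(d, 1 << OFFSET_BITS)  # next-offset prediction

    def forward(self, pages, offsets):   # both [batch, seq_len] index tensors
        h = torch.cat([self.page_emb(pages), self.off_emb(offsets)], dim=-1)
        out, _ = self.rnn(h)
        last = out[:, -1]                # summary of the access history
        return self.page_head(last), self.off_head(last)
```
    Splitting the prediction into a page head and an offset head is what keeps the output space tractable: the model never has to classify over the full address space.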
    Energy-Based Models (EBMs) present a flexible and appealing way to represent uncertainty. Despite recent advances, training EBMs on high-dimensional data remains a challenging problem, as the state-of-the-art approaches are costly, unstable, and require considerable tuning and domain expertise to apply successfully. In this work, we present a simple method for training EBMs at scale which uses an entropy-regularized generator to amortize the MCMC sampling typically used in EBM training. We improve upon prior MCMC-based entropy regularization methods with a fast variational approximation. We demonstrate the effectiveness of our approach by using it to train tractable likelihood models. Next, we apply our estimator to the recently proposed Joint Energy Model (JEM), where we match the original performance with faster and more stable training. This allows us to extend JEM models to semi-supervised classification on tabular data from a variety of continuous domains.
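    A minimal sketch of the amortization idea on a toy 2-D continuous domain: a generator produces the EBM's negative samples in place of MCMC, with a crude batch-variance term standing in for the paper's fast variational entropy estimate. All names and hyperparameters are illustrative.
```python
import torch
import torch.nn as nn

ebm = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # log-density, up to a constant
gen = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
opt_e = torch.optim.Adam(ebm.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)

def entropy_proxy(x):
    # Crude stand-in for the paper's variational entropy estimator:
    # reward per-dimension spread in the generated batch.
    return 0.5 * torch.log(x.var(dim=0) + 1e-6).sum()

for _ in range(1000):
    data = torch.randn(128, 2)           # stand-in for real training data
    fake = gen(torch.randn(128, 8))

    # EBM update: raise log-density on data, lower it on generator samples.
    opt_e.zero_grad()
    (ebm(fake.detach()).mean() - ebm(data).mean()).backward()
    opt_e.step()

    # Generator update: chase high log-density while staying high-entropy,
    # amortizing the work MCMC would otherwise do at every training step.
    opt_g.zero_grad()
    (-ebm(fake).mean() - 0.1 * entropy_proxy(fake)).backward()
    opt_g.step()
```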
    We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
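    A minimal sketch of the proposal for binary variables, assuming a toy Ising-style log-probability: a first-order Taylor estimate of each bit-flip's effect drives a softmax over flip locations, followed by a standard Metropolis-Hastings correction. The model and constants are illustrative, not the authors' code.
```python
import torch

D = 32
W = torch.randn(D, D); W = (W + W.T) / 2     # toy symmetric couplings

def log_prob(x):                              # x: float tensor in {0,1}^D
    return (x @ W @ x) / 2

def flip_scores(x):
    # First-order estimate of the log-prob change from flipping each bit.
    x = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(log_prob(x), x)
    return -(2 * x.detach() - 1) * grad

def step(x):
    q_fwd = torch.softmax(flip_scores(x) / 2, dim=0)   # where to flip
    i = torch.multinomial(q_fwd, 1).item()
    x_new = x.clone(); x_new[i] = 1 - x_new[i]
    q_rev = torch.softmax(flip_scores(x_new) / 2, dim=0)
    # Metropolis-Hastings correction keeps the chain exact.
    log_alpha = (log_prob(x_new) - log_prob(x)
                 + torch.log(q_rev[i]) - torch.log(q_fwd[i]))
    return x_new if torch.rand(()).log() < log_alpha else x

x = (torch.rand(D) < 0.5).float()
for _ in range(100):
    x = step(x)
```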
    Big Self-Supervised Models are Strong Semi-Supervised Learners
    Ting Chen
    Simon Kornblith
    Mohammad Norouzi
    Geoffrey Everest Hinton
    Advances in Neural Information Processing Systems (2020)
    One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to common approaches to semi-supervised learning for computer vision, we show that it is surprisingly effective for semi-supervised learning on ImageNet. A key ingredient of our approach is the use of big (deep and wide) networks during pretraining and fine-tuning. We find that the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network. After fine-tuning, the big network can be further improved and distilled into a much smaller one with little loss in classification accuracy by using the unlabeled examples a second time, but in a task-specific way. The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2, supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge. This procedure achieves 73.9% ImageNet top-1 accuracy with just 1% of the labels (≤13 labeled images per class) using ResNet-50, a 10× improvement in label efficiency over the previous state-of-the-art. With 10% of labels, ResNet-50 trained with our method achieves 77.5% top-1 accuracy, outperforming standard supervised training with all of the labels.
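    Of the three steps, distillation is the most self-contained to sketch. Below is a minimal version assuming generic `teacher` and `student` classifiers (the names and temperature are illustrative, not SimCLRv2 code): the student matches the big fine-tuned network's softened predictions on unlabeled images.
```python
import torch
import torch.nn.functional as F

T = 1.0  # distillation temperature

def distill_step(teacher, student, x, optimizer):
    """One distillation update on a batch x of unlabeled images."""
    with torch.no_grad():
        t_logits = teacher(x)            # the big fine-tuned network
    s_logits = student(x)                # the small network being trained
    # Cross-entropy against the teacher's softened predictions.
    loss = -(F.softmax(t_logits / T, dim=-1) *
             F.log_softmax(s_logits / T, dim=-1)).sum(dim=-1).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```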
    Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new line. This is challenging because it requires planning far ahead, and currently there is no known practical solution. As a result, current replacement policies typically resort to heuristics designed for specific common access patterns, which fail on more diverse and complex access patterns. In contrast, we propose an imitation learning approach to automatically learn cache access patterns by leveraging Belady's, an oracle policy that computes the optimal eviction decision given the future cache accesses. While directly applying Belady's is infeasible since the future is unknown, we train a policy conditioned only on past accesses that accurately approximates Belady's even on diverse and complex access patterns, and call this approach PARROT. When evaluated on 13 of the most memory-intensive SPEC applications, PARROT increases cache hit rates by 20% over the current state of the art. In addition, on a large-scale web search benchmark, PARROT increases cache hit rates by 61% over a conventional LRU policy. We release a Gym environment to facilitate research in this area, as data is plentiful, and further advancements can have significant real-world impact.
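    A minimal sketch of the oracle half of this recipe, with hypothetical names: Belady's picks the cached line whose next use lies farthest in the future, and those decisions become the labels that a policy over past accesses is trained to imitate.
```python
def belady_victim(cache_lines, future_accesses):
    """Return the index of the cached line reused farthest in the future."""
    def next_use(line):
        try:
            return future_accesses.index(line)
        except ValueError:
            return float("inf")          # never reused: the ideal victim
    return max(range(len(cache_lines)), key=lambda i: next_use(cache_lines[i]))

# Training data is built from a trace: (history of past accesses, oracle
# victim) pairs, which a sequence model is then fit to with cross-entropy.
trace = [3, 1, 4, 1, 5, 9, 2, 6, 5, 1]
print(belady_victim([1, 5, 9], trace[6:]))  # -> 2 (line 9 is never reused)
```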
    Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle this recently, we find the current procedures and datasets used to systematically assess progress in this setting lacking. To address this, we propose META-DATASET: a new benchmark for training and evaluating few-shot classifiers that is large-scale, consists of multiple datasets, and presents more natural and realistic tasks. The aim is to measure the ability of state-of-the-art models to leverage diverse sources of data to achieve higher generalization, and to evaluate that generalization ability in a more challenging setting. We additionally measure the robustness of current methods to variations in the number of available examples and the number of classes. Finally, our extensive empirical evaluation leads us to identify weaknesses in Prototypical Networks and MAML, two popular few-shot classification methods, and to propose a new method, ProtoMAML, which achieves improved performance on our benchmark.
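    A minimal sketch of the kind of task variation the benchmark stresses, with a hypothetical `sample_episode` helper (illustrative, not the benchmark's actual sampling algorithm, which also varies the source dataset): both the number of classes ("ways") and the number of examples per class ("shots") change from episode to episode.
```python
import random

def sample_episode(dataset, max_ways=10, max_shots=5):
    """dataset: dict mapping class name -> list of examples.

    Assumes every class has at least max_shots + 1 examples.
    """
    ways = random.randint(2, min(max_ways, len(dataset)))
    classes = random.sample(sorted(dataset), ways)
    support, query = [], []
    for label, cls in enumerate(classes):
        shots = random.randint(1, max_shots)      # varies per class
        pool = random.sample(dataset[cls], shots + 1)
        support += [(x, label) for x in pool[:-1]]
        query.append((pool[-1], label))
    return support, query
```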
    Learned Hardware/Software Co-Design of Neural Accelerators
    Zhan Shi
    Chirag Sakhuja
    Calvin Lin
    ML for Systems Workshop at NeurIPS (2020)
    The use of deep learning has grown at an exponential rate, giving rise to numerous specialized hardware and software systems for deep learning. Because the design space of deep learning software stacks and hardware accelerators is diverse and vast, prior work considers software optimizations separately from hardware architectures, effectively reducing the search space. Unfortunately, this bifurcated approach means that many profitable design points are never explored. This paper instead casts the problem as hardware/software co-design, with the goal of automatically identifying desirable points in the joint design space. The key to our solution is a new constrained Bayesian optimization framework that avoids invalid solutions by exploiting the highly constrained features of this design space, which are semi-continuous/semi-discrete. We evaluate our optimization framework by applying it to a variety of neural models, improving the energy-delay product by 18% (ResNet) and 40% (DQN) over hand-tuned state-of-the-art systems, as well as demonstrating strong results on other neural network architectures, such as MLPs and Transformers.
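    A minimal sketch of one ingredient of such a framework, with hypothetical knobs and a hypothetical constraint: candidates are drawn from a mixed discrete/continuous space and invalid configurations are pruned before the acquisition function ever scores them. A real constrained Bayesian optimization loop would score candidates with a surrogate model (e.g., expected improvement) rather than an arbitrary function.
```python
import random

PE_COUNTS = [32, 64, 128, 256]      # discrete hardware choices (examples)
TILE_SIZES = [8, 16, 32, 64]        # discrete software choices (examples)

def feasible(cfg):
    # Example constraint: on-chip buffer must hold one tile per element.
    return cfg["pe"] * cfg["tile"] <= 4096

def propose(n=100):
    cands = [{"pe": random.choice(PE_COUNTS),
              "tile": random.choice(TILE_SIZES),
              "clock_ghz": random.uniform(0.5, 2.0)}   # continuous knob
             for _ in range(n)]
    return [c for c in cands if feasible(c)]           # prune invalid points

def bo_step(acquisition):
    """Pick the feasible candidate the acquisition function likes best."""
    return max(propose(), key=acquisition)

best = bo_step(lambda cfg: cfg["pe"] * cfg["clock_ghz"])  # toy acquisition
```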
    The looming end of Moore's Law and the ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Accelerator design forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints (e.g., area budget) or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use Apollo to optimize accelerator configurations of a diverse set of neural architectures with alternative design constraints. We show that Apollo finds optimal design configurations more sample-efficiently than baseline approaches. We further show that transferring knowledge between target architectures with different design constraints helps to find optimal configurations faster. This encouraging outcome suggests a promising path toward shortening the timeline for accelerator design.
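    One simple way to make the transfer idea concrete, under loose assumptions and with hypothetical helpers (Apollo itself builds on considerably more capable black-box optimizers): pool (configuration, task-descriptor, cost) data from earlier searches into one surrogate, then use it to rank candidates for a new task.
```python
import numpy as np

def fit_surrogate(X, tasks, y):
    """Ridge regression on [config ; task-descriptor] features.

    X: (n, d) configs, tasks: (n, t) task descriptors, y: (n,) costs.
    """
    Z = np.hstack([X, tasks])
    A = Z.T @ Z + 1e-3 * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)

def rank_for_new_task(w, candidates, task_vec):
    """Rank candidate configs (np array) for a new task, best first."""
    Z = np.hstack([candidates, np.tile(task_vec, (len(candidates), 1))])
    return candidates[np.argsort(Z @ w)]   # lowest predicted cost first

rng = np.random.default_rng(0)             # toy end-to-end usage
w = fit_surrogate(rng.normal(size=(200, 8)),
                  rng.normal(size=(200, 3)),
                  rng.normal(size=200))
ranked = rank_for_new_task(w, rng.normal(size=(50, 8)), rng.normal(size=3))
```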
    As the performance of computer systems stagnates due to the end of Moore's Law, there is a need for new models that can understand and optimize the execution of general-purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn static representations of source code, these representations do not capture how code executes at runtime. In this work, we propose a new approach using GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution, and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45%, respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related algorithm classification task.
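    A minimal sketch of the shared-encoder, multi-head shape of such a model, with illustrative node types, heads, and sizes (not the paper's architecture): one message-passing encoder over a graph whose nodes can represent assembly instructions or dynamic values, with separate output heads for the two dynamic tasks.
```python
import torch
import torch.nn as nn

class FusedGNN(nn.Module):
    def __init__(self, d=64, n_node_types=128):
        super().__init__()
        # Node types cover both static entities (instructions, registers)
        # and dynamic ones (memory values) in one shared vocabulary.
        self.emb = nn.Embedding(n_node_types, d)
        self.msg = nn.Linear(d, d)
        self.upd = nn.GRUCell(d, d)
        self.branch_head = nn.Linear(d, 2)       # taken / not taken
        self.prefetch_head = nn.Linear(d, 64)    # e.g., address-delta classes

    def forward(self, node_types, adj, steps=4):
        h = self.emb(node_types)                 # [n_nodes, d]
        for _ in range(steps):
            m = adj @ self.msg(h)                # aggregate neighbor messages
            h = self.upd(m, h)                   # gated node update
        g = h.mean(dim=0)                        # simple graph readout
        return self.branch_head(g), self.prefetch_head(g)
```
    Training both heads against one shared encoder is what makes the representation "fused": gradients from branch prediction and prefetching shape the same node embeddings.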