Mark Sandler

Authored Publications
    Preview abstract In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks. At each layer we introduce a set of learnable embedding vectors that provide contextual information useful for specific datasets. We call these "memory tokens". We show that augmenting a model with just a handful of such tokens per layer significantly improves accuracy when compared to conventional head-only fine-tuning, and performs only slightly below the significantly more expensive full fine-tuning. We then propose an attention-masking approach that enables extension to new downstream tasks with computation reuse. In this setup, in addition to being parameter-efficient, models can execute both old and new tasks as part of a single inference at a small incremental cost. View details
    Preview abstract In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity Transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable. View details
    Preview abstract Conditional computation and modular networks have been recently proposed for multitask learning and other problems as a way to decompose problem solving into multiple reusable computational blocks. We propose a new approach for learning modular networks based on the isometric version of ResNet with all residual blocks having the same configuration and the same number of parameters. This architectural choice allows adding, removing and changing the order of residual blocks. In our method, the modules can be invoked repeatedly and allow knowledge transfer to novel tasks by adjusting the order of computation. This allows soft weight sharing between tasks with only a small increase in the number of parameters. We show that our method leads to interpretable self-organization of modules in the case of multi-task learning, transfer learning and domain adaptation while achieving competitive results on those tasks. From a practical perspective, our approach allows us to: (a) reuse existing modules for learning a new task by adjusting the computation order, (b) perform unsupervised multi-source domain adaptation, illustrating that adaptation to unseen data can be achieved by manipulating only the order of pretrained modules, (c) increase the accuracy of existing architectures for image classification tasks such as ImageNet, without any parameter increase, by reusing the same block multiple times. View details
    Preview abstract In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks have neither explicit notion of nor ever receive gradients. The synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques, or evolutionary strategies, such as CMA-ES. Resulting update rules generalize to unseen tasks and train faster than gradient descent based optimizers for several standard computer vision and synthetic tasks. View details
    SpotPatch: Parameter-Efficient Transfer Learning for Mobile Object Detection
    Keren Ye
    Adriana Kovashka
    Menglong Zhu
    Andrew Howard
    Proceedings of the Asian Conference on Computer Vision (ACCV), Springer (2020)
    Preview abstract Deep learning based object detectors are commonly deployed on mobile devices to solve a variety of tasks. For maximum accuracy, each detector is usually trained to solve one single specific task, and comes with a completely independent set of parameters. While this guarantees high performance, it is also highly inefficient, as each model has to be separately downloaded and stored. In this paper we address the question: can task-specific detectors be trained and represented as a shared set of weights, plus a very small set of additional weights for each task? The main contributions of this paper are the following: 1) we perform the first systematic study of parameter-efficient transfer learning techniques for object detection problems; 2) we propose a technique to learn a model patch with a size that is dependent on the difficulty of the task to be learned, and validate our approach on 10 different object detection tasks. Our approach achieves accuracy similar to that of previously proposed approaches, while being significantly more compact. View details
    Preview abstract In order to prepare for and control the continued spread of the COVID-19 pandemic while minimizing its economic impact, the world needs to be able to estimate and predict COVID-19’s spread. Unfortunately, we cannot directly observe the prevalence or growth rate of COVID-19; these must be inferred using some kind of model. We propose a hierarchical Bayesian extension to the classic susceptible-exposed-infected-removed (SEIR) compartmental model that adds compartments to account for isolation and death and allows the infection rate to vary as a function of both mobility data collected from mobile phones and a latent time-varying factor that accounts for changes in behavior not captured by mobility data. Since confirmed-case data is unreliable, we infer the model’s parameters conditioned on deaths data. We replace the exponential-waiting-time assumption of classic compartmental models with Erlang distributions, which allows for a more realistic model of the long lag between exposure and death. The mobility data gives us a leading indicator that can quickly detect changes in the pandemic’s local growth rate and forecast changes in death rates weeks ahead of time. This is an analysis of observational data, so any causal interpretations of the model's inferences should be treated as suggestive at best; nonetheless, the model’s inferred relationship between different kinds of trips and the infection rate does suggest some possible hypotheses about what kinds of activities might contribute most to COVID-19’s spread. View details
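The switch from exponential to Erlang waiting times mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's model: the function names, the stage count `k`, and the mean delay are hypothetical values chosen for illustration. The only fact relied on is the standard one that the time to pass through `k` exponential stages in series, each with rate `k / mean`, follows an Erlang distribution with the same mean and variance `mean**2 / k`.

```python
import math


def erlang_pdf(t, k, mean):
    """Density of the total time spent in k exponential stages in series.

    Each stage has rate k / mean, so the total delay has mean `mean`.
    With k = 1 this reduces to the exponential distribution assumed by
    classic compartmental models.
    """
    if t < 0:
        return 0.0
    rate = k / mean
    return rate**k * t ** (k - 1) * math.exp(-rate * t) / math.factorial(k - 1)


def erlang_variance(k, mean):
    """Variance of the total delay: it shrinks as more stages are chained."""
    return mean**2 / k


# Hypothetical numbers: a 20-day mean delay modeled with 1 stage vs. 4 stages.
for stages in (1, 4):
    density_at_day_1 = erlang_pdf(1.0, stages, 20.0)
    print(stages, density_at_day_1, erlang_variance(stages, 20.0))
```

With a single stage the density is highest at time zero, so very short delays are the most likely; with several stages almost no mass sits near zero, which is closer to a long lag between two events.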
    Preview abstract Despite the success of deep neural networks (DNNs), state-of-the-art models are too large to deploy on low-resource devices or common server configurations in which multiple models are held in memory. Model compression methods address this limitation by reducing the memory footprint, latency, or energy consumption of a model with minimal impact on accuracy. We focus on the task of reducing the number of learnable variables in the model. In this work we combine ideas from weight hashing and dimensionality reduction, resulting in a simple and powerful structured multi-hashing method based on matrix products that allows direct control of the model size of any deep network and is trained end-to-end. We demonstrate the strength of our approach by compressing models from the ResNet, EfficientNet, and MobileNet architecture families. Our method allows us to drastically decrease the number of variables while maintaining high accuracy. For instance, by applying our approach to EfficientNet-B4 (16M parameters) we reduce it to the size of B0 (5M parameters), while gaining over 3% in accuracy over the B0 baseline. On the commonly used benchmark CIFAR10 we reduce the ResNet32 model by 75% with no loss in quality, and are able to do a 10x compression while still achieving above 90% accuracy. View details
    Preview abstract We explore the question of how the resolution of the input image affects the performance of a neural network when compared to the resolution of the hidden layers. Image resolution is frequently used as a hyperparameter providing a trade-off between model performance and accuracy. An intuitive interpretation is that the decay in accuracy when reducing input resolution is caused by the reduced information content in the low-resolution input. Often left unsaid is the fact that this also reduces the model's internal resolution. In this paper, we show that up to a point the input resolution plays very little role in the network performance. We show that other obvious hypotheses, such as changes in receptive fields, are not the primary root cause either. We then use this insight to develop novel neural network architectures that we call isometric neural networks, which maintain a fixed internal resolution throughout their entire depth, and demonstrate that this leads to high-accuracy models with a low activation footprint and parameter count. View details
    Preview abstract A new method for learning image attention masks in a semi-supervised setting based on the Information Bottleneck principle is proposed. Provided with a set of labeled images, the mask generation model minimizes the mutual information between the input and the masked image while maximizing the mutual information between the same masked image and the image label. In contrast with other approaches, our attention model produces a boolean rather than a continuous mask, thus entirely concealing information from masked-out pixels. Using a set of synthetic datasets based on MNIST and CIFAR10 and the SVHN dataset, we demonstrate that our method can successfully attend to features defining the image class. We also discuss potential drawbacks of our method and propose a mask randomization technique to alleviate one of them. View details
    Preview abstract In this paper we introduce a novel method that enables parameter-efficient transfer and multitask learning. We show that by reusing more than 95% of the parameters we can re-purpose neural networks to solve very different types of problems, such as going from COCO-dataset SSD detection to ImageNet classification. Our approach allows both simultaneous (e.g. multi-task) learning as well as sequential fine-tuning where we change the already trained networks to solve a different problem. We show that our approach leads to a significant increase in accuracy when compared to traditional logits-only fine-tuning while using far fewer parameters. Interestingly, for multi-task learning our approach sometimes acts as a regularizer, often leading to improved performance when compared to models trained on a single task. Our approach has multiple immediate applications. It can be used to dramatically increase the number of models available in resource-constrained settings, since the marginal cost of a new model is now less than 5% of the full model. The constrained fine-tuning enables better generalization when a limited amount of data is available. We evaluate our approach on multiple datasets and multiple models. View details
    Preview abstract Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to designing and improving mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporates model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8× faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3× faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. View details
    Preview abstract In this paper we describe a new mobile architecture, MobileNetV2, that improves the state-of-the-art performance of mobile models on multiple benchmarks across a spectrum of different model sizes. MobileNetV2 is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers, while the intermediate layer is an expanded representation that uses lightweight depthwise convolutions to filter features. Additionally, we find that it is important not to use non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows a decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, VOC image segmentation and COCO object detection datasets, and evaluate the trade-offs between accuracy, number of multiply-adds, and number of parameters. View details
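The inverted residual structure described in this abstract (thin input, expanded depthwise stage, thin linear output) can be made concrete with a rough cost count. The sketch below is an illustration under stated assumptions, not the authors' code: the function name and the example sizes are hypothetical, the block is assumed to use stride 1, and biases and normalization parameters are ignored.

```python
def inverted_residual_cost(h, w, d_in, d_out, t=6, k=3):
    """Weights and multiply-adds for one stride-1 inverted residual block.

    h, w   : spatial size of the feature map (assumed unchanged by the block)
    d_in   : channels of the thin input bottleneck
    d_out  : channels of the thin output bottleneck
    t      : expansion factor of the intermediate layer (hypothetical default)
    k      : depthwise kernel size (hypothetical default)
    """
    d_mid = t * d_in
    expand = d_in * d_mid      # 1x1 convolution into the expanded representation
    depthwise = k * k * d_mid  # one k x k filter per expanded channel
    project = d_mid * d_out    # 1x1 linear projection back to a thin layer
    params = expand + depthwise + project
    # At stride 1 every weight is applied once per spatial position.
    madds = h * w * params
    return {"params": params, "madds": madds}


def full_conv_params(channels, k=3):
    """Weights of a regular k x k convolution at the expanded width, for contrast."""
    return k * k * channels * channels


block = inverted_residual_cost(h=56, w=56, d_in=24, d_out=24)
dense = full_conv_params(channels=6 * 24)
print(block["params"], block["madds"], dense)
```

The comparison with `full_conv_params` shows why the expanded layer stays cheap: the depthwise stage grows linearly with the expanded width, whereas a regular convolution at that width would grow quadratically.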
    CycleGAN, a Master of Steganography
    Casey Chu
    NIPS 2017 Workshop “Machine Deception” (2017)
    Preview abstract CycleGAN is one of the latest successful approaches to learning a correspondence between two distinct probability distributions. However, it may not always be possible or easy to find a natural one-to-one mapping between two domains. We demonstrate that in such cases the CycleGAN model tends to "hide" at least some information about the input sample in the indistinguishable noise added to the output. This makes the network output look "realistic", while also allowing the complementary transformation to recover the original sample and thus satisfy the cycle consistency requirement. View details
    Preview abstract Deep convolutional networks are well-known for their high computational and memory demands. Given limited resources, how does one design a network that balances its size, training time, and prediction accuracy? A surprisingly effective approach to trade accuracy for size and speed is to simply reduce the number of channels in each convolutional layer by a fixed fraction and retrain the network. In many cases this leads to significantly smaller networks with only minimal changes to accuracy. In this paper, we take a step further by empirically examining a strategy for deactivating connections between filters in convolutional layers in a way that allows us to harvest savings both in run-time and memory for many network architectures. More specifically, we generalize 2D convolution to use a channel-wise sparse connection structure and show that this leads to significantly better results than the baseline approach for large networks including VGG and Inception V3. View details
    Preview abstract Deep neural networks have dramatically advanced the state of the art for many areas of machine learning. Recently they have been shown to have a remarkable ability to generate highly complex visual artifacts such as images and text rather than simply recognize them. In this work we use neural networks to effectively invert low-dimensional face embeddings while producing realistic-looking, consistent images. Our contribution is twofold. First, we show that gradient-ascent-style approaches can be used to reproduce consistent images with the help of a guiding image. Second, we demonstrate that we can train a separate neural network to effectively solve the minimization problem in one pass, and generate images in real time. We then evaluate the loss imposed by using a neural network instead of gradient descent by comparing the final values of the minimized loss function. View details
    Modeling the Parallel Execution of Black-Box Services
    Gideon Mann
    Darja Krushevskaja
    Sudipto Guha
    Eyal Even-Dar
    HotCloud, Usenix (2011)
    Diagnosing Latency in Multi-Tier Black-Box Services
    Gideon Mann
    5th Workshop on Large Scale Distributed Systems and Middleware (LADIS 2011) (to appear)
    Preview abstract As multi-tier cloud applications become pervasive, we need better tools for understanding their performance. This paper presents a system that analyzes observed or desired changes to end-to-end latency profile in a large distributed application, and identifies their underlying causes. It recognizes changes to system configuration, workload, or performance of individual services that lead to the observed or desired outcome. Experiments on an industrial datacenter demonstrate the utility of the system. View details
    Using Mixture Models for Collaborative Filtering
    Jon Kleinberg
    Journal of Computer and System Science, vol. 74, no. 1 (2008), pp. 49-69
    Theory research at Google
    Nir Ailon
    Florin Constantin
    Eyal Even-Dar
    Gereon Frahling
    Monika R. Henzinger
    S. Muthukrishnan
    Noam Nisan
    Anastasios Sidiropoulos
    SIGACT News, vol. 39 (2008), pp. 10-28
    Understanding latency variations of black box services
    Darja Krushevskaja
    22nd International World Wide Web Conference, WWW '13 (2013), pp. 703-714
    Privacy via pseudorandom sketches
    Nina Mishra
    PODS (2006), pp. 143-152
    On the use of linear programming for unsupervised text classification
    KDD (2005), pp. 256-264
    On Learning Mixtures of Heavy-Tailed Distributions
    Anirban Dasgupta
    John E. Hopcroft
    Jon M. Kleinberg
    FOCS (2005), pp. 491-500
    Using mixture models for collaborative filtering
    Jon M. Kleinberg
    STOC (2004), pp. 569-578
    Network failure detection and graph connectivity
    Jon M. Kleinberg
    Aleksandrs Slivkins
    SODA (2004), pp. 76-85
    Convergent algorithms for collaborative filtering
    Jon M. Kleinberg
    ACM Conference on Electronic Commerce (2003), pp. 1-10