Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 291 publications
    Abstract: Reconstruction of neural circuits from volume electron microscopy data requires the tracing of cells in their entirety, including all their neurites. Automated approaches have been developed for tracing, but their error rates are too high to generate reliable circuit diagrams without extensive human proofreading. We present flood-filling networks, a method for automated segmentation that, similar to most previous efforts, uses convolutional neural networks, but contains in addition a recurrent pathway that allows the iterative optimization and extension of individual neuronal processes. We used flood-filling networks to trace neurons in a dataset obtained by serial block-face electron microscopy of a zebra finch brain. Using our method, we achieved a mean error-free neurite path length of 1.1 mm, and we observed only four mergers in a test set with a path length of 97 mm. The performance of flood-filling networks was an order of magnitude better than that of previous approaches applied to this dataset, although with substantially increased computational costs.
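    As a companion to the abstract above, the following is a minimal, schematic Python sketch of an iterative flood-filling loop: a network repeatedly refines an object-probability map inside a moving field of view and re-centers wherever the object probability becomes high. The predict_object_map function is a hypothetical placeholder for the trained recurrent ConvNet, and the field-of-view size, threshold, and movement policy are illustrative assumptions, not the paper's settings.

        # Schematic sketch only; the real model, FOV size, and movement policy come from the paper.
        import numpy as np
        from collections import deque

        FOV = 17           # assumed field-of-view edge length
        MOVE_THRESH = 0.9  # assumed threshold for re-centering the field of view

        def predict_object_map(image_patch, mask_patch):
            """Placeholder for the ConvNet: returns an updated object-probability patch."""
            return np.clip(mask_patch + 0.1 * (image_patch > image_patch.mean()), 0.0, 1.0)

        def flood_fill(volume, seed):
            """Grow one object's probability map outward from a seed voxel."""
            mask = np.full(volume.shape, 0.05, dtype=np.float32)
            r = FOV // 2
            queue, visited = deque([seed]), set()
            while queue:
                center = queue.popleft()
                if center in visited:
                    continue
                visited.add(center)
                sl = tuple(slice(c - r, c + r + 1) for c in center)
                if any(s.start < 0 or s.stop > d for s, d in zip(sl, volume.shape)):
                    continue  # skip fields of view that fall outside the volume
                mask[sl] = predict_object_map(volume[sl], mask[sl])
                z, y, x = center
                # Re-center on face-adjacent offsets where the object probability became high.
                for dz, dy, dx in [(-r, 0, 0), (r, 0, 0), (0, -r, 0), (0, r, 0), (0, 0, -r), (0, 0, r)]:
                    if mask[z + dz, y + dy, x + dx] > MOVE_THRESH:
                        queue.append((z + dz, y + dy, x + dx))
            return mask

        # Toy usage on a random volume.
        vol = np.random.default_rng(0).random((64, 64, 64))
        seg = flood_fill(vol, seed=(32, 32, 32))
        print(seg.shape, float(seg.max()))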
    Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the reference signal’s prosody with fine time detail. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results and audio samples from a single-speaker and 44-speaker Tacotron model on a prosody transfer task.
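    To make the conditioning mechanism concrete, here is a small numpy sketch under stated assumptions: a reference mel spectrogram is collapsed to a fixed prosody vector (a crude mean-pool standing in for the paper's reference encoder) and broadcast-concatenated onto every text encoder state before decoding. The shapes and pooling choice are illustrative only.

        import numpy as np

        def prosody_embedding(reference_mel, proj):
            """Collapse a reference mel spectrogram (T x n_mels) into a fixed-size vector."""
            summary = reference_mel.mean(axis=0)   # crude stand-in for the reference encoder
            return np.tanh(summary @ proj)         # (n_mels,) @ (n_mels, d) -> (d,)

        def condition_text_encoding(text_states, prosody_vec):
            """Broadcast-concatenate the prosody vector onto every text encoder state."""
            tiled = np.tile(prosody_vec, (text_states.shape[0], 1))
            return np.concatenate([text_states, tiled], axis=1)

        # Toy shapes: 400 mel frames of 80 bins, 60 encoded text states of width 256.
        rng = np.random.default_rng(0)
        mel = rng.normal(size=(400, 80))
        proj = rng.normal(size=(80, 128))
        text = rng.normal(size=(60, 256))
        conditioned = condition_text_encoding(text, prosody_embedding(mel, proj))
        print(conditioned.shape)  # (60, 384)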
    Burst Denoising with Kernel Prediction Networks
    Ben Mildenhall
    Jiawen Chen
    Dillon Sharlet
    Ren Ng
    Rob Carroll
    CVPR (2018) (to appear)
    Abstract: We present a technique for jointly denoising bursts of images taken from a handheld camera. In particular, we propose a convolutional neural network architecture for predicting spatially varying kernels that can both align and denoise frames, a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima. Our model matches or outperforms the state-of-the-art across a wide range of noise levels on both real and synthetic data.
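    The core kernel-prediction step lends itself to a short sketch: given per-pixel K x K kernels for each burst frame, filter every frame with its spatially varying kernel and average the results. The random kernels below are stand-ins for the ConvNet's predictions, and the noise model and annealed loss are omitted.

        import numpy as np

        def apply_per_pixel_kernels(frames, kernels):
            """frames: (N, H, W); kernels: (N, H, W, K, K) -> denoised (H, W) image."""
            n, h, w = frames.shape
            k = kernels.shape[-1]
            r = k // 2
            padded = np.pad(frames, ((0, 0), (r, r), (r, r)), mode="reflect")
            out = np.zeros((h, w))
            for i in range(n):
                for y in range(h):
                    for x in range(w):
                        patch = padded[i, y:y + k, x:x + k]
                        out[y, x] += np.sum(patch * kernels[i, y, x])
            return out / n

        # Toy usage: a 4-frame burst with normalized 5x5 kernels.
        rng = np.random.default_rng(1)
        burst = rng.normal(size=(4, 32, 32))
        kern = rng.random(size=(4, 32, 32, 5, 5))
        kern /= kern.sum(axis=(-1, -2), keepdims=True)
        print(apply_per_pixel_kernels(burst, kern).shape)  # (32, 32)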
    COCO-Stuff: Thing and Stuff Classes in Context
    Holger Caesar
    Vittorio Ferrari
    CVPR (2018) (to appear)
    Abstract: Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While many classification and detection works focus on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow us to explain important aspects of an image, including (1) scene type; (2) which thing classes are likely to be present and their location (through contextual reasoning); (3) physical attributes, material types and geometric properties of the scene. To understand stuff and things in context we introduce COCO-Stuff, which augments 120,000 images of the COCO dataset with pixel-wise annotations for 91 stuff classes. We introduce an efficient stuff annotation protocol based on superpixels which leverages the original thing annotations. We quantify the speed versus quality trade-off of our protocol and explore the relation between annotation time and boundary complexity. Furthermore, we use COCO-Stuff to analyze: (a) the importance of stuff and thing classes in terms of their surface cover and how frequently they are mentioned in image captions; (b) the spatial relations between stuff and things, highlighting the rich contextual relations that make our dataset unique; (c) the performance of a modern semantic segmentation method on stuff and thing classes, and whether stuff is easier to segment than things.
    Abstract: Transferring the knowledge learned from large-scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and is thus difficult to scale. In this work, we first tackle a problem in large-scale FGVC: our method won first place in the iNaturalist 2017 large-scale species classification challenge. Central to the success of our approach is a training scheme that uses higher image resolution and deals with the long-tailed distribution of training data. Next, we study transfer learning via fine-tuning from large-scale datasets to small-scale, domain-specific FGVC datasets. We propose a measure that estimates domain similarity via Earth Mover’s Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure. Our proposed transfer learning approach outperforms ImageNet pre-training and obtains state-of-the-art results on multiple commonly used FGVC datasets.
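    The Earth Mover's Distance idea above can be sketched compactly: summarize each domain by weighted per-class feature centroids and define similarity as exp(-gamma * EMD) between the two weighted point sets. The generic LP solver, the random centroids, and the gamma value below are assumptions for illustration; the paper's feature extractor and weighting scheme are not reproduced here.

        import numpy as np
        from scipy.optimize import linprog
        from scipy.spatial.distance import cdist

        def emd(src_feats, src_w, tgt_feats, tgt_w):
            """Exact EMD between two weighted point clouds via a transportation LP."""
            cost = cdist(src_feats, tgt_feats)   # (m, n) ground distances
            m, n = cost.shape
            # Flow variables f_ij >= 0 with row sums src_w and column sums tgt_w.
            a = np.zeros((m + n, m * n))
            for i in range(m):
                a[i, i * n:(i + 1) * n] = 1.0
            for j in range(n):
                a[m + j, j::n] = 1.0
            b = np.concatenate([src_w, tgt_w])
            res = linprog(cost.ravel(), A_eq=a, b_eq=b, bounds=(0, None), method="highs")
            return res.fun

        def domain_similarity(src_feats, src_w, tgt_feats, tgt_w, gamma=0.01):
            return np.exp(-gamma * emd(src_feats, src_w, tgt_feats, tgt_w))

        # Toy usage: 5 source-class centroids versus 3 target-class centroids.
        rng = np.random.default_rng(2)
        s, t = rng.normal(size=(5, 64)), rng.normal(size=(3, 64))
        sw, tw = np.full(5, 0.2), np.full(3, 1 / 3)
        print(domain_similarity(s, sw, t, tw))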
    Stereo Magnification: Learning view synthesis using multiplane images
    Tinghui Zhou
    John Flynn
    Graham Fyffe
    ACM Trans. Graph. (Proc. SIGGRAPH), vol. 37 (2018)
    Abstract: The view synthesis problem—generating novel views of a scene from known imagery—has garnered recent attention due in part to compelling applications in virtual and augmented reality. In this paper, we explore an intriguing scenario for view synthesis: extrapolating views from imagery captured by narrow-baseline stereo cameras, including dual-lens camera phones and VR cameras. We call this problem stereo magnification, and propose a new learning framework that leverages a new layered representation that we call multiplane images (MPIs), as well as a massive new data source for learning view extrapolation: online videos on YouTube. Using data mined from such videos, we train a deep network that predicts an MPI from an input stereo image pair. This inferred MPI can then be used to synthesize a range of novel views of the scene, including views that extrapolate significantly beyond the input baseline. We show that our method compares favorably with several recent view synthesis methods, and demonstrate applications in magnifying narrow-baseline stereo images.
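    The rendering half of the multiplane image (MPI) representation reduces to repeated alpha compositing, sketched below: the RGBA planes are combined back to front with the standard "over" operation. Real novel-view rendering also warps each plane into the target camera via a per-depth homography, which this sketch omits; the plane count and shapes are illustrative.

        import numpy as np

        def composite_mpi(colors, alphas):
            """colors: (D, H, W, 3), alphas: (D, H, W); plane 0 is the farthest."""
            out = np.zeros(colors.shape[1:])
            for rgb, a in zip(colors, alphas):   # back to front
                out = rgb * a[..., None] + out * (1.0 - a[..., None])
            return out

        rng = np.random.default_rng(3)
        planes_rgb = rng.random(size=(32, 24, 24, 3))      # 32 depth planes
        planes_a = rng.random(size=(32, 24, 24))
        print(composite_mpi(planes_rgb, planes_a).shape)   # (24, 24, 3)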
    Thoracic Disease Identification and Localization with Limited Supervision
    Chong Wang
    Mei Han
    Wei Wei
    Jia Li
    Fei-Fei Li
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
    Abstract: Accurate identification and localization of abnormalities in radiology images play an integral part in clinical diagnosis and treatment planning. Building a highly accurate prediction model for these tasks usually requires a large number of images manually annotated with disease labels and the sites of abnormal findings. In reality, however, such annotated data are expensive to acquire, especially those with location annotations. We therefore need methods that work well with only a small number of location annotations. To address this challenge, we present a unified approach that simultaneously performs disease identification and localization through the same underlying model for all images. We demonstrate that our approach can effectively leverage both class information and limited location annotations, and that it significantly outperforms the reference baseline in both classification and localization tasks.
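    One common way to connect image-level labels to localization, shown purely as an illustration, is a multiple-instance formulation: the model scores a grid of image patches and pools them into an image-level probability, here with noisy-OR pooling. This is a generic sketch of that idea, not necessarily the paper's exact formulation or loss.

        import numpy as np

        def image_probability(patch_probs):
            """patch_probs: (G, G) per-patch probabilities for one disease."""
            return 1.0 - np.prod(1.0 - patch_probs)   # disease present if any patch fires

        def localize(patch_probs, threshold=0.5):
            """Grid cells whose score exceeds the threshold act as the predicted sites."""
            return np.argwhere(patch_probs > threshold)

        rng = np.random.default_rng(4)
        grid = rng.random((16, 16)) * 0.2
        grid[4:6, 9:11] = 0.9                          # a simulated abnormal region
        print(image_probability(grid))
        print(localize(grid))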
    Abstract: The Rapid and Accurate Image Super Resolution (RAISR) method of Romano, Isidoro, and Milanfar is a computationally efficient image upscaling method using a trained set of filters. We describe a generalization of RAISR, which we name Best Linear Adaptive Enhancement (BLADE). This approach is a trainable edge-adaptive filtering framework that is general, simple, computationally efficient, and useful for a wide range of image processing problems. We show applications to denoising, compression artifact removal, demosaicing, and approximation of anisotropic diffusion equations.
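    The flavor of such trainable edge-adaptive filtering can be sketched in a few lines: hash each pixel to a filter bucket from local gradient statistics, then apply that bucket's linear filter. The orientation-only bucketing and the random "learned" filter bank below are simplifications made for illustration; the actual filter selection and training follow the paper.

        import numpy as np

        def edge_adaptive_filter(image, filters, n_orient=8):
            """image: (H, W); filters: (n_orient, K, K), one kernel per orientation bucket."""
            k = filters.shape[-1]
            r = k // 2
            gy, gx = np.gradient(image)
            orient = ((np.arctan2(gy, gx) + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient
            padded = np.pad(image, r, mode="reflect")
            out = np.empty_like(image)
            h, w = image.shape
            for y in range(h):
                for x in range(w):
                    out[y, x] = np.sum(padded[y:y + k, x:x + k] * filters[orient[y, x]])
            return out

        rng = np.random.default_rng(5)
        img = rng.random((48, 48))
        bank = rng.random((8, 7, 7))
        bank /= bank.sum(axis=(1, 2), keepdims=True)
        print(edge_adaptive_filter(img, bank).shape)  # (48, 48)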
    Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
    Abstract: Human vision is able to immediately recognize novel visual categories after seeing just one or a few training examples. We describe how to add a similar capability to ConvNet classifiers by directly setting the final layer weights from novel training examples during low-shot learning. We call this process weight imprinting, as it directly sets penultimate layer weights based on an appropriately scaled copy of their activations for that training example. The imprinting process provides a valuable complement to training with stochastic gradient descent, as it provides immediate good classification performance and an initialization for any further fine-tuning in the future. We show how this imprinting process is related to proxy-based embeddings. However, it differs in that only a single imprinted weight vector is learned for each novel category, rather than relying on a nearest-neighbor distance to training instances as typically used with embedding methods. Our experiments show that using averaging of imprinted weights provides better generalization than using nearest-neighbor instance embeddings. A key change to traditional ConvNet classifiers is the introduction of a scaled normalization layer that allows activations to be directly imprinted as weights.
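    The imprinting step itself is simple enough to sketch directly: the L2-normalized penultimate-layer embedding of a novel example (or the mean of a few) becomes the new class's row in the classifier weight matrix. The random embeddings below stand in for the ConvNet's penultimate activations.

        import numpy as np

        def l2_normalize(v, eps=1e-12):
            return v / (np.linalg.norm(v) + eps)

        def imprint(weights, novel_embeddings):
            """weights: (n_classes, d); novel_embeddings: (k, d) embedded examples of one new class."""
            new_row = l2_normalize(novel_embeddings.mean(axis=0))  # average, then renormalize
            return np.vstack([weights, new_row])                   # classifier gains one class

        rng = np.random.default_rng(6)
        w = rng.normal(size=(100, 64))       # existing 100-way classifier weights
        examples = rng.normal(size=(5, 64))  # 5 embedded examples of the novel class
        print(imprint(w, examples).shape)    # (101, 64)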
    In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images
    Eric Christiansen
    Mike Ando
    Ashkan Javaherian
    Gaia Skibinski
    Scott Lipnick
    Elliot Mount
    Alison O'Neil
    Kevan Shah
    Alicia K. Lee
    Piyush Goyal
    Liam Fedus
    Andre Esteva
    Lee Rubin
    Steven Finkbeiner
    Cell (2018)
    Abstract: Imaging is a central method in the life sciences, and the drive to extract information from microscopy has led to methods to fluorescently label specific cellular constituents. However, the specificity of fluorescent labels varies, labeling can confound biological measurements, and spectral overlap limits the number of labels that can be resolved simultaneously to just a few. Here, we developed a deep learning computational approach called “in silico labeling” (ISL) that reliably infers information from unlabeled biological samples that would normally require invasive labeling. ISL predicts different labels in multiple cell types from independent laboratories. It makes cell type predictions by integrating in silico labels, and is not limited by spectral overlap. The network learned generalized features, enabling it to solve new problems with small training datasets. Thus, at negligible additional cost, ISL provides biological insights from images of unlabeled samples that would otherwise be undesirable or impossible to measure directly.
    Learning Intelligent Dialogs for Bounding-Box Annotation
    Ksenia Konyushkova
    Chris Lampert
    Vittorio Ferrari
    CVPR (2018) (to appear)
    Abstract: We introduce Intelligent Annotation Dialogs for bounding box annotation. We train an agent to automatically choose a sequence of actions for a human annotator to produce a bounding box in a minimal amount of time. Specifically, we consider two actions: box verification [34], where the annotator verifies a box generated by an object detector, and manual box drawing. We explore two kinds of agents, one based on predicting the probability that a box will be positively verified, and the other based on reinforcement learning. We demonstrate that (1) our agents are able to learn efficient annotation strategies in several scenarios, automatically adapting to the difficulty of an input image, the desired quality of the boxes, the strength of the detector, and other factors; (2) in all scenarios the resulting annotation dialogs speed up annotation compared to manual box drawing alone and box verification alone, while also outperforming any fixed combination of verification and drawing in most scenarios; (3) in a realistic scenario where the detector is iteratively re-trained, our agents evolve a series of strategies that reflect the shifting trade-off between verification and drawing as the detector grows stronger.
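    A toy decision rule conveys the spirit of the probability-based agent: request verification first only when its expected annotation time beats drawing outright. The time constants and the verification-probability input are assumptions made for illustration; the paper learns and evaluates these quantities from data.

        T_VERIFY = 3.5  # assumed seconds for one verification click
        T_DRAW = 25.0   # assumed seconds to draw a box manually

        def choose_action(p_verified: float) -> str:
            """Pick the action with the lower expected time to an accepted box."""
            expected_verify_path = T_VERIFY + (1.0 - p_verified) * T_DRAW  # fall back to drawing on rejection
            return "verify" if expected_verify_path < T_DRAW else "draw"

        for p in (0.1, 0.5, 0.95):
            print(p, choose_action(p))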
    Abstract: Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
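    As a sketch of how the class-agnostic constraints turn into training signal, the snippet below forms one triplet (a noise-corrupted copy as the positive, an unrelated clip as the negative) and evaluates the standard triplet hinge loss on the embeddings. A fixed random projection stands in for the convolutional embedding network, and the margin is an illustrative choice.

        import numpy as np

        def embed(spectrogram, proj):
            return spectrogram.ravel() @ proj   # stand-in for the ConvNet embedder

        def triplet_loss(anchor, positive, negative, margin=0.1):
            d_pos = np.sum((anchor - positive) ** 2)
            d_neg = np.sum((anchor - negative) ** 2)
            return max(0.0, d_pos - d_neg + margin)

        rng = np.random.default_rng(7)
        clip = rng.random((96, 64))                            # anchor log-mel patch
        positive = clip + 0.05 * rng.normal(size=clip.shape)   # noisy copy keeps its category
        negative = rng.random((96, 64))                        # unrelated clip
        proj = rng.normal(size=(96 * 64, 128))
        print(triplet_loss(embed(clip, proj), embed(positive, proj), embed(negative, proj)))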
    Fast, Trainable, Multiscale Denoising
    John Isidoro
    IEEE International Conference on Image Processing (ICIP) (2018) (to appear)
    Abstract: Denoising is a fundamental imaging application, and mobile camera systems demand filtering that is both versatile and fast. We present an approach to multiscale filtering that enables real-time applications on low-powered devices. The key idea is to learn a set of kernels that upscale, filter, and blend patches of different scales, guided by local structure analysis. The approach is trainable, so the learned filters are capable of treating diverse noise patterns and artifacts. Experimental results show that the presented approach produces results comparable to state-of-the-art algorithms while running orders of magnitude faster.
    Aperture Supervision for Monocular Depth Estimation
    Pratul Srinivasan
    Neal Wadhwa
    Ren Ng
    CVPR (2018) (to appear)
    Abstract: We present a novel method to train machine learning algorithms to estimate scene depths from a single image, by using the information provided by a camera's aperture as supervision. Prior works use a depth sensor's outputs or images of the same scene from alternate viewpoints as supervision, while our method instead uses images from the same viewpoint taken with a varying camera aperture. To enable learning algorithms to use aperture effects as supervision, we introduce two differentiable aperture rendering functions that use the input image and predicted depths to simulate the depth-of-field effects caused by real camera apertures. We train a monocular depth estimation network end-to-end to predict the scene depths that best explain these finite aperture images as defocus-blurred renderings of the input all-in-focus image.
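    To give a feel for rendering defocus from a predicted depth map, here is a crude gather-based blur in which each pixel's blur radius grows with its distance from the focal plane. It illustrates the general idea only, not the paper's differentiable aperture rendering functions.

        import numpy as np

        def render_defocus(image, depth, focal_depth, max_radius=4):
            """image, depth: (H, W); blur radius per pixel scales with |depth - focal_depth|."""
            h, w = image.shape
            radius = np.clip(np.abs(depth - focal_depth) * max_radius, 0, max_radius).astype(int)
            padded = np.pad(image, max_radius, mode="reflect")
            out = np.empty_like(image)
            for y in range(h):
                for x in range(w):
                    r = radius[y, x]
                    win = padded[y + max_radius - r:y + max_radius + r + 1,
                                 x + max_radius - r:x + max_radius + r + 1]
                    out[y, x] = win.mean()   # box blur as a stand-in for a disc kernel
            return out

        rng = np.random.default_rng(8)
        img, d = rng.random((32, 32)), rng.random((32, 32))
        print(render_defocus(img, d, focal_depth=0.5).shape)  # (32, 32)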