Matthew Brown
Authored Publications
MoViNets: Mobile Video Networks for Efficient Video Recognition
Dan Kondratyuk
Liangzhe Yuan
Yandong Li
Li Zhang
Boqing Gong
CVPR 2021
We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to deploy on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate efficient and diverse 3D CNN architectures. Second, we introduce the Stream Buffer technique, which decouples memory from video clip duration, allowing 3D CNNs to embed arbitrary-length streaming video sequences for both training and inference with a small constant memory footprint. Third, we propose a simple ensembling technique to further improve accuracy without sacrificing efficiency. These three progressive techniques allow MoViNets to achieve state-of-the-art accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For instance, MoViNet-A5-Stream achieves the same accuracy as X3D-XL on Kinetics 600 while requiring 80% fewer FLOPs and 65% less memory. Code is available at https://github.com/tensorflow/models/tree/master/official/projects/movinet.
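The key mechanism is the Stream Buffer: a small cache of activations carried across clip boundaries, so causal temporal convolutions see an unbroken stream at constant memory. Below is a minimal NumPy sketch of that idea; the depthwise kernel, shapes, and clip sizes are illustrative assumptions, not the released MoViNet implementation (see the linked repository for that).

```python
import numpy as np

def stream_conv1d_temporal(clip, weights, buffer):
    """Causal temporal convolution over one clip of a stream.

    clip:    (T, C) activations for the current clip (time, channels)
    weights: (K, C) temporal kernel of size K (depthwise, per-channel)
    buffer:  (K-1, C) cached activations from the end of the previous clip
    Returns (output of shape (T, C), updated buffer).
    """
    K = weights.shape[0]
    padded = np.concatenate([buffer, clip], axis=0)   # (K-1+T, C)
    out = np.stack([
        (padded[t:t + K] * weights).sum(axis=0)       # causal window ending at t
        for t in range(clip.shape[0])
    ])
    new_buffer = padded[-(K - 1):]                    # carry only K-1 frames forward
    return out, new_buffer

# Streaming over clips matches processing the whole video at once:
rng = np.random.default_rng(0)
video = rng.normal(size=(16, 8))                      # 16 frames, 8 channels
w = rng.normal(size=(3, 8))                           # temporal kernel size 3

buf = np.zeros((2, 8))                                # zero-initialized buffer
outs = []
for clip in np.split(video, 4):                       # four 4-frame clips
    o, buf = stream_conv1d_temporal(clip, w, buf)
    outs.append(o)
full, _ = stream_conv1d_temporal(video, w, np.zeros((2, 8)))
assert np.allclose(np.concatenate(outs), full)
```

The memory cost per layer is the (K-1, C) buffer, independent of how long the video runs, which is the property the abstract describes.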
FiG-NeRF: Figure-Ground Neural Radiance Fields for 3D Object Category Modelling
Christopher Xie
Keunhong Park
Ricardo Martin Brualla
International Conference on 3D Vision (2021), to appear
We investigate the use of Neural Radiance Fields (NeRF) to learn high-quality 3D object category models from collections of input images. In contrast to previous work, we are able to do this whilst simultaneously separating foreground objects from their varying backgrounds. We achieve this via a two-component NeRF model, FiG-NeRF, that prefers to explain the scene as a geometrically constant background plus a deformable foreground that represents the object category. We show that this method can learn accurate 3D object category models using only photometric supervision and casually captured images of the objects. Additionally, our two-part decomposition allows the model to perform accurate and crisp amodal segmentation. We quantitatively evaluate our method with view synthesis and image fidelity metrics, using synthetic, lab-captured, and in-the-wild data. Our results demonstrate convincing 3D object category modelling that exceeds the performance of existing methods.
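A rough sketch of the two-component idea: foreground and background fields are queried at the same ray samples, their densities are summed, and volume rendering then attributes each pixel to whichever component explains it, which is also what yields the amodal foreground mask. Everything below (shapes, the density-weighted mixing rule, the epsilon) is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def composite_two_nerfs(sigma_fg, rgb_fg, sigma_bg, rgb_bg, deltas):
    """Volume-render one ray through a foreground and a background field.

    sigma_*: (N,) densities at N ray samples; rgb_*: (N, 3) colors;
    deltas:  (N,) distances between consecutive samples.
    """
    sigma = sigma_fg + sigma_bg
    # Color at each sample is the density-weighted mix of the two fields.
    w_fg = sigma_fg / np.maximum(sigma, 1e-10)
    rgb = w_fg[:, None] * rgb_fg + (1.0 - w_fg)[:, None] * rgb_bg
    alpha = 1.0 - np.exp(-sigma * deltas)             # per-sample opacity
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = alpha * trans                           # rendering weights
    color = (weights[:, None] * rgb).sum(axis=0)
    fg_mask = (weights * w_fg).sum()                  # foreground coverage of the ray
    return color, fg_mask
```

Because the foreground mask falls out of the rendering weights rather than a 2D classifier, it is defined even where the object is occluded, which is what makes the segmentation amodal.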
When Ensembling Smaller Models is More Efficient than Single Large Models
Dan Kondratyuk
Boqing Gong
Visual Understanding by Learning from Web Data 2020, CVPR (2020)
Ensembling is a simple and popular technique for boosting evaluation performance by training multiple models (e.g., with different initializations) and aggregating their predictions. This approach is commonly reserved for the largest models, as it is commonly held that increasing model size yields a more substantial reduction in error than ensembling smaller models. However, our experiments on CIFAR-10 and ImageNet show that ensembles of smaller models can outperform single large models, achieving higher accuracy while requiring fewer total FLOPs to compute, even when the individual models' weights and hyperparameters are highly optimized. Furthermore, this gap widens as models become larger. This presents an interesting observation: the output diversity gained by ensembling can often be a more efficient use of compute than training larger models, especially as models approach the limits of what their dataset can support. Instead of tuning a single large model, as is common practice, one can use ensembles as a more flexible trade-off between a model's inference speed and accuracy. This also potentially eases hardware design, e.g., by making it easier to parallelize the model across multiple workers for real-time or distributed inference.
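The mechanics are as simple as the abstract suggests: run N independently trained models and average their class probabilities. A toy sketch, where fixed random linear "models" stand in for trained networks:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the softmax outputs of several independently trained models."""
    probs = [m(x) for m in models]                # each (batch, classes)
    return np.mean(probs, axis=0)

def make_model(seed, d_in=16, classes=10):
    """A stand-in 'model': a fixed random linear layer plus softmax."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d_in, classes))
    def model(x):
        logits = x @ W
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    return model

models = [make_model(s) for s in range(3)]        # three small models
x = np.random.default_rng(0).normal(size=(4, 16))
print(ensemble_predict(models, x).argmax(axis=1))
```

The inference cost is the sum of the members' FLOPs, and the members are trivially parallelizable across workers, which is the hardware point made above.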
GeLaTO: Generative Latent Textured Objects
Ricardo Martin Brualla
Sofien Bouaziz
Dan B Goldman
European Conference on Computer Vision (2020)
Accurate modeling of 3D objects exhibiting transparency, reflections and thin structures is an extremely challenging problem. Inspired by billboards and geometric proxies used in computer graphics, this paper proposes Generative Latent Textured Objects (GeLaTO), a compact representation that combines a set of coarse shape proxies defining low-frequency geometry with learned neural textures, to encode both medium- and fine-scale geometry as well as view-dependent appearance. To generate the proxies' textures, we learn a joint latent space allowing category-level appearance and geometry interpolation. The proxies are independently rasterized with their corresponding neural texture and composited using a U-Net, which generates an output photorealistic image including an alpha map. We demonstrate the effectiveness of our approach by reconstructing complex objects from a sparse set of views. We show results on a dataset of real images of eyeglasses frames, which are particularly challenging to reconstruct with classical methods. We also demonstrate that these coarse proxies can be handcrafted when the underlying object geometry is easy to model, like eyeglasses, or generated using a neural network for more complex categories, such as cars.
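A simplified sketch of the compositing stage: each proxy is rasterized with its neural texture into a feature image plus a coverage map, and the renders are then combined. Here a fixed back-to-front "over" operation stands in for the learned U-Net compositor, and all shapes are illustrative assumptions.

```python
import numpy as np

def composite_proxies(feature_layers, alphas):
    """Back-to-front 'over' compositing of per-proxy rendered features.

    feature_layers: list of (H, W, C) feature images, one per shape proxy,
                    ordered back to front (each proxy rasterized with its
                    neural texture beforehand).
    alphas:         list of (H, W) coverage maps in [0, 1].
    GeLaTO feeds the per-proxy renders to a U-Net instead; plain 'over'
    compositing stands in for that learned step here.
    """
    H, W, C = feature_layers[0].shape
    out = np.zeros((H, W, C))
    acc_alpha = np.zeros((H, W))
    for feat, a in zip(feature_layers, alphas):       # nearer layers paint over
        out = feat * a[..., None] + out * (1.0 - a[..., None])
        acc_alpha = a + acc_alpha * (1.0 - a)
    return out, acc_alpha                             # features + alpha map
```

Replacing this fixed blend with a U-Net lets the model resolve occlusion, transparency, and thin structures that the coarse proxies cannot represent geometrically.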
Federated Visual Classification with Real-World Data Distribution
Tzu-Ming Harry Hsu
Hang Qi
European Conference on Computer Vision (2020)
Federated Learning enables visual models to be trained on-device, bringing advantages for user privacy (data need never leave the device), but challenges in terms of data diversity and quality. Whilst typical models in the datacenter are trained using data that are independent and identically distributed (IID), data at source are typically far from IID. In this work, we characterize the effect this non-identical distribution has on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm. To do so, we introduce two new large-scale datasets for species and landmark classification, with realistic per-user data splits that simulate real-world edge learning scenarios. We also develop two new algorithms (FedVC, FedIR) that intelligently resample and reweight over the client pool, bringing large improvements in accuracy and stability in training.
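To give a concrete flavor of one of these, here is a minimal sketch of importance reweighting in the spirit of FedIR: each client scales its per-example loss by the ratio of the global to the local label frequency, so optimization on a skewed client better approximates the global objective. The function and variable names are hypothetical, and the exact weighting in the paper may differ.

```python
import numpy as np

def fedir_weights(client_labels, global_label_freq):
    """Per-example importance weights for one client's loss terms.

    Reweights toward the global label distribution:
    w(y) = p_global(y) / p_client(y).
    """
    labels, counts = np.unique(client_labels, return_counts=True)
    p_client = dict(zip(labels, counts / counts.sum()))
    return np.array([global_label_freq[y] / p_client[y] for y in client_labels])

# Example: a client that over-represents class 0 relative to the population.
global_freq = {0: 0.1, 1: 0.9}
client_labels = np.array([0, 0, 0, 1])
print(fedir_weights(client_labels, global_freq))
# -> [0.133 0.133 0.133 3.6]: class 0 is down-weighted, class 1 up-weighted
```

FedVC addresses the complementary imbalance, clients of very different sizes, by resampling each client to a fixed-size "virtual client" per round.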
Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification
Tzu-Ming Harry Hsu
Hang Qi
NeurIPS Workshop on Federated Learning for Data Privacy and Confidentiality (2019)
Federated Learning brings the possibility to train visual models in a privacy-preserving way using real-world data on mobile devices. Given their distributed nature, the statistics of the data across these devices are likely to differ significantly. In this work, we look at the effect such non-identical data distributions have on visual classification via Federated Learning. We propose a way to synthesize datasets with a continuous range of identicalness and provide performance measures for the Federated Averaging algorithm. We also provide an improvement to the algorithm for when its performance falls off. Experiments on the CIFAR-10 dataset show that such modifications lead to better learning on all setups. In highly skewed settings, we are able to improve performance by up to 166%, achieving results comparable to traditional data-center learning in all but the most extreme cases.
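A common way to synthesize such a continuum, sketched below, is to draw each client's class proportions from a Dirichlet prior: the concentration parameter sweeps smoothly from near-identical clients (large alpha) to single-class clients (small alpha). Names and defaults here are illustrative.

```python
import numpy as np

def dirichlet_client_split(labels, num_clients, alpha, seed=0):
    """Partition a labelled dataset into clients with tunable identicalness.

    Each class is distributed across clients in proportions drawn from
    Dirichlet(alpha). Returns a list of index arrays, one per client.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_clients))   # class-c shares
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in zip(client_indices, np.split(idx, cuts)):
            client.extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

labels = np.repeat(np.arange(10), 100)        # toy CIFAR-10-like labels
clients = dirichlet_client_split(labels, num_clients=5, alpha=0.1)
print([len(c) for c in clients])              # highly uneven at small alpha
```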
Low-Shot Learning with Imprinted Weights
Hang Qi
David Lowe
Computer Vision and Pattern Recognition, IEEE (2018)
Human vision is able to immediately recognize novel visual categories after seeing just one or a few training examples. We describe how to add a similar capability to ConvNet classifiers by directly setting the final-layer weights from novel training examples during low-shot learning. We call this process weight imprinting, as it directly sets penultimate-layer weights based on an appropriately scaled copy of their activations for that training example. The imprinting process provides a valuable complement to training with stochastic gradient descent, as it immediately provides good classification performance and an initialization for any further fine-tuning. We show how this imprinting process is related to proxy-based embeddings. However, it differs in that only a single imprinted weight vector is learned for each novel category, rather than relying on a nearest-neighbor distance to training instances as typically used with embedding methods. Our experiments show that averaging imprinted weights provides better generalization than using nearest-neighbor instance embeddings. A key change to traditional ConvNet classifiers is the introduction of a scaled normalization layer that allows activations to be directly imprinted as weights.
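A minimal sketch of the imprinting step under the assumptions above: with a normalized penultimate layer and normalized weight columns, the final layer is a scaled cosine similarity, so a new class can be added by inserting the (re)normalized mean embedding of its examples as a new column. Function names here are hypothetical.

```python
import numpy as np

def imprint_weight(embedding_fn, examples, W):
    """Imprint a new class column into a normalized final layer.

    embedding_fn: maps an input to a penultimate-layer activation vector.
    examples:     one or a few samples of the novel class.
    W:            (D, num_classes) weight matrix with L2-normalized columns.
    Returns W extended by one column: the renormalized mean embedding.
    """
    embs = np.stack([embedding_fn(x) for x in examples])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    proto = embs.mean(axis=0)
    proto = proto / np.linalg.norm(proto)         # average, then renormalize
    return np.concatenate([W, proto[:, None]], axis=1)

def classify(embedding_fn, x, W, scale=10.0):
    """Scaled cosine-similarity classifier (the 'scaled normalization' layer)."""
    e = embedding_fn(x)
    e = e / np.linalg.norm(e)
    return scale * (e @ W)                        # logits over all classes

# Toy usage with a fixed random "embedding network".
rng = np.random.default_rng(0)
P = rng.normal(size=(32, 64))
embed = lambda x: np.tanh(x @ P)
W = rng.normal(size=(64, 5))
W /= np.linalg.norm(W, axis=0, keepdims=True)         # 5 base classes
W = imprint_weight(embed, [rng.normal(size=32)], W)   # add a class from one shot
print(classify(embed, rng.normal(size=32), W).shape)  # (6,)
```

The imprinted column can then serve as an initialization for ordinary fine-tuning when more data arrives.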
Frame-Recurrent Video Super-Resolution
Mehdi S. M. Sajjadi
Raviteja Vemulapalli
Computer Vision and Pattern Recognition, IEEE (2018)
Recent advances in video super-resolution have shown that convolutional neural networks combined with motion compensation are able to merge information from multiple low-resolution (LR) frames to create high-quality results. Current state-of-the-art methods process a batch of LR frames to generate a single high-resolution (HR) frame and run this scheme in a sliding window fashion over the entire video, effectively treating the problem as many independent multi-frame super-resolution tasks. This approach has two main weaknesses: 1) Each input frame is processed and warped multiple times, leading to redundant computations, and 2) each output frame is estimated independently, limiting the system's ability to produce temporally consistent results.
In this work, we propose an end-to-end trainable frame-recurrent video super-resolution framework that uses the previously inferred HR estimate to super-resolve the subsequent frame. This naturally encourages temporally consistent results and avoids redundant computations by warping only one image in each step. Furthermore, due to its recurrent nature, the proposed method has the ability to assimilate a large number of previous frames without increased computational demands. Extensive evaluations and comparisons with previous methods validate the strengths of our approach and demonstrate that the proposed framework is able to significantly outperform the current state of the art.
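A skeletal version of the recursion, with stand-in components so it runs: the flow and super-resolution networks are passed in as callables, and nearest-neighbor warping replaces the differentiable bilinear sampler used for training. Shapes, the x4 scale, and the zero-initialized first estimate are illustrative assumptions.

```python
import numpy as np

def warp(image, flow):
    """Backward-warp an (H, W, C) image by an (H, W, 2) flow field
    using nearest-neighbor sampling (bilinear in a real system)."""
    H, W = image.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip((ys + flow[..., 0]).round().astype(int), 0, H - 1)
    src_x = np.clip((xs + flow[..., 1]).round().astype(int), 0, W - 1)
    return image[src_y, src_x]

def frame_recurrent_sr(lr_frames, flow_net, sr_net, scale=4):
    """Super-resolve a stream by recursing on the previous HR estimate."""
    H, W, C = lr_frames[0].shape
    hr_prev = np.zeros((H * scale, W * scale, C))     # black first estimate
    lr_prev = lr_frames[0]
    outputs = []
    for lr in lr_frames:
        flow_lr = flow_net(lr_prev, lr)               # (H, W, 2) LR flow
        flow_hr = np.kron(flow_lr, np.ones((scale, scale, 1))) * scale
        warped = warp(hr_prev, flow_hr)               # align previous estimate
        hr_prev = sr_net(lr, warped)                  # fuse and super-resolve
        outputs.append(hr_prev)
        lr_prev = lr
    return outputs

# Stand-ins so the loop runs: zero flow and plain nearest upsampling.
zero_flow = lambda a, b: np.zeros(a.shape[:2] + (2,))
upsample = lambda lr, warped: np.kron(lr, np.ones((4, 4, 1)))
frames = [np.random.rand(8, 8, 3) for _ in range(5)]
print(len(frame_recurrent_sr(frames, zero_flow, upsample)))  # 5
```

Only one warp is performed per frame, and information from arbitrarily many past frames flows through `hr_prev`, which is exactly the efficiency and consistency argument above.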
Learning to Segment via Cut-and-Paste
Tal Remez
Jonathan Huang
European Conference on Computer Vision (2018)
This paper presents a weakly-supervised approach to object instance segmentation. Starting with known or predicted object bounding boxes, we learn object masks by playing a game of cut-and-paste in an adversarial learning setup. A mask generator takes a detection box and Faster R-CNN features, and constructs a segmentation mask that is used to cut-and-paste the object into a new image location. The discriminator tries to distinguish between real objects, and those cut and pasted via the generator, giving a learning signal that leads to improved object masks. We verify our method experimentally using Cityscapes, COCO, and aerial image datasets, learning to segment objects without ever having seen a mask in training. Our method exceeds the performance of existing weakly supervised methods, without requiring hand-tuned segment proposals, and reaches 90% of supervised performance.
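The adversarial signal hinges on a simple compositing operator: a predicted mask cuts the detected object out and blends it into another location, and only masks that tightly cover the object produce composites that can fool the discriminator. A minimal sketch of that operator (names and the soft-blend form are illustrative):

```python
import numpy as np

def cut_and_paste(image, mask, box, paste_xy):
    """Cut the masked object out of `box` and paste it at a new location.

    image:    (H, W, C) source image
    mask:     (h, w) soft mask in [0, 1] predicted inside the detection box
    box:      (y, x) top-left corner of the detection box (h, w from mask)
    paste_xy: (y, x) top-left corner of the paste location
    Returns the composite fed to the discriminator as a "fake" example.
    Assumes both regions lie fully inside the image.
    """
    h, w = mask.shape
    y0, x0 = box
    y1, x1 = paste_xy
    patch = image[y0:y0 + h, x0:x0 + w]
    out = image.copy()
    region = out[y1:y1 + h, x1:x1 + w]
    out[y1:y1 + h, x1:x1 + w] = (mask[..., None] * patch
                                 + (1.0 - mask[..., None]) * region)
    return out
```

An under-segmenting mask drags background along with the object, and an over-segmenting one clips it; both artifacts are easy for the discriminator to spot, which is what drives the mask toward the true object boundary.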
Enhancing Video Summarization via Vision-Language Embedding
Bryan Plummer
Svetlana Lazebnik
IEEE International Conference on Computer Vision and Pattern Recognition (2017)
This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric and TV Episodes, and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still-image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
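The summarization objective can be optimized greedily thanks to submodularity. Below is an illustrative sketch: facility-location representativeness over joint-embedding similarities, plus an optional modular term that scores each segment against an embedded free-form text query (one way test-time guidance could enter). The exact objectives and weights in the paper differ; names here are hypothetical.

```python
import numpy as np

def greedy_summary(features, k, query=None, lam=1.0):
    """Select k segments by greedy submodular maximization.

    features: (N, D) L2-normalized per-segment embeddings from a joint
              vision-language space.
    query:    optional (D,) embedded free-form text that biases the summary.
    """
    N = len(features)
    sim = features @ features.T               # pairwise cosine similarities
    covered = np.full(N, -1.0)                # best coverage so far (cos >= -1)
    selected = []
    for _ in range(k):
        gains = []
        for i in range(N):
            if i in selected:
                gains.append(-np.inf)
                continue
            # Facility-location marginal gain of adding segment i.
            gain = (np.maximum(covered, sim[i]) - covered).sum()
            if query is not None:             # modular interestingness term
                gain += lam * float(features[i] @ query)
            gains.append(gain)
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected

feats = np.random.randn(20, 8)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(greedy_summary(feats, k=3))
```

Greedy selection enjoys the usual (1 - 1/e) approximation guarantee for monotone submodular objectives, which is why this simple loop is the standard solver for such summarizers.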
Unsupervised Learning of Depth and Ego-Motion from Video
Tinghui Zhou
Noah Snavely
David Lowe
Computer Vision and Pattern Recognition, IEEE (2017)
We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. In common with recent work, we use an end-to-end learning approach with view synthesis as the supervisory signal. In contrast to the previous work, our method is completely unsupervised, requiring only monocular video sequences for training. Our method uses single-view depth and multi-view pose networks, with a loss based on warping nearby views to the target using the computed depth and pose. The networks are thus coupled by the loss during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.
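The supervisory signal reduces to a differentiable view-synthesis loss: backproject each target pixel with its predicted depth, transform by the predicted relative pose, project into the source view, sample, and penalize the photometric difference. A minimal NumPy sketch under assumed conventions (nearest-neighbor sampling instead of the bilinear sampler, a single source view, no visibility masking):

```python
import numpy as np

def view_synthesis_loss(target, source, depth, pose, K):
    """Photometric loss from warping `source` into the target view.

    target, source: (H, W, C) images; depth: (H, W) predicted target depth;
    pose: (4, 4) predicted target-to-source camera transform; K: (3, 3)
    intrinsics.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    cam = np.linalg.inv(K) @ pix * depth.reshape(-1)      # backproject to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src_cam = (pose @ cam_h)[:3]                          # move to source frame
    src_pix = K @ src_cam                                 # project into source
    src_pix = src_pix[:2] / np.maximum(src_pix[2], 1e-6)  # real systems mask
                                                          # invalid projections
    sx = np.clip(src_pix[0].round().astype(int), 0, W - 1).reshape(H, W)
    sy = np.clip(src_pix[1].round().astype(int), 0, H - 1).reshape(H, W)
    warped = source[sy, sx]                               # sample source view
    return np.abs(warped - target).mean()                 # photometric error

H, W = 8, 8
K = np.array([[8.0, 0, 4], [0, 8.0, 4], [0, 0, 1]])
tgt = np.random.rand(H, W, 3); src = np.random.rand(H, W, 3)
print(view_synthesis_loss(tgt, src, np.ones((H, W)), np.eye(4), K))
```

Because the loss depends on both the depth map and the pose through the warp, gradients couple the two networks during training even though each can be run alone at test time.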