Jump to Content
Rahul Sukthankar

Rahul Sukthankar

http://www.cs.cmu.edu/~rahuls/bio.html
Publication list below is partial. For a complete list, please see: http://www.cs.cmu.edu/~rahuls/pub/.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulties generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder. Different from the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and contain less information individually, which promote ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representation on four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining the performance on ImageNet. View details
    THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers
    Mihai Zanfir
    Andrei Zanfir
    Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
    Preview abstract We present THUNDR, a transformer-based deep neural network methodology to reconstruct the 3D pose and shape of people, given monocular RGB images. Key to our methodology is an intermediate 3D marker representation, where we aim to combine the predictive power of model-free output architectures and the regularizing, anthropometrically-preserving properties of a statistical human surface models like GHUM—a recently introduced, expressive full body statistical 3d human model, trained end-to-end. Our novel transformer-based prediction pipeline can focus on image regions relevant to the task, supports self-supervised regimes, and ensures that solutions are consistent with human anthropometry. We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models, for the task of inferring 3D human shape, joint positions, and global translation. Moreover, we observe very solid 3d reconstruction performance for difficult human poses collected in the wild. Models will be made available for research. View details
    Neural Descent for Visual 3D Human Pose and Shape
    Andrei Zanfir
    Mihai Zanfir
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), pp. 14484-14493
    Preview abstract We present a deep neural network methodology to reconstruct the 3d pose and shape of people, given image or video inputs. We rely on a recently introduced, expressive full body statistical 3d human model, GHUM, with facial expression and hand detail and aim to learn to reconstruct the model pose and shape states in a self-supervised regime. Central to our methodology, is a learning to learn approach, referred to as HUman Neural Descent (HUND) that avoids both second-order differentiation when training the model parameters, and expensive state gradient descent in order to accurately minimize a semantic differentiable rendering loss at test time. Instead, we rely on novel recurrent stages to update the pose and shape parameters such that not only losses are minimized effectively but the process is regularized in order to ensure progress. The newly introduced architecture is tested extensively, and achieves state-of-the-art results on datasets like H3.6M and 3DPW, as well as in complex imagery collected in-the-wild. View details
    Preview abstract Monocular 3D human pose and shape estimation is challenging due to the many degrees of freedom of the human body and thedifficulty to acquire training data for large-scale supervised learning incomplex visual scenes. In this paper we present practical semi-supervisedand self-supervised models that support training and good generalizationin real-world images and video. Our formulation is based on kinematiclatent normalizing flow representations and dynamics, as well as differ-entiable, semantic body part alignment loss functions that support self-supervised learning. In extensive experiments using 3D motion capturedatasets like CMU, Human3.6M, 3DPW, or AMASS, as well as imagerepositories like COCO, we show that the proposed methods outperformthe state of the art, supporting the practical construction of an accuratefamily of models based on large-scale training with diverse and incom-pletely labeled image and video data. View details
    GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models
    Hongyi Xu
    Andrei Zanfir
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (Oral) (2020), pp. 6184-6193
    Preview abstract We present a statistical, articulated 3D human shape modeling pipeline, within a fully trainable, modular, deep learning framework. Given high-resolution complete 3D body scans of humans, captured in various poses, together with additional closeups of their head and facial expressions, as well as hand articulation, and given initial, artist designed, gender neutral rigged quad-meshes, we train all model parameters including non-linear shape spaces based on variational auto-encoders, pose-space deformation correctives, skeleton joint center predictors, and blend skinning functions, in a single consistent learning loop. The models are simultaneously trained with all the 3d dynamic scan data (over60,000diverse human configurations in our new dataset) in order to capture correlations and en-sure consistency of various components. Models support facial expression analysis, as well as body (with detailed hand) shape and pose estimation. We provide fully train-able generic human models of different resolutions – the moderate-resolution GHUM consisting of 10,168 vertices and the low-resolution GHUML(ite) of 3,194 vertices –, run comparisons between them, analyze the impact of different components and illustrate their reconstruction from image data. The models are available for research. View details
    Preview abstract Can we guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie scripts describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a speech to action classifier on 1k movie scripts downloaded from IMSDb and show that such a classifier performs well for certain classes, and when applied to the speech segments of a large \textit{unlabelled} movie corpus (288k videos, 188M speech segments), provides weak labels for over 800k video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single labelled action example. View details
    Preview abstract This paper focuses on multi-person action forecasting in videos. More precisely, given a history of H previous frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DRRN). Evaluation of action prediction on AVA demonstrates the effectiveness of our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task of early action classification on J-HMDB, from the previous SOTA of 48% to 60%. View details
    Preview abstract Current state-of-the-art approaches for spatio-temporal action localization rely on detections at the frame level and model temporal context with 3D ConvNets. Here, we go one step further and model spatio-temporal relations to capture the interactions between human actors, relevant objects and scene elements essential to differentiate similar human actions. Our approach is weakly supervised and mines the relevant elements automatically with an actor-centric relational network (ACRN). ACRN computes and accumulates pair-wise relation information from actor and global scene features, and generates relation features for action classification. It is implemented as neural networks and can be trained jointly with an existing action detection system. We show that ACRN outperforms alternative approaches which capture relation information, and that the proposed framework improves upon the state-of-the-art performance on JHMDB and AVA. A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action. View details
    Preview abstract This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding. View details
    Preview abstract We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localiza- tion on THUMOS’14 detection benchmark and competitive performance on ActivityNet challenge. View details
    Robust Adversarial Reinforcement Learning
    Lerrel Pinto
    James Davidson
    Abhinav Gupta
    ICML (2017)
    Preview abstract Deep neural networks coupled with fast simulation and improved computation have led to recent successes in the field of reinforcement learning (RL). However, most current RL-based approaches fail to generalize since: (a) the gap between simulation and real world is so large that policy-learning approaches fail to transfer; (b) even if policy learning is done in real world, the data scarcity leads to failed generalization from training to test scenarios (e.g., due to different friction or object masses). Inspired from H-infinity control methods, we note that both modeling errors and differences in training and test scenarios can be viewed as extra forces/disturbances in the system. This paper proposes the idea of robust adversarial reinforcement learning (RARL), where we train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. The jointly trained adversary is reinforced -- that is, it learns an optimal destabilization policy. We formulate the policy learning as a zero-sum, minimax objective function. Extensive experiments in multiple environments (InvertedPendulum, HalfCheetah, Swimmer, Hopper and Walker2d) conclusively demonstrate that our method (a) improves training stability; (b) is robust to differences in training/test conditions; and c) outperform the baseline even in the absence of the adversary. View details
    Preview abstract Real-time optimization of traffic flow addresses important practical problems: reducing a driver's wasted time, improving city-wide efficiency, reducing gas emissions and improving air quality. Much of the current research in traffic-light optimization relies on extending the capabilities of traffic lights to either communicate with each other or communicate with vehicles. However, before such capabilities become ubiquitous, opportunities exist to improve traffic lights by being more responsive to current traffic situations within the current, already deployed, infrastructure. In this paper, we introduce a traffic light controller that employs bidding within micro-auctions to efficiently incorporate traffic sensor information; no other outside sources of information are assumed. We train and test traffic light controllers on large-scale data collected from opted-in Android cell-phone users over a period of several months in Mountain View, California and the River North neighborhood of Chicago, Illinois. The learned auction-based controllers surpass (in both the relevant metrics of road-capacity and mean travel time) the currently deployed lights, optimized static-program lights, and longer-term planning approaches, in both cities, measured using real user driving data. View details
    Preview abstract We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision. To that end, we propose a fully differentiable unsupervised deep clustering approach to learn semantic classes in an end-to-end fashion without individual class labeling using only unlabeled object proposals. The key contributions of our work are 1) a kmeans clustering objective where the clusters are learned as parameters of the network and are represented as memory units, and 2) simultaneously building a feature representation, or embedding, while learning to cluster it. This approach shows promising results on two popular computer vision datasets: on CIFAR10 for clustering objects, and on the more complex and challenging Cityscapes dataset for semantically discovering classes which visually correspond to cars, people, and bicycles. Currently, the only supervision provided is segmentation objectness masks, but this method can be extended to use an unsupervised objectness-based object generation mechanism which will make the approach completely unsupervised. View details
    Preview abstract We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person viewpoints and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as 'go to a chair'. View details
    Preview abstract We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-toframe pixel motion in terms of scene and object depth, camera motion and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels and backpropagates. The model can be trained with various degrees of supervision: 1) completely unsupervised, 2) supervised by ego-motion (camera motion), 3) supervised by depth (e.g., as provided by RGBD sensors), 4) supervised by ground-truth optical flow. We show that SfM-Net successfully estimates segmentation of the objects in the scene, even though such supervision is never provided. It extracts meaningful depth estimates or infills depth of RGBD sensors and successfully estimates frame-to-frame camera displacements. SfM-Net achieves state-of-the-art optical flow performance. Our work is inspired by the long history of research in geometry-aware motion estimation, Simultaneous Localization and Mapping (SLAM) and Structure from Motion (SfM). SfM-Net is an important first step towards providing a learning-based approach for such tasks. A major benefit over the existing optimization approaches is that our proposed method can improve itself by processing more videos, and by learning to explicitly model moving objects in dynamic scenes. View details
    Preview abstract Learning a set of diverse and representative features from a large set of unlabeled data has long been an area of active research. We present a method that separates proposals of potential objects into semantic classes in an unsupervised manner. Our preliminary results show that different object categories emerge and can later be retrieved from test images. We propose a differentiable clustering approach which can be integrated with Deep Neural Networks to learn semantic classes in end-to-fashion without manual class labeling. View details
    Preview abstract Decades of research have been directed towards improving the timing of traffic lights. The ubiquity of cell phones among drivers has created the opportunity to design new sensors for traffic light controllers. These new sensors, which search for radio signals that are constantly emanating from cell phones, hold the hope of replacing the typical induction-loop sensors that are installed within road pavements. A replacement to induction sensors is desired as they require significant roadwork to install, frequent maintenance and checkups, are sensitive to proper repairs and installation work, and the construction techniques, materials, and even surrounding unrelated ground work can be sources of failure. However, before cell phone sensors can be widely deployed, users must become comfortable with the passive use of their cell phones by municipalities for this purpose. Despite complete anonymization, public privacy concerns may remain. This presents a chicken-and-egg problem: without showing the benefits of using cell phones for traffic monitoring, users may not be willing to allow this use. In this paper, we show that by carefully training the traffic light controllers, we can unlock the benefits of these sensors when only a small fraction of users allow their cell phones to be used. Surprisingly, even when there is only small percentage of opted-in users, the new traffic controllers provide large benefits to all drivers View details
    Preview abstract Feature selection is essential for effective visual recognition. We propose an efficient joint classifier learning and feature selection method that discovers sparse, compact representations of input features from a vast sea of candidates, with an almost unsupervised formulation. Our method requires only the following knowledge, which we call the feature sign—whether or not a particular feature has on average stronger values over positive samples than over negatives. We show how this can be estimated using as few as a single labeled training sample per class. Then, using these feature signs, we extend an initial supervised learning problem into an (almost) unsupervised clustering formulation that can incorporate new data without requiring ground truth labels. Our method works both as a feature selection mechanism and as a fully competitive classifier. It has important properties, low computational cost and excellent accuracy, especially in difficult cases of very limited training data. We experiment on large-scale recognition in video and show superior speed and performance to established feature selection approaches such as AdaBoost, Lasso, greedy forward-backward selection, and powerful classifiers such as SVM. View details
    Preview abstract We propose a method to discover the physical parts of an articulated object class (e.g. tiger, horse) from multiple videos. Since the individual parts of an object can move independently of one another, we discover them as object regions that consistently move relatively with respect to the rest of the object across videos. We then learn a location model of the parts and segment them accurately in the individual videos using an energy function that also enforces temporal and spatial consistency in the motion of the parts. Traditional methods for motion segmentation or non-rigid structure from motion cannot discover parts unless they display independent motion, since they operate on one video at a time. Our method overcomes this problem by discovering the parts across videos, which allows to discover them in videos where they move to videos where they do not. We evaluate our method on a new dataset of 32 videos of tigers and horses, where we significantly outperform state-of-the art motion segmentation on the task of part discovery (roughly twice the accuracy). View details
    Variable Rate Image Compression with Recurrent Neural Networks
    Sung Jin Hwang
    Damien Vincent
    Michele Covell
    International Conference on Learning Representations (2016)
    Preview abstract A large fraction of Internet traffic is now driven by requests from mobile devices with relatively small screens and often stringent bandwidth requirements. Due to these factors, it has become the norm for modern graphics-heavy websites to transmit low-resolution, low-bytecount image previews (thumbnails) as part of the initial page load process to improve apparent page responsiveness. Increasing thumbnail compression beyond the capabilities of existing codecs is therefore a current research focus, as any byte savings will significantly enhance the experience of mobile device users. Toward this end, we propose a general framework for variable-rate image compression and a novel architecture based on convolutional and deconvolutional LSTM recurrent networks. Our models address the main issues that have prevented autoencoder neural networks from competing with existing image compression algorithms: (1) our networks only need to be trained once (not per-image), regardless of input image dimensions and the desired compression rate; (2) our networks are progressive, meaning that the more bits are sent, the more accurate the image reconstruction; and (3) the proposed architecture is at least as efficient as a standard purpose-trained autoencoder for a given number of bits. On a large-scale benchmark of 32×32 thumbnails, our LSTM-based approaches provide better visual quality than (headerless) JPEG, JPEG2000 and WebP, with a storage size that is reduced by 10% or more. View details
    The Virtues of Peer Pressure: A Simple Method for Discovering High-Value Mistakes
    Michele Covell
    International Conference on Computer Analysis of Images and Patterns (2015)
    Preview abstract Much of the recent success of neural networks can be attributed to the deeper architectures that have become prevalent. However, the deeper architectures often yield unintelligible solutions, require enormous amounts of labeled data, and still remain brittle and easily broken. In this paper, we present a method to efficiently and intuitively discover input instances that are misclassified by well-trained neural networks. As in previous studies, we can identify instances that are so similar to previously seen examples such that the transformation is visually imperceptible. Additionally, unlike in previous studies, we can also generate mistakes that are significantly different from any training sample, while, importantly, still remaining in the space of samples that the network should be able to classify correctly. This is achieved by training a basket of N “peer networks” rather than a single network. These are similarly trained networks that serve to provide consistency pressure on each other. When an example is found for which a single network, S, disagrees with all of the other N − 1 networks, which are consistent in their prediction, that example is a potential mistake for S. We present a simple method to find such examples and demonstrate it on two visual tasks. The examples discovered yield realistic images that clearly illuminate the weaknesses of the trained models, as well as provide a source of numerous, diverse, labeled-training samples. View details
    Preview abstract Decades of research have been directed towards improving the timing of existing traffic lights. In many parts of the world where this research has been conducted, detailed maps of the streets and the precise locations of the traffic lights are publicly available. Continued timing research has recently been further spurred by the increasing ubiquity of personal cell-phone based GPS systems. Through their use, an enormous amount of travel tracks have been amassed — thus providing an easy source of real traffic data. Nonetheless, one fundamental piece of information remains absent that limits the quantification of the benefits of new approaches: the existing traffic light schedules and traffic light response behaviors. Unfortunately, deployed traffic light schedules are often not known. Rarely are they kept in a central database, and even when they are, they are often not easily obtainable. The alternative, manual inspection of a system of multiple traffic lights may be prohibitively expensive and time-consuming for many experimenters. Without the existing light schedules, it is difficult to ascertain the real-improvements that new traffic light algorithms and approaches will have — especially on traffic patterns that have not yet been encountered in the collected data. To alleviate this problem, we present an approach to estimating existing traffic light schedules based on collected GPS-travel tracks. We present numerous ways to test the results and comprehensively demonstrate them on both synthetic and real data. One of the many uses, beyond studying the effects of existing lights in previously unencountered traffic flow environments, is to serve as a realistic baseline for light timing and schedule optimization studies. View details
    Preview abstract We address the problem of fine-grained action localization from temporally untrimmed web videos. We assume that only weak video-level annotations are available for training. The goal is to use these weak labels to identify temporal segments corresponding to the actions, and learn models that generalize to unconstrained web videos. We find that web images queried by action names serve as well-localized highlights for many actions, but are noisily labeled. To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output. This is achieved by cross-domain transfer between video frames and web images, using pre-trained deep convolutional neural networks. We then use the localized action frames to train action recognition models with long short-term memory networks. We collect a fine-grained sports action data set FGA-240 of more than 130,000 YouTube videos. It has 240 fine-grained actions under 85 sports activities. Convincing results are shown on the FGA-240 data set, as well as the THUMOS 2014 localization data set with untrimmed training videos. View details
    Micro-Auction-Based Traffic-Light Control: Responsive, Local Decision Making
    Michele Covell
    International Conference on Intelligent Transportation Systems (2015)
    Preview abstract Real-time, responsive optimization of traffic flow serves to address important practical problems: reducing drivers’ wasted time and improving city-wide efficiency, as well as reducing gas emissions and improving air quality. Much of the current research in traffic-light optimization relies on extending the capabilities of basic traffic lights to either communicate with each other or communicate with vehicles. However, before such capabilities become ubiquitous, opportunities exist to improve traffic lights by being more responsive to current traffic situations within the existing, deployed, infrastructure. In this paper, we use micro-auctions as the organizing principle with which to incorporate local induction loop information; no other outside sources of information are assumed. At every time step in which a phase change is permitted, each light conducts a decentralized, weighted, micro-auction to determine which phase to instantiate next. We test the lights on real-world data collected over a period of several weeks around the Mountain View, California area. In our simulations, the auction mechanisms based only on local sensor data surpass longer-term planning approaches that rely on widely placed sensors and communications. View details
    Video Object Discovery and Co-segmentation with Extremely Weak Supervision
    Le Wang
    Gang Hua
    Jianru Xue
    Nanning Zheng
    Proceedings of European Conference on Computer Vision (2014)
    Preview abstract Video object co-segmentation refers to the problem of simultaneously segmenting a common category of objects from multiple videos. Most existing video co-segmentation methods assume that all frames from all videos contain the target objects. Unfortunately, this assumption is rarely true in practice, particularly for large video sets, and existing methods perform poorly when the assumption is violated. Hence, any practical video object co-segmentation algorithm needs to identify the relevant frames containing the target object from all videos, and then co-segment the object only from these relevant frames. We present a spatiotemporal energy minimization formulation for simultaneous video object discovery and co-segmentation across multiple videos. Our formulation incorporates a spatiotemporal auto-context model, which is combined with appearance modeling for superpixel labeling. The superpixel-level labels are propagated to the frame level through a multiple instance boosting algorithm with spatial reasoning (Spatial-MILBoosting), based on which frames containing the video object are identified. Our method only needs to be bootstrapped with the frame-level labels for a few video frames (e.g., usually 1 to 3) to indicate if they contain the target objects or not. Experiments on three datasets validate the efficacy of our proposed method, which compares favorably with the state-of-the-art. View details
    DaMN – Discriminative and Mutually Nearest: Exploiting Pairwise Category Proximity for Video Action Recognition
    Rui Hou
    Amir Roshan Zamir
    Mubarak Shah
    Proceedings of European Conference on Computer Vision (2014)
    Preview abstract We propose a method for learning discriminative category-level features and demonstrate state-of-the-art results on large-scale action recognition in video. The key observation is that one-vs-rest classifiers, which are ubiquitously employed for this task, face challenges in separating very similar categories (such as running vs. jogging). Our proposed method automatically identifies such pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space, using a category-level similarity matrix where each entry corresponds to the one-vs-one SVM margin for pairs of categories. We then exploit the observation that while splitting such "Siamese Twin" categories may be difficult, separating them from the remaining categories in a two-vs-rest framework is not. This enables us to augment one-vs-rest classifiers with a judicious selection of "two-vs-rest" classifier outputs, formed from such discriminative and mutually nearest (DaMN) pairs. By combining one-vs-rest and two-vs-rest features in a principled probabilistic manner, we achieve state-of-the-art results on the UCF101 and HMDB51 datasets. More importantly, the same DaMN features, when treated as a mid-level representation also outperform existing methods in knowledge transfer experiments, both cross-dataset from UCF101 to HMDB51 and to new categories with limited training data (one-shot and few-shot learning). Finally, we study the generality of the proposed approach by applying DaMN to other classification tasks; our experiments show that DaMN outperforms related approaches in direct comparisons, not only on video action recognition but also on their original image dataset tasks. View details
    Large-scale Video Classification with Convolutional Neural Networks
    Andrej Karpathy
    Sanketh Shetty
    Li Fei-Fei
    Proceedings of International Computer Vision and Pattern Recognition (CVPR 2014), IEEE
    Preview abstract Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multi-resolution, foveated architecture as a promising way of regularizing the learning problem and speeding up training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%). View details
    Recognition of Complex Events: Exploiting Temporal Dynamics between Underlying Concepts
    Subhabrata Bhattacharya
    Mahdi M. Kalayeh
    Mubarak Shah
    Proceedings of International Computer Vision and Pattern Recognition (CVPR 2014), IEEE
    Preview abstract While approaches based on bags of features excel at low-level action classification, they are ill-suited for recognizing complex events in video, where concept-based temporal representations currently dominate. This paper proposes a novel representation that captures the temporal dynamics of windowed mid-level concept detectors in order to improve complex event recognition. We first express each video as an ordered vector time series, where each time step consists of the vector formed from the concatenated confidences of the pre-trained concept detectors. We hypothesize that the dynamics of time series for different instances from the same event class, as captured by simple linear dynamical system (LDS) models, are likely to be similar even if the instances differ in terms of low-level visual features. We propose a two-part representation composed of fusing: (1) a singular value decomposition of block Hankel matrices (SSID-S) and (2) a harmonic signature (H-S) computed from the corresponding eigen-dynamics matrix. The proposed method offers several benefits over alternate approaches: our approach is straightforward to implement, directly employs existing concept detectors and can be plugged into linear classification frameworks. Results on standard datasets such as NIST's TRECVID Multimedia Event Detection task demonstrate the improved accuracy of the proposed method. View details
    Spatiotemporal Deformable Part Models for Action Detection
    Yicong Tian
    Mubarak Shah
    Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR 2013)
    Preview abstract Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions. View details
    Discriminative Segment Annotation in Weakly Labeled Video
    Kevin Tang
    Li Fei-Fei
    Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR 2013)
    Preview abstract paper tackles the problem of segment annotation in complex Internet videos. Given a weakly labeled video, we automatically generate spatiotemporal masks for each of the concepts with which it is labeled. This is a particularly relevant problem in the video domain, as large numbers of YouTube videos are now available, tagged with the visual concepts that they contain. Given such weakly labeled videos, we focus on the problem of spatiotemporal segment classification. We propose a straightforward algorithm, CRANE, that utilizes large amounts of weakly labeled video to rank spatiotemporal segments by the likelihood that they correspond to a given visual concept. We make publicly available segment-level annotations for a subset of the Prest et al. dataset and show convincing results. We also show state-of-the-art results on Hartmann et al.'s more difficult, large-scale object segmentation dataset. View details
    Multi-Armed Recommendation Bandits for Selecting State Machine Policies for Robotic Systems
    Pyry Matikainen
    P. Michael Furlong
    Martial Hebert
    Proceedings of International Conference on Robotics and Automation (ICRA 2013)
    Preview abstract We investigate the problem of selecting a state-machine from a library to control a robot. We are particularly interested in this problem when evaluating such state machines on a particular robotics task is expensive. As a motivating example, we consider a problem where a simulated vacuuming robot must select a driving state machine well-suited for a particular (unknown) room layout. By borrowing concepts from collaborative filtering (recommender systems such as Netflix and Amazon.com), we present a multi-armed bandit formulation that incorporates recommendation techniques to efficiently select state machines for individual room layouts. We show that this formulation outperforms the individual approaches (recommendation, multi-armed bandits) as well as the baseline of selecting the `average best' state machine across all rooms. View details
    D-Nets: Beyond Patch-Based Image Descriptors
    Felix von Hundelshausen
    IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'12) (2012)
    Preview abstract Despite much research on patch-based descriptors, SIFT remains the gold standard for finding correspondences across images and recent descriptors focus primarily on improving speed rather than accuracy. In this paper we propose Descriptor-Nets (D-Nets), a computationally efficient method that significantly improves the accuracy of image matching by going beyond patch-based approaches. D-Nets constructs a network in which nodes correspond to traditional sparsely or densely sampled keypoints, and where image content is sampled from selected edges in this net. Not only is our proposed representation invariant to cropping, translation, scale, reflection and rotation, but it is also significantly more robust to severe perspective and non-linear distortions. We present several variants of our algorithm, including one that tunes itself to the image complexity and an efficient parallelized variant that employs a fixed grid. Comprehensive direct comparisons against SIFT and ORB on standard datasets demonstrate that D-Nets dominates existing approaches in terms of precision and recall while retaining computational efficiency. View details
    Model Recommendation for Action Recognition
    Pyry Matikainen
    Martial Hebert
    IEEE International Conference on Computer Vision and Pattern Recognition (CVPR'12) (2012)
    Preview abstract Simply choosing one model out of a large set of possibilities for a given vision task is a surprisingly difficult problem, especially if there is limited evaluation data with which to distinguish among models, such as when choosing the best ``walk'' action classifier from a large pool of classifiers tuned for different viewing angles, lighting conditions, and background clutter. In this paper we suggest that this problem of selecting a good model can be recast as a recommendation problem, where the goal is to recommend a good model for a particular task based on how well a limited probe set of models appears to perform. Through this conceptual remapping, we can bring to bear all the collaborative filtering techniques developed for consumer recommender systems (e.g., Netflix, Amazon.com). We test this hypothesis on action recognition, and find that even when every model has been directly rated on a training set, recommendation finds better selections for the corresponding test set than the best performers on the training set. View details
    Weakly Supervised Learning of Object Segmentations from Web-Scale Video
    Glenn Hartmann
    Judy Hoffman
    David Tsai
    Omid Madani
    James Rehg
    ECCV'12 Proceedings of the 12th international conference on Computer Vision - Volume Part I, Springer-Verlag, Berlin, Heidelberg (2012), pp. 198-208
    Preview abstract We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graphcuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube. View details
    Efficient Closed-Form Solution to Generalized Boundary Detection
    Marius Leordeanu
    Proceedings of European Conference on Computer Vision (ECCV'12) (2012)
    Preview abstract Boundary detection is essential for a variety of computer vision tasks such as segmentation and recognition. We propose a unified formulation for boundary detection, with closed-form solution, which is applicable to the localization of different types of boundaries, such as intensity edges and occlusion boundaries from video and RGB-D cameras. Our algorithm simultaneously combines low- and mid-level image representations, in a single eigenvalue problem, and we solve over an infinite set of putative boundary orientations. Moreover, our method achieves state of the art results at a significantly lower computational cost than current methods. We also propose a novel method for soft-segmentation that can be used in conjunction with our boundary detection algorithm and improve its accuracy at a negligible extra computational cost. View details
    Unsupervised Learning for Graph Matching
    Marius Leordeanu
    Martial Hebert
    International Journal of Computer Vision, vol. 96 (2012), pp. 28-45
    Preview abstract Graph matching is an essential problem in computer vision that has been successfully applied to 2D and 3D feature matching and object recognition. Despite its importance, little has been published on learning the parameters that control graph matching, even though learning has been shown to be vital for improving the matching rate. In this paper, we show how to perform parameter learning in an unsupervised fashion, that is when no correct correspondences between graphs are given during training. Our experiments reveal that unsupervised learning compares favorably to the supervised case, both in terms of efficiency and quality, while avoiding the tedious manual labeling of ground truth correspondences. We verify experimentally that our learning method can improve the performance of several state-of-the-art matching algorithms. We also show that a similar method can be successfully applied to parameter learning for graphical models and demonstrate its effectiveness empirically. View details
    Feature Seeding for Action Recognition
    Pyry Matikainen
    Martial Hebert
    International Conference on Computer Vision (ICCV) (2011)
    Preview abstract Progress in action recognition has been in large part due to advances in the features that drive learning-based methods. However, the relative sparsity of training data and the risk of overfitting have made it difficult to directly search for good features. In this paper, we suggest using synthetic data to search for robust features that can more easily take advantage of limited data, rather than using the synthetic data directly as a substitute for real data. We demonstrate that the features discovered by our selection method, which we call seeding, improve performance on an action classification task on real data, even though the synthetic data from which our features are seeded differs significantly from the real data, both in terms of appearance and the set of action classes. View details
    Discriminative Cluster Refinement: Improving Object Category Recognition Given Limited Training Data
    Liu Yang
    Rong Jin
    Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2007)
    An Efficient Algorithm for Local Distance Metric Learning
    Liu Yang
    Rong Jin
    Yi Liu
    AAAI (2006)
    Distributed localization of networked cameras
    Stanislav Funiak
    Carlos Guestrin
    Mark Paskin
    IPSN (2006), pp. 34-42
    Dynamic Load Balancing for Distributed Search
    Larry Huston
    Alex Nizhner
    Rahul Sukthankar
    P. Steenkiste
    14th Symposium on High Performance Distributed Computing (HPDC) (2005)
    Network-Aware Partitioning of Computation in Diamond
    Alex Nizhner
    Larry Huston
    Peter Steenkiste
    Carnegie Mellon University (2004)
    Memory-Based Face Recognition for Visitor Identification
    Terence Sim
    Matthew Mullin
    FG (2000), pp. 214-220
    Applying Machine Learning for High-Performance Named-Entity Extraction
    Vibhu O. Mittal
    Computational Intelligence, vol. 16 (2000), pp. 586-596