Caroline Pantofaru

Authored Publications
    FILM: Frame Interpolation for Large Motion
    Fitsum Reda
    Eric Tabellion
    Proceedings of the European conference on computer vision (ECCV) (2022)
    We present a frame interpolation algorithm that synthesizes an engaging slow-motion video from near-duplicate photos which often exhibit large scene motion. Near-duplicates interpolation is an interesting new application, but large motion poses challenges to existing methods. To address this issue, we adapt a feature extractor that shares weights across the scales, and present a “scale-agnostic” motion estimator. It relies on the intuition that large motion at finer scales should be similar to small motion at coarser scales, which boosts the number of available pixels for large motion supervision. To inpaint wide disocclusions caused by large motion and synthesize crisp frames, we propose to optimize our network with the Gram matrix loss that measures the correlation difference between features. To simplify the training process, we further propose a unified single-network approach that removes the reliance on additional optical-flow or depth networks and is trainable from frame triplets alone. Our approach outperforms state-of-the-art methods on the Xiph large motion benchmark while performing favorably on Vimeo90K, Middlebury and UCF101. Source code and pre-trained models are available at https://film-net.github.io.
    Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation
    Kyle Genova
    Xiaoqi Yin
    Leonidas Guibas
    Frank Dellaert
    Conference on Computer Vision and Pattern Recognition (2022)
    We present Panoptic Neural Fields (PNF), an object-aware neural scene representation that decomposes a scene into a set of objects (things) and background (stuff). Each object is represented by an oriented 3D bounding box and a multi-layer perceptron (MLP) that takes position, direction, and time and outputs density and radiance. The background stuff is represented by a similar MLP that additionally outputs semantic labels. The object MLPs are instance-specific and thus can be smaller and faster than previous object-aware approaches, while still leveraging category-specific priors incorporated via meta-learned initialization. Our model builds a panoptic radiance field representation of any scene from just color images. We use off-the-shelf algorithms to predict camera poses, object tracks, and 2D image semantic segmentations. Then we jointly optimize the MLP weights and bounding box parameters using analysis-by-synthesis with self-supervision from color images and pseudo-supervision from predicted semantic segmentations. During experiments with real-world dynamic scenes, we find that our model can be used effectively for several tasks like novel view synthesis, 2D panoptic segmentation, 3D scene editing, and multiview depth prediction.
    The Open Images Dataset contains approximately 9 million images and is a widely accepted dataset for computer vision research. As is common practice for large datasets, the annotations are not exhaustive, with bounding boxes and attribute labels for only a subset of the classes in each image. In this paper, we present a new set of annotations on a subset of the Open Images dataset called the “MIAP (More Inclusive Annotations for People)” subset, containing bounding boxes and attributes for all of the people visible in those images. The attributes and labeling methodology for the “MIAP” subset were designed to enable research into model fairness. In addition, we analyze the original annotation methodology for the person class and its subclasses, discussing the resulting patterns in order to inform future annotation efforts. By considering both the original and exhaustive annotation sets, researchers can also now study how systematic patterns in training annotations affect modeling.
    Detecting objects in 3D LiDAR data is a core technology for autonomous driving and other robotics applications. Although LiDAR data is acquired over time, most of the 3D object detection algorithms propose object bounding boxes independently for each frame and neglect the useful information available in the temporal domain. To address this problem, in this paper we propose a sparse LSTM-based multi-frame 3D object detection algorithm. We use a U-Net style 3D sparse convolution network to extract features for each frame's LiDAR point cloud. These features are fed to the LSTM module together with the hidden and memory features from the last frame to predict the 3D objects in the current frame as well as hidden and memory features that are passed to the next frame. Experiments on the Waymo Open Dataset show that our algorithm outperforms the traditional frame-by-frame approach by 7.5% mAP@0.7 and other multi-frame approaches by 1.2% while using less memory and computation per frame. To the best of our knowledge, this is the first work to use an LSTM for 3D object detection in sparse point clouds.
    Semantic segmentation of 3D meshes is an important problem for 3D scene understanding. In this paper we revisit the classic multiview representation of 3D meshes and study several techniques that make them effective for 3D semantic segmentation of meshes. Given a 3D mesh reconstructed from RGBD sensors, our method effectively chooses different virtual views of the 3D mesh and renders multiple 2D channels for training an effective 2D semantic segmentation model. Features from multiple per-view predictions are finally fused on 3D mesh vertices to predict mesh semantic segmentation labels. Using the large-scale indoor 3D semantic segmentation benchmark of ScanNet, we show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches. When the 2D per-pixel predictions are aggregated on 3D surfaces, our virtual multiview fusion method is able to achieve significantly better 3D semantic segmentation results compared to all prior multiview approaches and competitive with recent 3D convolution approaches.
    We present a simple and flexible object detection framework optimized for autonomous driving. Building on the observation that point clouds in this application are extremely sparse, we propose a practical pillar-based approach to fix the imbalance issue caused by anchors. In particular, our algorithm incorporates a cylindrical projection into multi-view feature learning, predicts bounding box parameters per pillar rather than per point or per anchor, and includes an aligned pillar-to-point projection module to improve the final prediction. Our anchor-free approach avoids hyperparameter search associated with past methods, simplifying 3D object detection while significantly improving upon the state of the art.
    Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
    We propose DOPS, a fast single-stage 3D object detection method for LIDAR data. Previous methods often make domain-specific design decisions, for example projecting points into a bird's-eye view image in autonomous driving scenarios. In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes. The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes. 3D bounding box parameters are estimated in one pass for every point, aggregated through graph convolutions, and fed into a branch of the network that predicts latent codes representing the shape of each detected object. The latent shape space and shape decoder are learned on a synthetic dataset and then used as supervision for the end-to-end training of the 3D object detection pipeline. Thus our model is able to extract shapes without access to ground-truth shape information in the target dataset. During experiments, we find that our proposed method achieves state-of-the-art results by ∼5% on object detection in ScanNet scenes, and it gets top results by 3.4% in the Waymo Open Dataset, while reproducing the shapes of detected cars.
    This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
    Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings and with multiple variations tailored toward applications. Unfortunately, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches in similar settings and to understand their strengths and weaknesses. In this paper, we describe a new dataset of densely labeled speech activity in YouTube video clips, which has been designed to address these issues and will be released publicly. The dataset labels go beyond speech alone, annotating three specific speech activity situations: clean speech, speech and music co-occurring, and speech and noise co-occurring. These classes will enable further analysis of model performance in the presence of noise. We report benchmark performance numbers on this dataset using state-of-the-art audio and vision models.
    We present a system that associates faces with voices in a video by fusing information from the audio and visual signals. The thesis underlying our work is that an extremely simple approach to generating (weak) speech clusters can be combined with strong visual signals to effectively associate faces and voices by aggregating statistics across a video. This approach does not need any training data specific to this task and leverages the natural coherence of information in the audio and visual streams. It is particularly applicable to tracking speakers in videos on the web where a priori information about the environment (e.g., number of speakers, spatial signals for beamforming) is not available.
    The massive growth of sports videos has resulted in a need for automatic generation of sports highlights that are comparable in quality to the hand-edited highlights produced by broadcasters such as ESPN. Unlike previous works that mostly use audio-visual cues derived from the video, we propose an approach that additionally leverages contextual cues derived from the environment that the game is being played in. The contextual cues provide information about the excitement levels in the game, which can be ranked and selected to automatically produce high-quality basketball highlights. We introduce a new dataset of 25 NCAA games along with their play-by-play stats and the ground-truth excitement data for each basket. We explore the informativeness of five different cues derived from the video and from the environment through user studies. Our experiments show that for our study participants, the highlights produced by our system are comparable to the ones produced by ESPN for the same games.
    We present a technique that uses images, videos and sensor data taken from first-person point-of-view devices to perform egocentric field-of-view (FOV) localization. We define egocentric FOV localization as capturing the visual information from a person’s field-of-view in a given environment and transferring this information onto a reference corpus of images and videos of the same space, hence determining what a person is attending to. Our method matches images and video taken from the first-person perspective with the reference corpus and refines the results using the first-person’s head orientation information obtained using the device sensors. We demonstrate single and multi-user egocentric FOV localization in different indoor and outdoor environments with applications in augmented reality, event understanding and studying social interactions.
    We present a method for learning an embedding that places images of humans in similar poses nearby. This embedding can be used as a direct method of comparing images based on human pose, avoiding potential challenges of estimating body joint positions. Pose embedding learning is formulated under a triplet-based distance criterion. A deep architecture is used to allow learning of a representation capable of making distinctions between different poses. Experiments on human pose matching and retrieval from video data demonstrate the potential of the method.
    Indoor Scene Understanding with Geometric and Semantic Contexts
    Wongun Choi
    Yu-Wei Chao
    Silvio Savarese
    International Journal of Computer Vision (IJCV) (2014)
    Truly understanding a scene involves integrating information at multiple levels as well as studying the interactions between scene elements. Individual object detectors, layout estimators and scene classifiers are powerful but ultimately confounded by complicated real-world scenes with high variability, different viewpoints and occlusions. We propose a method that can automatically learn the interactions among scene elements and apply them to the holistic understanding of indoor scenes from a single image. This interpretation is performed within a hierarchical interaction model which describes an image by a parse graph, thereby fusing together object detection, layout estimation and scene classification. At the root of the parse graph is the scene type and layout while the leaves are the individual detections of objects. In between is the core of the system, our 3D Geometric Phrases (3DGP). We conduct extensive experimental evaluations on single image 3D scene understanding using both 2D and 3D metrics. The results demonstrate that our model with 3DGPs can provide robust estimation of scene type, 3D space, and 3D objects by leveraging the contextual relationships among the visual elements.
    Discovering Groups of People in Images
    Wongun Choi
    Yu-Wei Chao
    Silvio Savarese
    European Conference on Computer Vision (ECCV) (2014)
    Understanding group activities from images is an important yet challenging task. This is because there is an exponentially large number of semantic and geometrical relationships among individuals that one must model in order to effectively recognize and localize the group activities. Rather than focusing on directly recognizing group activities as most of the previous works do, we advocate the importance of introducing an intermediate representation for modeling groups of humans which we call structured groups. Such groups define the way people spatially interact with each other. People might be facing each other to talk, while others sit on a bench side by side, and some might stand alone. In this paper we contribute a method for identifying and localizing these structured groups in a single image despite their varying viewpoints, number of participants, and occlusions. We propose to learn an ensemble of discriminative interaction patterns to encode the relationships between people in 3D and introduce a novel efficient iterative augmentation algorithm for solving this complex inference problem. A nice byproduct of the inference scheme is an approximate 3D layout estimate of the structured groups in the scene. Finally, we contribute an extremely challenging new dataset that contains images each showing multiple people performing multiple activities. Extensive evaluation confirms our theoretical findings.
    Temporal Synchronization of Multiple Audio Signals
    Sasi Inguva
    Andy Crawford
    Hugh Denman
    Anil Kokaram
    Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy (2014)
    Given the proliferation of consumer media recording devices, events often give rise to a large number of recordings. These recordings are taken from different spatial positions and do not have reliable timestamp information. In this paper, we present two robust graph-based approaches for synchronizing multiple audio signals. The graphs are constructed atop the over-determined system resulting from pairwise signal comparison using cross-correlation of audio features. The first approach uses a Minimum Spanning Tree (MST) technique, while the second uses Belief Propagation (BP) to solve the system. Both approaches can provide excellent solutions and robustness to pairwise outliers; however, the MST approach is much less complex than BP. In addition, an experimental comparison of audio features-based synchronization shows that spectral flatness outperforms the zero-crossing rate and signal energy.
    Understanding Indoor Scenes using 3D Geometric Phrases
    Wongun Choi
    Yu-Wei Chao
    Silvio Savarese
    Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR 2013)
    Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
    A Discriminative Model for Learning Semantic and Geometric Interactions in Indoor Scenes
    Wongun Choi
    Yu-Wei Chao
    Silvio Savarese
    Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Scene Understanding Workshop (SUNw) (2013)
    Robots for Humanity: Using Assistive Robotics to Empower People with Disabilities
    Tiffany Chen
    Matei Ciocarlie
    Steve Cousins
    Phillip Grice
    Kelsey Hawkins
    Kaijen Hsiao
    Charlie Kemp
    C.-H. King
    Dan Lazewatsky
    Adam Leeper
    Hai Nguyen
    Andreas Paepcke
    William D. Smart
    Leila Takayama
    IEEE Robotics & Automation Magazine, Special issue on Assistive Robotics (2013)
    Layout Estimation of Highly Cluttered Indoor Scenes using Geometric and Semantic Cues
    Yu-Wei Chao
    Wongun Choi
    Silvio Savarese
    Proc. of the International Conference on Image Analysis and Processing (ICIAP) (2013)
    An adaptable system for RGB-D based human body detection and pose estimation
    Koen Buys
    Cedric Cagniart
    Anatoly Baksheev
    Tinne De Laet
    Joris De Schutter
    Journal of Visual Communication and Image Representation (2013)
    Programming Robots at the Museum
    Austin Hendrix
    Andreas Paepcke
    Dirk Thomas
    Sharon Marzouk
    Sarah Elliott
    Proc. of the International Conference on Interaction Design and Children (2013)
    Robots for Humanity: User-Centered Design for Assistive Mobile Manipulation
    Tiffany Chen
    Matei Ciocarlie
    Steve Cousins
    Phillip Grice
    Kelsey Hawkins
    Kaijen Hsiao
    Charlie Kemp
    C.-H. King
    Dan Lazewatsky
    Adam Leeper
    Hai Nguyen
    Andreas Paepcke
    William D. Smart
    Leila Takayama
    Video Proc. of Intelligent Robots and Systems (IROS) (2012)
    A General Framework for Tracking Multiple People from a Moving Camera
    Wongun Choi
    Silvio Savarese
    IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2012)
    Exploring the Role of Robots in Home Organization
    Leila Takayama
    Tully Foote
    Bianca Soto
    Proc. of Human-Robot Interaction (HRI) (2012)
    Making technology homey: Finding sources of satisfaction and meaning in home automation
    Leila Takayama
    David J. Robson
    Bianca Soto
    Michael Barry
    Proc. of Ubiquitous Computing (UbiComp) (2012)
    User Observation & Dataset Collection for Robot Training
    Proc. of Human-Robot Interaction (HRI) (2011)
    Need Finding: A Tool for Directing Robotics Research and Development
    Leila Takayama
    The Workshop on Human-Robot Interaction, at the Robotics: Science and Systems (RSS) Conference (2011)
    Using Depth Information to Improve Face Detection
    Walker Burgin
    William D. Smart
    Proc. of Human-Robot Interaction (HRI) (2011)
    Detecting and Tracking People using an RGB-D Camera via Multiple Detector Fusion
    Wongun Choi
    Silvio Savarese
    Workshop on Challenges and Opportunities in Robot Perception, at the International Conference on Computer Vision (ICCV) (2011)
    A Side of Data with My Robot: Three Datasets for Mobile Manipulation in Human Environments
    Matei Ciocarlie
    Kaijen Hsiao
    Gary Bradski
    Peter Brook
    Ethan Dreyfuss
    IEEE Robotics & Automation Magazine, Special Issue: Towards a WWW for Robots (2011)
    Towards Autonomous Robotic Butlers: Lessons Learned with the PR2
    Jonathan Bohren
    Radu B. Rusu
    E. Gil Jones
    Eitan Marder-Eppstein
    Melonee Wise
    Lorenz Mosenlechner
    Wim Meeussen
    Stefan Holzer
    International Conference on Robotics and Automation (ICRA) (2011)
    Help Me Help You: Interfaces for Personal Robots
    Ian Goodfellow
    Nate Koenig
    Marius Muja
    Alex Sorokin
    Leila Takayama
    Proc. of Human Robot Interaction (HRI) (2010)
    Influences on proxemic behaviors in human-robot interaction
    Leila Takayama
    IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2009)
    Object recognition by integrating multiple image segmentations
    Cordelia Schmid
    Martial Hebert
    European Conference on Computer Vision (ECCV) (2008)
    Studies in Using Image Segmentation to Improve Object Recognition
    Ph.D. Thesis, The Robotics Institute, Carnegie Mellon University (2008)
    A framework for learning to recognize and segment object classes using weakly supervised training data
    Martial Hebert
    Proc. of the British Machine Vision Conference (BMVC) (2007)
    Discriminative Cluster Refinement: Improving Object Category Recognition Given Limited Training Data
    Liu Yang
    Rong Jin
    Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2007)
    Toward Objective Evaluation of Image Segmentation Algorithms
    Ranjith Unnikrishnan
    Martial Hebert
    IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29 (2007), pp. 929-944
    Combining Regions and Patches for Object Class Localization
    Gyuri Dorko
    Cordelia Schmid
    Martial Hebert
    Proc. of the Beyond Patches workshop (BP) in conjunction with the IEEE conference on Computer Vision and Pattern Recognition (CVPR) (2006)
    A measure for objective evaluation of image segmentation algorithms
    Ranjith Unnikrishnan
    Martial Hebert
    Computer Vision and Pattern Recognition - Workshops (2005)
    A comparison of image segmentation algorithms
    Martial Hebert
    The Robotics Institute, Carnegie Mellon University (2005)
    Method and apparatus for implementing soft constraints in tools used for designing systems on programmable logic devices
    Terry P. Borer
    Gabriel Quan
    Steven Brown
    Deshanand P. Singh
    Chris Sanford
    Vaughn Betz
    Jordan Swartz
    Patent (2003)
    Toward Generating Labeled Maps from Color and Range Data for Robot Navigation
    Ranjith Unnikrishnan
    Martial Hebert
    Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2003)
    Method and apparatus for utilizing constraints for the routing of a design on a programmable logic device
    Vaughn Betz
    Jordan Swartz
    Patent (2003)
    Fast Hand Gesture Recognition for Real-Time Teleconferencing Applications
    James Maclean
    Rainer Herpers
    Laura Wood
    Kostas Derpanis
    Doug Topalovic
    John K. Tsotsos
    Proc. of the workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems (RATFG-RTS) in conjunction with the IEEE International Conference on Computer Vision (ICCV) (2001)
    Active Visual Control by Stereo Active Vision Interface (SAVI)
    Rainer Herpers
    Kostas Derpanis
    Doug Topalovic
    James Maclean
    Gil Verghese
    Allan Jepson
    John K. Tsotsos
    Proc. in Artificial Intelligence, GI Workshop on Dynamische Perzeption (2000)