Susanna Ricco
My research sits at the intersection of computer vision and ML fairness. I lead a team developing techniques to bring more inclusive machine learning systems to Google products and the broader community. I have a Ph.D. in computer vision from Duke University, where my research focused on long-term dense motion estimation in video.
Authored Publications
A Step Toward More Inclusive People Annotations for Fairness
Vittorio Ferrari
Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2021)
Abstract
The Open Images Dataset contains approximately 9 million images and is a widely accepted dataset for computer vision research. As is common practice for large datasets, the annotations are not exhaustive, with bounding boxes and attribute labels for only a subset of the classes in each image. In this paper, we present a new set of annotations on a subset of the Open Images dataset called the "MIAP (More Inclusive Annotations for People)" subset, containing bounding boxes and attributes for all of the people visible in those images. The attributes and labeling methodology for the "MIAP" subset were designed to enable research into model fairness. In addition, we analyze the original annotation methodology for the person class and its subclasses, discussing the resulting patterns in order to inform future annotation efforts. By considering both the original and exhaustive annotation sets, researchers can also now study how systematic patterns in training annotations affect modeling.
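As a rough illustration of the kind of audit such exhaustive person annotations enable, the sketch below groups boxes by an annotated presentation attribute and compares simple per-group statistics. The file name and column names (XMin, XMax, YMin, YMax, AgePresentation) are assumptions for this example, not the dataset's documented schema.

```python
# Illustrative sketch: grouping person boxes by an annotated presentation
# attribute to compare per-group box statistics. File and column names are
# assumed for this example, not taken from the official MIAP schema.
import pandas as pd

def per_group_box_counts(csv_path: str, attribute: str = "AgePresentation") -> pd.DataFrame:
    """Count person boxes per attribute value and report mean box area."""
    boxes = pd.read_csv(csv_path)
    # Box coordinates are assumed to be normalized to [0, 1].
    boxes["area"] = (boxes["XMax"] - boxes["XMin"]) * (boxes["YMax"] - boxes["YMin"])
    summary = boxes.groupby(attribute).agg(
        num_boxes=("area", "size"),
        mean_area=("area", "mean"),
    )
    return summary.sort_values("num_boxes", ascending=False)

if __name__ == "__main__":
    print(per_group_box_counts("miap_boxes_train.csv"))  # hypothetical file name
```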
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Carl Martin Vondrick
Jitendra Malik
CVPR (2018)
Abstract
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
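The "multiple labels per person" property described above means that one localized person at one timestamp can carry several action labels. The sketch below collects such labels from AVA-style annotation rows; the column order (video id, timestamp, box coordinates, action id, person id) is an assumption made for this example rather than an official specification.

```python
# Illustrative sketch: collecting multiple action labels per person from
# AVA-style annotation rows. The column order used here is an assumption
# for the example, not an official file specification.
import csv
from collections import defaultdict

def labels_per_person(csv_path: str):
    """Map (video_id, timestamp, person_id) -> list of action ids."""
    per_person = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            video_id, timestamp, x1, y1, x2, y2, action_id, person_id = row
            per_person[(video_id, float(timestamp), person_id)].append(int(action_id))
    return per_person

if __name__ == "__main__":
    annotations = labels_per_person("ava_train.csv")  # hypothetical file name
    multi = sum(1 for actions in annotations.values() if len(actions) > 1)
    print(f"{multi} person instances carry more than one action label")
```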
Self-Supervised Learning of Structure and Motion from Video
Aikaterini Fragkiadaki
arXiv (2017)
Abstract
We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion, and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels, and back-propagates. The model can be trained with various degrees of supervision: 1) completely unsupervised, 2) supervised by ego-motion (camera motion), 3) supervised by depth (e.g., as provided by RGBD sensors), or 4) supervised by ground-truth optical flow. We show that SfM-Net successfully estimates segmentation of the objects in the scene, even though such supervision is never provided. It extracts meaningful depth estimates or infills missing depth from RGBD sensors, and successfully estimates frame-to-frame camera displacements. SfM-Net achieves state-of-the-art optical flow performance. Our work is inspired by the long history of research in geometry-aware motion estimation, Simultaneous Localization and Mapping (SLAM), and Structure from Motion (SfM). SfM-Net is an important first step towards providing a learning-based approach for such tasks. A major benefit over existing optimization approaches is that our method can improve itself by processing more videos, and by learning to explicitly model moving objects in dynamic scenes.
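The core geometric computation the abstract describes, turning per-pixel depth and a rigid camera motion into a dense frame-to-frame flow field and scoring it with a photometric warp, can be sketched with plain NumPy. The pinhole projection, single rigid motion, and nearest-neighbour warping below are simplifying assumptions for illustration, not the SfM-Net architecture itself.

```python
# Simplified illustration: back-project pixels with predicted depth, apply a
# rigid camera motion, re-project to obtain dense frame-to-frame flow, then
# warp the second frame back and measure photometric error.
import numpy as np

def rigid_flow(depth, K, R, t):
    """Dense flow induced by depth (H, W), intrinsics K (3, 3), rotation R, translation t."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix                                       # back-projected rays
    points = rays * depth.reshape(1, -1)                                # 3D points in frame-1 camera
    moved = R @ points + t.reshape(3, 1)                                # apply camera motion
    proj = K @ moved
    proj = proj[:2] / proj[2:3]                                         # perspective divide
    return (proj - pix[:2]).T.reshape(H, W, 2)                          # (du, dv) per pixel

def photometric_loss(frame1, frame2, flow):
    """Mean absolute error after warping frame2 back to frame1 with the flow."""
    H, W = frame1.shape
    v, u = np.mgrid[0:H, 0:W]
    u2 = np.clip(np.round(u + flow[..., 0]).astype(int), 0, W - 1)
    v2 = np.clip(np.round(v + flow[..., 1]).astype(int), 0, H - 1)
    return np.abs(frame1 - frame2[v2, u2]).mean()

if __name__ == "__main__":
    H, W = 48, 64
    K = np.array([[50.0, 0, W / 2], [0, 50.0, H / 2], [0, 0, 1.0]])
    depth = np.full((H, W), 5.0)                    # flat scene 5 units away
    R, t = np.eye(3), np.array([0.1, 0.0, 0.0])     # small sideways camera shift
    flow = rigid_flow(depth, K, R, t)
    frame1 = np.random.rand(H, W)
    print("mean |flow|:", np.abs(flow).mean())
    print("zero-motion self-consistency loss:", photometric_loss(frame1, frame1, np.zeros((H, W, 2))))
```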
Abstract
We propose a method to discover the physical parts of an articulated object class (e.g. tiger, horse) from multiple videos. Since the individual parts of an object can move independently of one another, we discover them as object regions that consistently move relative to the rest of the object across videos. We then learn a location model of the parts and segment them accurately in the individual videos using an energy function that also enforces temporal and spatial consistency in the motion of the parts. Traditional methods for motion segmentation or non-rigid structure from motion operate on one video at a time and therefore cannot discover parts unless those parts display independent motion in that video. Our method overcomes this problem by discovering the parts across videos, which allows knowledge to be transferred from videos where the parts move to videos where they do not.
We evaluate our method on a new dataset of 32 videos of tigers and horses, where we significantly outperform state-of-the-art motion segmentation on the task of part discovery (roughly twice the accuracy).
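As a toy sketch of the relative-motion intuition above, the example below scores candidate regions by how strongly their optical flow deviates from the object's dominant motion. The median-flow baseline and hand-made region masks are assumptions for illustration only; they stand in for, rather than reproduce, the paper's energy-based formulation.

```python
# Toy illustration: regions whose flow deviates from the object's dominant
# (median) motion score highly as part candidates. Not the paper's method.
import numpy as np

def relative_motion_scores(flow, region_masks, object_mask):
    """flow: (H, W, 2) dense flow; masks: boolean arrays of shape (H, W)."""
    object_motion = np.median(flow[object_mask], axis=0)          # dominant object motion
    scores = []
    for mask in region_masks:
        region_motion = flow[mask].mean(axis=0)
        scores.append(np.linalg.norm(region_motion - object_motion))
    return np.array(scores)                                       # high score = part-like region

if __name__ == "__main__":
    H, W = 40, 60
    flow = np.zeros((H, W, 2))
    flow[..., 0] = 1.0                       # whole object drifts right by 1 px
    flow[5:15, 5:20, 1] = 3.0                # one region also moves down: a limb-like candidate
    object_mask = np.ones((H, W), dtype=bool)
    regions = [np.zeros((H, W), dtype=bool) for _ in range(2)]
    regions[0][5:15, 5:20] = True            # the independently moving region
    regions[1][25:35, 30:50] = True          # a region that follows the body
    print(relative_motion_scores(flow, regions, object_mask))     # first score >> second
```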