
Susanna Ricco
My research sits at the intersection of computer vision and ML fairness. I lead a team developing techniques to bring more inclusive machine learning systems to Google products and the broader community. I have a Ph.D. in computer vision from Duke University, where my research focused on long-term dense motion estimation in video.
Authored Publications
Which Skin Tone Measures are the Most Inclusive? An Investigation of Skin Tone Measures for Machine Learning
Ellis Monk
X Eyee
ACM Journal of Responsible Computing (2024), to appear
Abstract
Skin tone plays a critical role in artificial intelligence (AI), especially in biometrics, human sensing, computer vision, and fairness evaluations. However, many algorithms have exhibited unfair bias against people with darker skin tones, leading to misclassifications, poor user experiences, and exclusions in daily life. One reason this occurs is a poor understanding of how well the scales we use to measure and account for skin tone in AI actually represent the variation of skin tones in people affected by these systems. Although the Fitzpatrick scale has become the industry standard for skin tone evaluation in machine learning, its documented bias towards lighter skin tones suggests that other skin tone measures are worth investigating. To address this, we conducted a survey with 2,214 people in the United States to compare three skin tone scales: the Fitzpatrick 6-point scale, Rihanna’s Fenty™ Beauty 40-point skin tone palette, and a newly developed Monk 10-point scale from the social sciences. We find the Fitzpatrick scale is perceived to be less inclusive than the Fenty and Monk skin tone scales, and this is especially true for people from historically marginalized communities (i.e., people with darker skin tones, BIPOCs, and women). We also find no statistically meaningful differences in perceived representation between the Monk skin tone scale and the Fenty Beauty palette. Through this rigorous testing and validation of skin tone measurement, we discuss the ways in which our findings can advance the understanding of skin tone in both the social science and machine learning communities.
Consensus and Subjectivity of Skin Tone Annotation for ML Fairness
Ellis Monk
Femi Olanubi
Auriel Wright
(2023), to appear
Abstract
Understanding different human attributes and how they affect model behavior may become a standard need for all model creation and usage, from traditional computer vision tasks to the newest multimodal generative AI systems. In computer vision specifically, we have relied on datasets augmented with perceived attribute signals (e.g., gender presentation, skin tone, and age) and benchmarks enabled by these datasets. Typically, labels for these tasks come from human annotators. However, annotating attribute signals, especially skin tone, is a difficult and subjective task. Perceived skin tone is affected by technical factors, like lighting conditions, and by social factors that shape an annotator's lived experience. This paper examines the subjectivity of skin tone annotation through a series of annotation experiments using the Monk Skin Tone (MST) scale, a small pool of professional photographers, and a much larger pool of trained crowdsourced annotators. Alongside this study, we release the Monk Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread across the full MST scale. MST-E is designed to help train human annotators to annotate MST effectively. Our study shows that annotators can reliably annotate skin tone in a way that aligns with an expert in the MST scale, even under challenging environmental conditions. We also find evidence that annotators from different geographic regions rely on different mental models of MST categories, resulting in annotations that systematically vary across regions. Given this, we advise practitioners to use a diverse set of annotators and a higher replication count for each image when annotating skin tone for fairness research.
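As a rough illustration of the replication advice above, here is a minimal sketch of aggregating replicated MST annotations per image. The record layout, the median aggregation rule, the replication threshold, and the region-diversity check are illustrative assumptions for this sketch, not the paper's protocol.

```python
from collections import defaultdict
from statistics import median

# Each record: (image_id, annotator_region, mst_value) with MST values in 1..10.
# Field names and values are made up for illustration.
annotations = [
    ("img_001", "region_a", 4),
    ("img_001", "region_b", 5),
    ("img_001", "region_a", 4),
    ("img_001", "region_c", 6),
    ("img_002", "region_b", 9),
    ("img_002", "region_c", 8),
]

def aggregate_mst(records, min_replication=3):
    """Aggregate replicated MST annotations per image.

    Returns, for each image with at least `min_replication` annotations,
    the median MST value plus how many annotations and how many distinct
    annotator regions contributed (a rough diversity check).
    """
    by_image = defaultdict(list)
    regions = defaultdict(set)
    for image_id, region, mst in records:
        by_image[image_id].append(mst)
        regions[image_id].add(region)

    results = {}
    for image_id, values in by_image.items():
        if len(values) < min_replication:
            continue  # not enough replication to trust the aggregate label
        results[image_id] = {
            "mst_median": median(values),
            "n_annotations": len(values),
            "n_regions": len(regions[image_id]),
        }
    return results

print(aggregate_mst(annotations))
```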
A Step Toward More Inclusive People Annotations for Fairness
Vittorio Ferrari
Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2021)
Abstract
The Open Images Dataset contains approximately 9 million images and is a widely accepted dataset for computer vision research. As is common practice for large datasets, the annotations are not exhaustive, with bounding boxes and attribute labels for only a subset of the classes in each image. In this paper, we present a new set of annotations on a subset of the Open Images dataset called the "MIAP (More Inclusive Annotations for People)" subset, containing bounding boxes and attributes for all of the people visible in those images. The attributes and labeling methodology for the "MIAP" subset were designed to enable research into model fairness. In addition, we analyze the original annotation methodology for the person class and its subclasses, discussing the resulting patterns in order to inform future annotation efforts. By considering both the original and exhaustive annotation sets, researchers can also now study how systematic patterns in training annotations affect modeling.
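To give a sense of how box-plus-attribute annotations like these are typically consumed, here is a hedged sketch that filters MIAP-style annotations with pandas. The file name and column names are assumptions modeled on Open Images CSV conventions, not a guaranteed match to the released files; check them against the dataset documentation.

```python
import pandas as pd

# Hypothetical file name; the MIAP annotations are distributed as CSVs
# alongside Open Images, but the exact file and column names should be
# verified against the official documentation.
boxes = pd.read_csv("open_images_extended_miap_boxes_train.csv")

# Assumed columns: ImageID, XMin, XMax, YMin, YMax (normalized coordinates),
# plus perceived-attribute columns such as AgePresentation.
# Keep only boxes carrying an age-presentation label, then count boxes per
# image to see how densely people are annotated.
labeled = boxes[boxes["AgePresentation"].notna()]
per_image = labeled.groupby("ImageID").size().rename("num_person_boxes")

print(per_image.describe())
```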
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Carl Martin Vondrick
Jitendra Malik
CVPR (2018)
Abstract
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon current state-of-the-art methods and demonstrates better performance on the JHMDB and UCF101-24 categories. While our approach sets a new state of the art on existing datasets, its overall performance on AVA is low, at 15.6% mAP, underscoring the need for new approaches to video understanding.
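For readers who want to work with the released labels, the sketch below groups AVA-style CSV rows by (video, timestamp, person) so that the multiple action labels per person mentioned above land in a single record. The assumed column order reflects my understanding of the public AVA CSVs and should be verified against the official documentation before use.

```python
import csv
from collections import defaultdict

def load_ava_labels(path):
    """Group AVA-style action labels by (video, timestamp, person).

    Assumed row layout: video_id, middle_frame_timestamp,
    x1, y1, x2, y2 (box corners normalized to [0, 1]), action_id, person_id.
    """
    labels = defaultdict(lambda: {"box": None, "actions": set()})
    with open(path, newline="") as f:
        for row in csv.reader(f):
            video_id, timestamp, x1, y1, x2, y2, action_id, person_id = row
            key = (video_id, float(timestamp), int(person_id))
            labels[key]["box"] = (float(x1), float(y1), float(x2), float(y2))
            labels[key]["actions"].add(int(action_id))
    return labels

# Example usage (file name assumed): count person instances with multiple actions.
# labels = load_ava_labels("ava_train_v2.2.csv")
# multi = sum(1 for rec in labels.values() if len(rec["actions"]) > 1)
```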
Self-Supervised Learning of Structure and Motion from Video
Aikaterini Fragkiadaki
arXiv (2017)
Abstract
We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion, and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels, and back-propagates. The model can be trained with various degrees of supervision: 1) completely unsupervised, 2) supervised by ego-motion (camera motion), 3) supervised by depth (e.g., as provided by RGBD sensors), 4) supervised by ground-truth optical flow. We show that SfM-Net successfully estimates segmentation of the objects in the scene, even though such supervision is never provided. It extracts meaningful depth estimates or infills depth of RGBD sensors and successfully estimates frame-to-frame camera displacements. SfM-Net achieves state-of-the-art optical flow performance. Our work is inspired by the long history of research in geometry-aware motion estimation, Simultaneous Localization and Mapping (SLAM), and Structure from Motion (SfM). SfM-Net is an important first step towards providing a learning-based approach for such tasks. A major benefit over existing optimization approaches is that our proposed method can improve itself by processing more videos and by learning to explicitly model moving objects in dynamic scenes.
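To make the geometric core of this decomposition concrete, here is a simplified sketch of converting predicted depth and camera motion into a dense frame-to-frame flow field by back-projection and reprojection. It covers only the static-scene, camera-motion case with an assumed pinhole intrinsics matrix; the full SfM-Net additionally composes per-object rigid motions via predicted masks and performs the warping differentiably inside the network.

```python
import numpy as np

def rigid_flow_from_depth(depth, K, R, t):
    """Compute the dense optical flow induced by camera motion over a static scene.

    depth: (H, W) depth map for frame 1.
    K:     (3, 3) pinhole camera intrinsics.
    R, t:  rotation (3, 3) and translation (3,) taking frame-1 camera
           coordinates into frame-2 camera coordinates.
    Returns flow of shape (H, W, 2) with (dx, dy) per pixel.
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pixels = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project pixels to 3D using the predicted depth.
    cam_points = np.linalg.inv(K) @ pixels * depth.reshape(1, -1)

    # Apply the rigid camera motion and project back into frame 2.
    moved = R @ cam_points + t.reshape(3, 1)
    projected = K @ moved
    projected = projected[:2] / np.clip(projected[2:], 1e-6, None)

    flow = (projected - pixels[:2]).T.reshape(H, W, 2)
    return flow

# Tiny usage example with made-up values: a small forward camera translation.
depth = np.full((4, 4), 2.0)
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
flow = rigid_flow_from_depth(depth, K, np.eye(3), np.array([0.0, 0.0, 0.5]))
```

In the full model, per-pixel object masks would blend additional rigid object motions into the moved points before projection, which is what lets the network explain independently moving objects rather than only camera motion.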
Abstract
We propose a method to discover the physical parts of an articulated object class (e.g. tiger, horse) from multiple videos. Since the individual parts of an object can move independently of one another, we discover them as object regions that consistently move relative to the rest of the object across videos. We then learn a location model of the parts and segment them accurately in the individual videos using an energy function that also enforces temporal and spatial consistency in the motion of the parts. Traditional methods for motion segmentation or non-rigid structure from motion cannot discover parts unless they display independent motion, since they operate on one video at a time. Our method overcomes this problem by discovering the parts across videos, which allows us to carry parts found in videos where they move over to videos where they do not.
We evaluate our method on a new dataset of 32 videos of tigers and horses, where we significantly outperform state-of-the-art motion segmentation on the task of part discovery (roughly twice the accuracy).
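As a toy illustration of the idea of regions that consistently move relative to the rest of the object across videos, the sketch below scores candidate regions by their residual motion after subtracting the object's own motion, then pools those scores across videos. The track representation, threshold, and voting rule are simplifications invented for this sketch; they are not the energy function described above.

```python
import numpy as np

def relative_motion_scores(region_tracks, object_track):
    """Score how much each candidate region moves relative to the object.

    region_tracks: (num_regions, num_frames, 2) centroid positions per region.
    object_track:  (num_frames, 2) centroid positions of the whole object.
    Returns one score per region: the mean magnitude of its frame-to-frame
    displacement after subtracting the object's displacement.
    """
    region_disp = np.diff(region_tracks, axis=1)           # (R, F-1, 2)
    object_disp = np.diff(object_track, axis=0)            # (F-1, 2)
    residual = region_disp - object_disp[None, :, :]
    return np.linalg.norm(residual, axis=-1).mean(axis=1)  # (R,)

def consistent_parts(scores_per_video, threshold=1.0):
    """Keep regions whose relative-motion score exceeds the threshold in most
    videos, mirroring the idea of pooling evidence across videos rather than
    deciding from a single video."""
    votes = (np.stack(scores_per_video) > threshold).mean(axis=0)
    return np.where(votes > 0.5)[0]
```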