Which Skin Tone Measures are the Most Inclusive? An Investigation of Skin Tone Measures for Machine Learning

Ellis Monk
X Eyee
ACM Journal of Responsible Computing (2024) (to appear)

Abstract

Skin tone plays a critical role in artificial intelligence (AI), especially in biometrics, human sensing, computer vision, and fairness evaluations. However, many algorithms have exhibited unfair bias against people with darker skin tones, leading to misclassifications, poor user experiences, and exclusions in daily life. One reason this occurs is a poor understanding of how well the scales we use to measure and account for skin tone in AI actually represent the variation of skin tones in people affected by these systems. Although the Fitzpatrick scale has become the industry standard for skin tone evaluation in machine learning, its documented bias towards lighter skin tones suggests that other skin tone measures are worth investigating. To address this, we conducted a survey with 2,214 people in the United States to compare three skin tone scales: the Fitzpatrick 6-point scale, Rihanna’s Fenty™ Beauty 40-point skin tone palette, and a newly developed Monk 10-point scale from the social sciences. We find that the Fitzpatrick scale is perceived to be less inclusive than the Fenty and Monk skin tone scales, and that this is especially true for people from historically marginalized communities (i.e., people with darker skin tones, BIPOCs, and women). We also find no statistically meaningful differences in perceived representation between the Monk skin tone scale and the Fenty Beauty palette. Building on this rigorous testing and validation of skin tone measurement, we discuss how our findings can advance the understanding of skin tone in both the social science and machine learning communities.