Crowdsourcing Dermatology Images with Google Search Ads: Creating a Diverse and Representative Dataset of Real-World Skin Conditions

Abbi Ward
Ashley Carrick
Dawn Siegel
Jay Hartford
Jimmy Li
Julie Wang
Justin Ko
Pradeep Kumar S
Renee Wong
Sriram Lakshminarasimhan
Steven Lin
Sunny Virmani
arXiv(2024)

Abstract

Background Health datasets from clinical sources do not reflect the breadth and diversity of disease in the real world, impacting research, medical education and artificial intelligence (AI) tool development. Dermatology is a suitable area to develop and test a new and scalable method to create representative health datasets. Methods We used Google Search advertisements to solicit contributions of images of dermatology conditions, demographic and symptom information from internet users in the United States (US) over 265 days starting March 2023. With informed contributor consent, we described and released this dataset containing 10,106 images from 5058 contributions, with dermatologist labels as well as Fitzpatrick Skin Type and Monk Skin Tone labels for the images. Results We received 22 ± 14 submissions/day over 265 days. Female contributors (66.04%) and younger individuals (52.3% < age 40) had a higher representation in the dataset compared to the US population, and 36.6% of contributors had a non-White racial or ethnic identity. Over 97.5% of contributions were genuine images of skin conditions. Image quality had no impact on dermatologist confidence in assigning a differential diagnosis. The dataset consists largely of short duration (54% with onset < 7 days ago) allergic, infectious, and inflammatory conditions. Fitzpatrick skin type distribution is well-balanced, considering the geographical origin of the dataset and the absence of enrichment for population groups or skin tones. Interpretation Search ads are effective at crowdsourcing images of health conditions. The SCIN dataset bridges important gaps in the availability of representative images of common skin conditions.