Bilson Jake Libres Campana

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements
    Abbi Ward
    Jimmy Li
    Julie Wang
    Sriram Lakshminarasimhan
    Ashley Carrick
    Jay Hartford
    Pradeep Kumar S
    Sunny Virmani
    Renee Wong
    Margaret Ann Smith
    Dawn Siegel
    Steven Lin
    Justin Ko
    JAMA Network Open (2024)
    Preview abstract Importance: Health datasets from clinical sources do not reflect the breadth and diversity of disease, impacting research, medical education, and artificial intelligence tool development. Assessments of novel crowdsourcing methods to create health datasets are needed. Objective: To evaluate if web search advertisements (ads) are effective at creating a diverse and representative dermatology image dataset. Design, Setting, and Participants: This prospective observational survey study, conducted from March to November 2023, used Google Search ads to invite internet users in the US to contribute images of dermatology conditions with demographic and symptom information to the Skin Condition Image Network (SCIN) open access dataset. Ads were displayed against dermatology-related search queries on mobile devices, inviting contributions from adults after a digital informed consent process. Contributions were filtered for image safety and measures were taken to protect privacy. Data analysis occurred January to February 2024. Exposure: Dermatologist condition labels as well as estimated Fitzpatrick Skin Type (eFST) and estimated Monk Skin Tone (eMST) labels. Main Outcomes and Measures: The primary metrics of interest were the number, quality, demographic diversity, and distribution of clinical conditions in the crowdsourced contributions. Spearman rank order correlation was used for all correlation analyses, and the χ2 test was used to analyze differences between SCIN contributor demographics and the US census. Results: In total, 5749 submissions were received, with a median of 22 (14-30) per day. Of these, 5631 (97.9%) were genuine images of dermatological conditions. Among contributors with self-reported demographic information, female contributors (1732 of 2596 contributors [66.7%]) and younger contributors (1329 of 2556 contributors [52.0%] aged <40 years) had a higher representation in the dataset compared with the US population. Of 2614 contributors who reported race and ethnicity, 852 (32.6%) reported a racial or ethnic identity other than White. Dermatologist confidence in assigning a differential diagnosis increased with the number of self-reported demographic and skin-condition–related variables (Spearman R = 0.1537; P < .001). Of 4019 contributions reporting duration since onset, 2170 (54.0%) reported onset within less than 7 days of submission. Of the 2835 contributions that could be assigned a dermatological differential diagnosis, 2523 (89.0%) were allergic, infectious, or inflammatory conditions. eFST and eMST distributions reflected the geographical origin of the dataset. Conclusions and Relevance: The findings of this survey study suggest that search ads are effective at crowdsourcing dermatology images and could therefore be a useful method to create health datasets. The SCIN dataset bridges important gaps in the availability of images of common, short-duration skin conditions. View details
    Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program
    Dr. Paisan Raumviboonsuk
    Dr. Peranut Chotcomwongse
    Rajiv Raman
    Sonia Phene
    Kornwipa Hemarat
    Mongkol Tadarati
    Sukhum Silpa-Archa
    Jirawut Limwattanayingyong
    Chetan Rao
    Oscar Kuruvilla
    Jesse Jung
    Jeffrey Tan
    Surapong Orprayoon
    Chawawat Kangwanwongpaisan
    Ramase Sukumalpaiboon
    Chainarong Luengchaichawang
    Jitumporn Fuangkaew
    Pipat Kongsap
    Lamyong Chualinpha
    Sarawuth Saree
    Srirut Kawinpanitan
    Korntip Mitvongsa
    Siriporn Lawanasakol
    Chaiyasit Thepchatri
    Lalita Wongpichedchai
    Lily Peng
    Nature Partner Journal (npj) Digital Medicine (2019)
    Preview abstract Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. A total of 25,326 gradable retinal images of patients with diabetes from the community-based, nationwide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Relative to human graders, for detecting referable DR (moderate NPDR or worse), the deep learning algorithm had significantly higher sensitivity (0.97 vs. 0.74, p < 0.001), and a slightly lower specificity (0.96 vs. 0.98, p < 0.001). Higher sensitivity of the algorithm was also observed for each of the categories of severe or worse NPDR, PDR, and DME (p < 0.001 for all comparisons). The quadratic-weighted kappa for determination of DR severity levels by the algorithm and human graders was 0.85 and 0.78 respectively (p < 0.001 for the difference). Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening. View details
    Preview abstract Purpose: To present and evaluate a remote, tool-based system and structured grading rubric for adjudicating image-based diabetic retinopathy (DR) grades. Methods: We compared three different procedures for adjudicating DR severity assessments among retina specialist panels, including (1) in-person adjudication based on a previously described procedure (Baseline), (2) remote, tool-based adjudication for assessing DR severity alone (TA), and (3) remote, tool-based adjudication using a feature-based rubric (TA-F). We developed a system allowing graders to review images remotely and asynchronously. For both TA and TA-F approaches, images with disagreement were reviewed by all graders in a round-robin fashion until disagreements were resolved. Five panels of three retina specialists each adjudicated a set of 499 retinal fundus images (1 panel using Baseline, 2 using TA, and 2 using TA-F adjudication). Reliability was measured as grade agreement among the panels using Cohen's quadratically weighted kappa. Efficiency was measured as the number of rounds needed to reach a consensus for tool-based adjudication. Results: The grades from remote, tool-based adjudication showed high agreement with the Baseline procedure, with Cohen's kappa scores of 0.948 and 0.943 for the two TA panels, and 0.921 and 0.963 for the two TA-F panels. Cases adjudicated using TA-F were resolved in fewer rounds compared with TA (P < 0.001; standard permutation test). Conclusions: Remote, tool-based adjudication presents a flexible and reliable alternative to in-person adjudication for DR diagnosis. Feature-based rubrics can help accelerate consensus for tool-based adjudication of DR without compromising label quality. Translational Relevance: This approach can generate reference standards to validate automated methods, and resolve ambiguous diagnoses by integrating into existing telemedical workflows. View details
    Preview abstract Crowdsourcing has enabled the collection, aggregation and refinement of human knowledge and judgment, i.e. ground truth, for problem domains with data of increasing complexity and scale. This scale of ground truth data generation, especially towards the development of machine learning based medical applications that require large volumes of consistent diagnoses, poses significant and unique challenges to quality control. Poor quality control in crowdsourced labeling of medical data can result in undesired effects on patients' health. In this paper, we study medicine-specific quality control problems, including the diversity of grader expertise and diagnosis guidelines' ambiguity in novel datasets of three eye diseases. We present analytical findings on physicians' work patterns, evaluate existing quality control methods that rely on task completion time to circumvent the scarcity and cost problem of generating ground truth medical data, and share our experiences with a real-world system that collects medical labels at scale. View details