We propose to systematically identify potentially problematic patterns in skin disease classification models via quantitative analysis of the agreement between saliency maps and human-labeled regions of interest. We further compute summary statistics describing patterns in this agreement for various stratifications of input examples. Through this analysis, we discover candidate spurious associations learned by the classifier and suggest next steps to handle such associations. Our approach can be used as a debugging tool to systematically spot difficult examples and error categories. Insights from this analysis could guide targeted data collection and improve model generalizability.
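As a minimal sketch of the kind of agreement analysis described above, the snippet below computes intersection-over-union (IoU) between a thresholded saliency map and a binary region-of-interest mask, then summarizes scores per stratum. The choice of IoU as the agreement metric, the 0.5 threshold, and the function names are illustrative assumptions, not the paper's stated method.

```python
import numpy as np

def saliency_roi_agreement(saliency, roi_mask, threshold=0.5):
    # IoU between the thresholded saliency map and the human-labeled ROI mask.
    # IoU is one plausible agreement metric; the threshold is an assumption.
    sal_mask = saliency >= threshold
    roi = roi_mask.astype(bool)
    intersection = np.logical_and(sal_mask, roi).sum()
    union = np.logical_or(sal_mask, roi).sum()
    return float(intersection / union) if union else 0.0

def stratified_summary(scores, strata):
    # Mean and standard deviation of agreement scores within each stratum
    # (e.g. diagnosis class, image source) to surface groups where the
    # model attends outside the labeled region.
    summary = {}
    for stratum in set(strata):
        vals = [s for s, g in zip(scores, strata) if g == stratum]
        summary[stratum] = (float(np.mean(vals)), float(np.std(vals)))
    return summary
```

Strata with consistently low mean agreement are candidates for spurious associations (the model relies on regions clinicians did not mark) and thus targets for follow-up data collection.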