Improving Hit-finding: Multilabel Neural Architecture with DEL
Abstract
DNA-encoded library (DEL) data, often comprising millions of data points, enable large deep learning models to make real contributions to drug discovery (e.g., hit-finding). The state-of-the-art method for modeling DEL data, the GCNN multiclass model, requires domain experts to create mutually exclusive classification labels from the multiple selection readouts of DEL data, which is not always an optimal formulation. In this work, we designed a GCNN multilabel architecture that directly models each selection readout, eliminating the dependency on human expertise. We chose effective settings for key modeling components, such as the label reduction scheme, based on in silico evaluation. To assess performance in real-world drug discovery settings, we further carried out prospective wet-lab testing, in which the multilabel model showed consistent improvement in hit rate (the percentage of hits in a proposed molecule list) over the state-of-the-art multiclass model.
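To make the distinction between the two formulations concrete, the following minimal sketch contrasts a multiclass objective (one softmax over mutually exclusive classes) with a multilabel objective (one independent sigmoid per selection readout), and computes hit rate as defined above. It is an illustration only, not the architecture described in this work: the function names, the use of plain cross-entropy losses, and the example values are all assumptions, and the logits stand in for the outputs of a model such as a GCNN.

```python
import math


def softmax_cross_entropy(logits, class_index):
    """Multiclass loss: exactly one of the mutually exclusive classes is correct."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[class_index]


def sigmoid_binary_cross_entropy(logits, labels):
    """Multilabel loss: each selection readout is an independent binary label.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    averaged over the labels.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def hit_rate(is_hit):
    """Percentage of hits in a proposed molecule list."""
    return 100.0 * sum(is_hit) / len(is_hit)


# Hypothetical example: three model outputs for one molecule.
logits = [2.0, -1.0, 0.5]

# Multiclass: an expert-defined scheme assigns the molecule to a single class.
multiclass_loss = softmax_cross_entropy(logits, class_index=0)

# Multilabel: the molecule is labeled separately for each selection readout.
multilabel_loss = sigmoid_binary_cross_entropy(logits, labels=[1, 0, 1])

# Hit rate for a hypothetical list of four tested molecules, two of them hits.
example_hit_rate = hit_rate([True, False, True, False])
```

In the multilabel form, a molecule can be positive in several selections at once, so no rule for collapsing the readouts into a single class is needed at training time.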