The Effect of Debiasing Protein–Ligand Binding Data on Generalization

Vikram Sundar
J. Chem. Inf. Model., 60(2019), 56–62

Abstract

The structured nature of chemical data means machine-learning models trained to predict protein–ligand binding risk overfitting the data, impairing their ability to generalize and make accurate predictions for novel candidate ligands. Data debiasing algorithms, which systematically partition the data to reduce bias and provide a more accurate metric of model performance, have the potential to address this issue. When models are trained using debiased data splits, the reward for simply memorizing the training data is reduced, suggesting that the ability of the model to make accurate predictions for novel candidate ligands will improve. To test this hypothesis, we use distance-based data splits to measure how well a model can generalize. We first confirm that models perform better for randomly split held-out sets than for distant held-out sets. We then debias the data and find, surprisingly, that debiasing typically reduces the ability of models to make accurate predictions for distant held-out test sets and that model performance measured after debiasing is not representative of the ability of a model to generalize. These results suggest that debiasing reduces the information available to a model, impairing its ability to generalize.

Research Areas