The Effect of Debiasing Protein–Ligand Binding Data on Generalization

Vikram Sundar

Lucy J. Colwell

J. Chem. Inf. Model., 60(2019), 56–62

Abstract

The structured nature of chemical data means machine-learning models trained to predict protein–ligand binding risk overfitting the data, impairing their ability to generalize and make accurate predictions for novel candidate ligands. Data debiasing algorithms, which systematically partition the data to reduce bias and provide a more accurate metric of model performance, have the potential to address this issue. When models are trained using debiased data splits, the reward for simply memorizing the training data is reduced, suggesting that the ability of the model to make accurate predictions for novel candidate ligands will improve. To test this hypothesis, we use distance-based data splits to measure how well a model can generalize. We first confirm that models perform better for randomly split held-out sets than for distant held-out sets. We then debias the data and find, surprisingly, that debiasing typically reduces the ability of models to make accurate predictions for distant held-out test sets and that model performance measured after debiasing is not representative of the ability of a model to generalize. These results suggest that debiasing reduces the information available to a model, impairing its ability to generalize.
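The contrast between a randomly split held-out set and a distant held-out set can be illustrated with a small sketch. The abstract does not specify the distance metric or the splitting procedure, so everything below is an assumption made for illustration: ligands are represented as sets of fingerprint on-bits, distance is Tanimoto distance, and the hypothetical helpers `random_split`, `distant_split`, and `mean_nearest_train_distance` are not taken from the paper.

```python
import random


def tanimoto_distance(a, b):
    """Return 1 - |intersection| / |union| for two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def random_split(n_items, test_fraction, seed=0):
    """Shuffle indices and hold out a random fraction (assumed baseline)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_items * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def distant_split(fingerprints, test_fraction, anchor=0):
    """Hold out the ligands farthest from an anchor ligand.

    This is one simple way to build a distant test set, assumed here
    for illustration; it is not necessarily the paper's procedure.
    """
    n_test = max(1, round(len(fingerprints) * test_fraction))
    ranked = sorted(
        range(len(fingerprints)),
        key=lambda i: tanimoto_distance(fingerprints[anchor], fingerprints[i]),
    )
    return sorted(ranked[:-n_test]), sorted(ranked[-n_test:])


def mean_nearest_train_distance(fingerprints, train, held_out):
    """Average distance from each held-out ligand to its nearest training ligand."""
    nearest = [
        min(tanimoto_distance(fingerprints[t], fingerprints[r]) for r in train)
        for t in held_out
    ]
    return sum(nearest) / len(nearest)


# Toy data: four similar ligands (indices 0-3) and two unrelated ones (4-5).
fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({1, 2, 3, 4}),
    frozenset({1, 3, 4}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]

train, held_out = distant_split(fps, test_fraction=1 / 3)
rand_train, rand_held_out = random_split(len(fps), test_fraction=1 / 3)

print("distant split:", train, held_out,
      mean_nearest_train_distance(fps, train, held_out))
print("random split: ", rand_train, rand_held_out,
      mean_nearest_train_distance(fps, rand_train, rand_held_out))
```

A larger mean nearest-training distance indicates a held-out set that resembles the training data less, which is the regime in which the abstract reports weaker model performance.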
