Missing Data Imputation using Optimal Transport
Abstract
Missing data is a crucial issue when applying machine learning algorithms to real-world datasets.
Starting from the simple assumption that two
batches extracted randomly from the same dataset
should share the same distribution, we leverage
optimal transport distances to quantify that criterion and turn it into a loss function for imputing missing values. We propose practical methods to
minimize these losses using end-to-end learning,
with or without parametric assumptions
on the underlying distributions of values. We
evaluate our methods on datasets from the UCI
repository, in MCAR, MAR and MNAR settings.
These experiments show that OT-based methods
match or outperform state-of-the-art imputation
methods, even for high percentages of missing
values.
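
To make the batch criterion concrete, the sketch below evaluates an optimal transport loss between two batches of the same data set, and between one of them and a deliberately distorted batch. It is a minimal illustration under stated assumptions, not the method evaluated in the experiments: the abstract does not say which OT distance is used, so the debiased entropic variant (Sinkhorn divergence), the squared Euclidean ground cost, the values of `eps` and `n_iter`, and all function names are assumptions made here.

```python
import math
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def entropic_ot(xs, ys, eps=0.5, n_iter=100):
    """Entropy-regularized OT cost between two uniformly weighted batches.

    Runs log-domain Sinkhorn updates and returns the dual objective.
    """
    n, m = len(xs), len(ys)
    cost = [[sq_dist(x, y) for y in ys] for x in xs]
    log_a, log_b = -math.log(n), -math.log(m)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        f = [-eps * logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(n)])
             for j in range(m)]
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(xs, ys, eps=0.5, n_iter=100):
    """Debiased OT loss: zero when both batches are identical."""
    return (entropic_ot(xs, ys, eps, n_iter)
            - 0.5 * entropic_ot(xs, xs, eps, n_iter)
            - 0.5 * entropic_ot(ys, ys, eps, n_iter))


rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(32)]
batch_a, batch_b = data[:16], data[16:]
# Hypothetical poor imputation: every value in batch_b is pushed far off.
batch_bad = [[v + 5.0 for v in row] for row in batch_b]

loss_same = sinkhorn_divergence(batch_a, batch_b)
loss_bad = sinkhorn_divergence(batch_a, batch_bad)
print(loss_same, loss_bad)
```

The distorted batch yields a much larger loss than the batch drawn from the same data. Turning this into an imputation procedure would mean treating the missing entries as free variables and minimizing the loss with respect to them, for example with an automatic differentiation library; that step is omitted from this sketch.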