Mathieu Blondel
I obtained my PhD in machine learning from Kobe University, Japan, in 2013. From 2013 to 2019, I was a researcher at NTT Communication Science Laboratories in Kyoto, Japan. I am now a research scientist at Google Research, Brain team, in Paris, France. Check my homepage for more info.
Authored Publications
Learning Energy Networks with Generalized Fenchel-Young Losses
Felipe Llinares
Léonard Hussenot
Matthieu Geist
Neural Information Processing Systems (NeurIPS) (2022)
Abstract
Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
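To make the construction concrete, here is a minimal sketch in standard Fenchel-Young loss notation; the symbols below ($\Omega$ for the regularizer, $\Phi$ for the energy, $\theta$ for the parameters, $y$ for the target, $p^*(\theta)$ for the inner argmax) are notational assumptions rather than quotations from the paper.

$$L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle, \qquad \Omega^*(\theta) = \max_p \, \langle \theta, p \rangle - \Omega(p)$$
$$\Omega^\Phi(\theta) = \max_p \, \Phi(\theta, p) - \Omega(p), \qquad L_\Omega^\Phi(\theta; y) = \Omega^\Phi(\theta) + \Omega(y) - \Phi(\theta, y) \ge 0$$
$$\nabla_\theta L_\Omega^\Phi(\theta; y) = \nabla_\theta \Phi(\theta, p^*(\theta)) - \nabla_\theta \Phi(\theta, y)$$

The first line is the usual Fenchel-Young loss with its bilinear pairing; the second replaces that pairing with a general energy $\Phi$; the third follows from the envelope (Danskin) theorem under suitable concavity and regularity assumptions on $\Phi$, so the loss gradient only needs the value of the inner argmax, not its derivative.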
Differentiable Divergences Between Time Series
Arthur Mensch
Jean-Philippe Vert
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR (2021), pp. 3853-3861
Abstract
Computing the discrepancy between time series of variable sizes is notoriously challenging. While dynamic time warping (DTW) is widely used for this purpose, it is not differentiable everywhere and is known to lead to bad local optima when used as a "loss". Soft-DTW addresses these issues, but it is not a positive definite divergence: due to the bias introduced by entropic regularization, it can be negative and it is not minimized when the time series are equal. In this paper, we propose a new divergence, dubbed the soft-DTW divergence, which aims to correct these issues. We study its properties; in particular, under conditions on the ground cost, we show that it is non-negative and minimized when the time series are equal. We also propose a new "sharp" variant that further removes the entropic bias. We showcase our divergences on time series averaging and demonstrate significant accuracy improvements compared to both DTW and soft-DTW on 84 time series classification datasets.
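As a rough illustration of the debiasing idea, here is a minimal NumPy sketch: it implements soft-DTW for one-dimensional series with a squared ground cost via the standard O(nm) dynamic program, then forms the divergence by subtracting the self-terms. The function names are illustrative and this is not the paper's implementation.

import numpy as np

def soft_min(values, gamma):
    # Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably.
    values = np.asarray(values) / -gamma
    m = values.max()
    return -gamma * (m + np.log(np.exp(values - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    # Soft-DTW dynamic program for 1-D series with squared Euclidean ground cost.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    cost = (x[:, None] - y[None, :]) ** 2
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]

def soft_dtw_divergence(x, y, gamma=1.0):
    # Debiased divergence: subtracting the self-terms removes the entropic bias,
    # so the divergence is zero when the two series are equal.
    return soft_dtw(x, y, gamma) - 0.5 * (soft_dtw(x, x, gamma) + soft_dtw(y, y, gamma))

For example, soft_dtw_divergence(x, x, gamma=0.1) is exactly zero by construction, whereas plain soft_dtw(x, x, gamma=0.1) is generally negative.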
Implicit differentiation of Lasso-type models for hyperparameter optimization
Quentin Bertrand
Quentin Klopfenstein
Samuel Vaiter
Alexandre Gramfort
Joseph Salmon
ICML (2020)
Abstract
Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial in practice. The most popular hyperparameter optimization approach is grid-search using held-out validation data. Grid-search, however, requires choosing a predefined grid for each parameter, which scales exponentially in the number of parameters. Another approach is to cast hyperparameter optimization as a bi-level optimization problem, which can be solved by gradient descent. The key challenge for these methods is the estimation of the gradient with respect to the hyperparameters. Computing this gradient via forward or backward automatic differentiation is possible, yet usually suffers from high memory consumption. Alternatively, implicit differentiation typically involves solving a linear system, which can be prohibitive and numerically unstable in high dimensions. In addition, implicit differentiation usually assumes smooth loss functions, which is not the case for Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. Our approach scales to high-dimensional data by leveraging the sparsity of the solutions. Experiments demonstrate that the proposed method outperforms a large number of standard methods for optimizing the error on held-out data, or the Stein Unbiased Risk Estimator (SURE).
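A minimal sketch of the implicit-differentiation idea, restricted to the support of the Lasso solution: it uses scikit-learn's Lasso and a small dense linear solve for clarity, whereas the paper's algorithm avoids the matrix inversion. The function name, the squared validation loss, and the assumption that the support and signs are locally constant in the regularization strength are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def lasso_hypergradient(X_tr, y_tr, X_val, y_val, alpha):
    # scikit-learn's Lasso minimizes (1/(2n)) ||y - Xw||^2 + alpha ||w||_1.
    n = X_tr.shape[0]
    w = Lasso(alpha=alpha, fit_intercept=False).fit(X_tr, y_tr).coef_
    S = np.flatnonzero(w)                 # support of the sparse solution
    if S.size == 0:
        return 0.0
    X_S = X_tr[:, S]
    # Optimality on the support: (1/n) X_S^T (X_S w_S - y) + alpha * sign(w_S) = 0.
    # Differentiating w.r.t. alpha (support and signs assumed locally constant):
    # dw_S/dalpha = -n (X_S^T X_S)^{-1} sign(w_S), assuming X_S^T X_S is invertible.
    dw_S = -n * np.linalg.solve(X_S.T @ X_S, np.sign(w[S]))
    # Chain rule through the validation loss 0.5 * ||y_val - X_val w||^2.
    residual = y_val - X_val @ w
    grad_w = -X_val[:, S].T @ residual
    return grad_w @ dw_S

The resulting scalar can be fed to any first-order method to tune alpha on held-out data; leveraging the sparsity of the solution is what keeps the linear system small.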
Learning with Differentiable Perturbed Optimizers
Olivier Teboul
Marco Cuturi
Jean-Philippe Vert
Francis Bach
Advances in Neural Information Processing Systems 33 (NeurIPS), Curran Associates, Inc. (2020), pp. 9508-9519
Abstract
Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g. sorting, picking closest neighbors, finding shortest paths or optimal matchings). Although these discrete decisions are easily computed in a forward manner, they cannot be used to modify model parameters using first-order optimization techniques because they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform a block that outputs an optimal discrete decision into a differentiable operation. Our approach relies on stochastic perturbations of these parameters, and can be used readily within existing solvers without the need for ad hoc regularization or smoothing. These perturbed optimizers yield solutions that are differentiable and never locally constant. The amount of smoothness can be tuned via the chosen noise amplitude, whose impact we analyze. The derivatives of these perturbed solvers can be evaluated efficiently. We also show how this framework can be connected to a family of losses developed in structured prediction, and describe how these can be used in unsupervised and supervised learning, with theoretical guarantees. We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.
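A small Monte Carlo sketch of the recipe for the simplest case, an argmax over one-hot decisions: averaging the hard decision under Gaussian perturbations of the input scores yields a smooth map, and for Gaussian noise a simple estimator of its Jacobian is available. Function names are illustrative; this is not the paper's code.

import numpy as np

def perturbed_argmax(theta, sigma=1.0, n_samples=1000, rng=None):
    # Monte Carlo estimate of E[one_hot(argmax(theta + sigma * Z))], Z ~ N(0, I).
    # Averaging over noise turns the piecewise-constant argmax into a smooth map.
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal((n_samples, len(theta)))
    idx = np.argmax(theta + sigma * noise, axis=1)
    one_hot = np.zeros((n_samples, len(theta)))
    one_hot[np.arange(n_samples), idx] = 1.0
    return one_hot.mean(axis=0)

def perturbed_argmax_jacobian(theta, sigma=1.0, n_samples=1000, rng=None):
    # One common estimator, valid for Gaussian noise:
    # Jacobian ~= E[one_hot(argmax(theta + sigma * Z)) Z^T] / sigma.
    rng = np.random.default_rng(rng)
    d = len(theta)
    noise = rng.standard_normal((n_samples, d))
    idx = np.argmax(theta + sigma * noise, axis=1)
    one_hot = np.zeros((n_samples, d))
    one_hot[np.arange(n_samples), idx] = 1.0
    return one_hot.T @ noise / (n_samples * sigma)

The noise amplitude sigma plays the role of the smoothing parameter discussed above: larger values give smoother (but more biased) decisions, smaller values approach the hard argmax.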
Fast Differentiable Sorting and Ranking
Abstract
Sorting is an elementary building block of modern software. In machine learning and statistics, it is commonly used in robust statistics, order statistics and ranking metrics. However, sorting is a piecewise linear function and as a result includes many kinks at which it is non-differentiable. More problematically, the ranking operator is a piecewise constant function, meaning that its derivatives are null or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the $O(n \log n)$ time complexity one could expect from a sorting or ranking operation. In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. In addition, our proposal enjoys exact computation and differentiation. We achieve this feat by casting differentiable sorting and ranking as projections onto a permutahedron, the convex hull of permutations, and using a reduction to isotonic optimization. Empirically, we confirm that our approach is an order of magnitude faster than existing approaches. We also showcase two novel applications: differentiable Spearman's rank correlation coefficient and differentiable least trimmed squares.
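A compact NumPy sketch of the soft ranking operator described above, with the projection onto the permutahedron reduced to isotonic regression (here delegated to scikit-learn's pool-adjacent-violators solver). This follows the reduction stated in the abstract but is not the authors' implementation; the function names and the descending-rank convention are assumptions made for illustration.

import numpy as np
from sklearn.isotonic import isotonic_regression

def permutahedron_projection(z, w):
    # Euclidean projection of z onto the permutahedron generated by w
    # (w given in decreasing order), via a reduction to isotonic regression.
    z = np.asarray(z, dtype=float)
    sigma = np.argsort(-z)                        # permutation sorting z in decreasing order
    s = z[sigma]
    v = isotonic_regression(s - w, increasing=False)   # non-increasing PAV fit
    out = np.empty_like(z)
    out[sigma] = s - v                            # undo the sorting permutation
    return out

def soft_rank(theta, eps=1.0):
    # Differentiable ranks (1 = largest entry): hard ranks are recovered as eps -> 0,
    # and all ranks shrink toward the average rank as eps grows.
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    rho = np.arange(n, 0, -1, dtype=float)        # (n, n-1, ..., 1)
    return permutahedron_projection(-theta / eps, rho)

For example, soft_rank(np.array([3.0, 1.0, 2.0]), eps=0.1) is close to the hard ranks (1, 3, 2), while larger eps yields smoother, non-integer ranks that can be backpropagated through.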