 
Mathieu Blondel

I obtained my PhD in machine learning from Kobe University, Japan, in 2013. From 2013 to 2019, I was a researcher at NTT Communication Science Laboratories in Kyoto, Japan. I am now a research scientist at Google Research, Brain team, in Paris, France. Check my homepage for more info.
Authored Publications

Learning Energy Networks with Generalized Fenchel-Young Losses
With Felipe Llinares, Léonard Hussenot, Matthieu Geist
Neural Information Processing Systems (NeurIPS) (2022)
Abstract: Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
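
To make the construction concrete, here is a minimal NumPy sketch, assuming a toy linear-concave energy E_W(mu, x) = <mu, W x> and a quadratic regularizer Omega(mu) = 0.5 ||mu||^2, so that the inference step has a closed form. The names (energy, inference, gfy_loss, gfy_grad) and the toy data are mine, not the paper's; the point is only that the loss gradient follows from an envelope-theorem argument and never differentiates through the argmax.

```python
# Minimal sketch (my own toy example, not the paper's code): a generalized
# Fenchel-Young loss for the linear-concave energy E_W(mu, x) = <mu, W x>
# with quadratic regularizer Omega(mu) = 0.5 * ||mu||^2, so inference is in
# closed form and the loss gradient never differentiates through the argmax.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))   # parameters of the energy function
x = rng.normal(size=d_in)            # input
y = rng.normal(size=d_out)           # target output

def energy(W, mu, x):                # E_W(mu, x) = <mu, W x>
    return mu @ (W @ x)

def inference(W, x):                 # argmax_mu E_W(mu, x) - 0.5 * ||mu||^2
    return W @ x                     # closed form for this toy energy

def gfy_loss(W, x, y):               # generalized conjugate + Omega(y) - E_W(y, x)
    mu = inference(W, x)
    return (energy(W, mu, x) - 0.5 * mu @ mu) + 0.5 * y @ y - energy(W, y, x)

def gfy_grad(W, x, y):
    # Envelope theorem: only grad_W E evaluated at mu_hat and at y is needed,
    # no argmin/argmax differentiation.
    mu = inference(W, x)
    return np.outer(mu, x) - np.outer(y, x)

# Finite-difference check of the gradient formula.
eps, G = 1e-6, np.zeros_like(W)
for i in range(d_out):
    for j in range(d_in):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        G[i, j] = (gfy_loss(Wp, x, y) - gfy_loss(Wm, x, y)) / (2 * eps)
print(np.allclose(G, gfy_grad(W, x, y), atol=1e-5))  # True
```

For a genuine energy network, the closed-form inference step would be replaced by an inner optimization over mu, but the two-term gradient above keeps the same form, which is the property the abstract highlights.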
              
  
Differentiable Divergences Between Time Series
With Arthur Mensch, Jean-Philippe Vert
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR (2021), pp. 3853-3861
Abstract: Computing the discrepancy between time series of variable sizes is notoriously challenging. While dynamic time warping (DTW) is popularly used for this purpose, it is not differentiable everywhere and is known to lead to bad local optima when used as a "loss". Soft-DTW addresses these issues, but it is not a positive definite divergence: due to the bias introduced by entropic regularization, it can be negative and it is not minimized when the time series are equal. We propose in this paper a new divergence, dubbed soft-DTW divergence, which aims to correct these issues. We study its properties; in particular, under conditions on the ground cost, we show that it is non-negative and minimized when the time series are equal. We also propose a new "sharp" variant by further removing entropic bias. We showcase our divergences on time series averaging and demonstrate significant accuracy improvements compared to both DTW and soft-DTW on 84 time series classification datasets.
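
As a rough illustration of how such a divergence is assembled, the NumPy sketch below implements the standard soft-DTW recursion with a soft-min and a squared Euclidean ground cost, then subtracts the two self-comparison terms to form the debiased divergence. Function names and the toy series are my own assumptions, not the authors' code, and the quadratic-time recursion is written for clarity rather than speed.

```python
# Illustrative NumPy sketch (assumed names, not the authors' code): soft-DTW
# with a squared Euclidean ground cost, and the debiased divergence
# D(X, Y) = SDTW(X, Y) - 0.5 * SDTW(X, X) - 0.5 * SDTW(Y, Y).
import numpy as np

def softmin(a, b, c, gamma):
    # Smoothed minimum: -gamma * logsumexp(-(a, b, c) / gamma).
    vals = np.array([a, b, c]) / -gamma
    m = vals.max()
    return -gamma * (m + np.log(np.exp(vals - m).sum()))

def soft_dtw(X, Y, gamma=1.0):
    n, m = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = C[i - 1, j - 1] + softmin(R[i - 1, j - 1],
                                                R[i - 1, j],
                                                R[i, j - 1], gamma)
    return R[n, m]

def soft_dtw_divergence(X, Y, gamma=1.0):
    return (soft_dtw(X, Y, gamma)
            - 0.5 * soft_dtw(X, X, gamma)
            - 0.5 * soft_dtw(Y, Y, gamma))

X = np.random.default_rng(0).normal(size=(5, 2))   # series of length 5 in R^2
Y = np.random.default_rng(1).normal(size=(7, 2))   # series of length 7 in R^2
print(soft_dtw_divergence(X, Y))   # non-negative for this ground cost
print(soft_dtw_divergence(X, X))   # exactly 0 when the series are equal
```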
              
  
Implicit differentiation of Lasso-type models for hyperparameter optimization
With Quentin Bertrand, Quentin Klopfenstein, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon
ICML (2020)
Abstract: Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial in practice. The most popular hyperparameter optimization approach is grid-search using held-out validation data. Grid-search, however, requires choosing a predefined grid for each parameter, which scales exponentially in the number of parameters. Another approach is to cast hyperparameter optimization as a bi-level optimization problem that can be solved by gradient descent. The key challenge for these methods is the estimation of the gradient with respect to the hyperparameters. Computing this gradient via forward or backward automatic differentiation is possible, yet usually suffers from high memory consumption. Alternatively, implicit differentiation typically involves solving a linear system, which can be prohibitive and numerically unstable in high dimension. In addition, implicit differentiation usually assumes smooth loss functions, which is not the case for Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. Our approach scales to high-dimensional data by leveraging the sparsity of the solutions. Experiments demonstrate that the proposed method outperforms a large number of standard methods at optimizing the error on held-out data or the Stein Unbiased Risk Estimator (SURE).
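
A minimal sketch of the general recipe, under my own simplifications (a plain Lasso, scikit-learn for the inner solver, synthetic data): fit the estimator, restrict the optimality conditions to the support of the solution, and solve one small linear system to obtain the hypergradient of the held-out loss with respect to the regularization parameter. Variable names and the validation objective are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch, under my own simplifications (not the paper's code): the
# hypergradient of a held-out loss w.r.t. the Lasso regularization parameter,
# via implicit differentiation restricted to the support of the solution.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_train, n_val, p = 100, 50, 30
X, X_val = rng.normal(size=(n_train, p)), rng.normal(size=(n_val, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.normal(size=n_train)
y_val = X_val @ beta_true + 0.1 * rng.normal(size=n_val)

def val_loss_and_hypergrad(alpha):
    # scikit-learn objective: ||y - X b||^2 / (2 n) + alpha * ||b||_1
    beta = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    S = np.flatnonzero(beta)                 # support of the Lasso solution
    # Differentiating the optimality conditions restricted to the support:
    # X_S^T X_S  d(beta_S)/d(alpha) = -n * sign(beta_S)
    dbeta_S = np.linalg.solve(X[:, S].T @ X[:, S], -n_train * np.sign(beta[S]))
    resid = X_val @ beta - y_val
    loss = 0.5 * np.mean(resid ** 2)
    hypergrad = (X_val[:, S] @ dbeta_S) @ resid / n_val
    return loss, hypergrad

loss, grad = val_loss_and_hypergrad(alpha=0.05)
print(loss, grad)   # held-out loss and d(loss)/d(alpha) at alpha = 0.05
```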
              
  
Learning with Differentiable Perturbed Optimizers
With Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, Francis Bach
Advances in Neural Information Processing Systems 33 (NeurIPS), Curran Associates, Inc. (2020), pp. 9508-9519
Abstract: Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g., sorting, picking closest neighbors, finding shortest paths or optimal matchings). Although these discrete decisions are easily computed in a forward manner, they cannot be used to modify model parameters using first-order optimization techniques because they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform a block that outputs an optimal discrete decision into a differentiable operation. Our approach relies on stochastic perturbations of these parameters, and can be used readily within existing solvers without the need for ad hoc regularization or smoothing. These perturbed optimizers yield solutions that are differentiable and never locally constant. The amount of smoothness can be tuned via the chosen noise amplitude, whose impact we analyze. The derivatives of these perturbed solvers can be evaluated efficiently. We also show how this framework can be connected to a family of losses developed in structured prediction, and describe how these can be used in unsupervised and supervised learning, with theoretical guarantees. We demonstrate the performance of our approach on several machine learning tasks in experiments on synthetic and real data.
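
Below is a hedged sketch of the idea in the simplest setting, an argmax over one-hot vectors: averaging the argmax of Gaussian-perturbed scores gives a smoothed solution, and the same samples give a Monte Carlo estimate of its Jacobian, so nothing is differentiated through the discrete argmax. The function perturbed_argmax and the Jacobian estimator below reflect my reading of the abstract, not the released implementation.

```python
# Hedged sketch (my reading of the abstract, not the released code): a discrete
# argmax over one-hot vectors is smoothed by averaging argmax(theta + eps * Z)
# over Gaussian perturbations Z, and the same samples give a Monte Carlo
# estimate of the Jacobian, without differentiating through the argmax.
import numpy as np

def perturbed_argmax(theta, eps=0.5, n_samples=10_000, seed=0):
    Z = np.random.default_rng(seed).normal(size=(n_samples, theta.size))
    idx = np.argmax(theta + eps * Z, axis=1)   # hard argmax for each perturbed score
    Y = np.eye(theta.size)[idx]                # corresponding one-hot solutions
    y_eps = Y.mean(axis=0)                     # smoothed solution, never locally constant
    jac = Y.T @ Z / (n_samples * eps)          # Monte Carlo Jacobian estimate (Gaussian noise)
    return y_eps, jac

theta = np.array([1.0, 2.0, 1.5])
y_eps, jac = perturbed_argmax(theta)
print(y_eps)   # a point inside the simplex; tends to the one-hot (0, 1, 0) as eps -> 0
print(jac)     # finite everywhere, usable for back-propagation
```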
              
  
Fast Differentiable Sorting and Ranking
ICML (2020)
Abstract: Sorting is an elementary building block of modern software. In machine learning and statistics, it is commonly used in robust statistics, order statistics and ranking metrics. However, sorting is a piecewise linear function and as a result includes many kinks at which it is non-differentiable. More problematic still, the ranking operator is a piecewise constant function, meaning that its derivatives are null or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the $O(n \log n)$ time complexity one could expect from a sorting or ranking operation. In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. In addition, our proposal enjoys exact computation and differentiation. We achieve this feat by casting differentiable sorting and ranking as projections onto a permutahedron, the convex hull of permutations, and using a reduction to isotonic optimization. Empirically, we confirm that our approach is an order of magnitude faster than existing approaches. We also showcase two novel applications: differentiable Spearman's rank correlation coefficient and differentiable least trimmed squares.
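
The sketch below illustrates the reduction mentioned in the abstract under my own conventions (the largest entry receives rank n): a soft rank is computed as the projection of theta / eps onto the permutahedron of (n, ..., 1), and that projection is obtained from a single isotonic regression, here borrowed from scikit-learn. This is only an illustration of the idea, not the official implementation.

```python
# Rough sketch under my own conventions (largest entry gets rank n), not the
# official implementation: a quadratically regularized "soft rank" computed as
# the projection of theta / eps onto the permutahedron of rho = (n, ..., 1),
# which reduces to a single isotonic regression.
import numpy as np
from sklearn.isotonic import isotonic_regression

def soft_rank(theta, eps=1.0):
    n = theta.size
    rho = np.arange(n, 0, -1).astype(float)   # (n, n-1, ..., 1)
    z = theta / eps
    perm = np.argsort(-z)                     # sort z in decreasing order
    s = z[perm]
    v = isotonic_regression(s - rho, increasing=False)  # decreasing isotonic fit (PAV)
    dual = np.empty(n)
    dual[perm] = v                            # undo the sort
    return z - dual                           # Euclidean projection onto P(rho)

theta = np.array([0.3, 2.0, -1.0, 0.5])
print(soft_rank(theta, eps=100.0))  # heavy smoothing: all entries near 2.5
print(soft_rank(theta, eps=0.01))   # close to the hard ranks [2, 4, 1, 3]
```

As eps decreases the output approaches the hard ranks, and because the pool-adjacent-violators solution is constant on blocks, the operator stays cheap to differentiate exactly.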
              
  