
Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Peter Bartlett
Steven Evans
Phil Long
arXiv (2018)

Abstract

We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ \cdots \circ h_1$ of functions $h_1, \ldots, h_m$ that are close to the identity in the sense that each $(h_i - \mathrm{Id})$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider a nonlinear regression problem with a composition of near-identity nonlinear maps. We show that any critical point of the quadratic criterion in this near-identity region, with criticality taken with respect to Fréchet derivatives in the respective $h_1, \ldots, h_m$, must be a global minimizer. In contrast, for residual networks with tanh activation functions, we show that there are critical points with respect to the network parameters at near-identity points that are not minimizers, even in the realizable case.
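The following sketch (not from the paper; the layer form, the $1/m$ scaling, and the spectral normalization are illustrative assumptions) shows the kind of near-identity residual block the abstract describes: each layer computes $h_i(x) = x + g_i(x)/m$ with $g_i$ 1-Lipschitz, so $(h_i - \mathrm{Id})$ has Lipschitz constant at most $1/m$, shrinking inversely with depth:

import numpy as np

rng = np.random.default_rng(0)

d, m = 2, 16  # input dimension and number of composed near-identity maps

# Each layer computes h_i(x) = x + g_i(x) / m with g_i(x) = tanh(W_i x).
# Normalizing W_i to unit spectral norm makes g_i 1-Lipschitz (tanh is
# itself 1-Lipschitz), so (h_i - Id) has Lipschitz constant at most 1/m.
Ws = [rng.standard_normal((d, d)) for _ in range(m)]
Ws = [W / np.linalg.norm(W, 2) for W in Ws]

def layer(x, W):
    return x + np.tanh(x @ W.T) / m  # near-identity residual block

def network(x):
    for W in Ws:
        x = layer(x, W)
    return x

# Empirical check: the Lipschitz ratio of (h_1 - Id) on a random pair
# of points never exceeds 1/m.
x, y = rng.standard_normal((2, d))
ratio = (np.linalg.norm((layer(x, Ws[0]) - x) - (layer(y, Ws[0]) - y))
         / np.linalg.norm(x - y))
print(f"empirical Lipschitz ratio {ratio:.4f} <= 1/m = {1/m:.4f}")

Normalizing each weight matrix to unit spectral norm is one simple way to enforce the 1-Lipschitz property of $g_i$; any other construction that bounds the Lipschitz constant of the residual branch by $1/m$ would illustrate the same point.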