Representing smooth functions as compositions of near-identity functions with implications for deep network optimization
Abstract
We show that any smooth bi-Lipschitz $h$ can be represented exactly
as a composition $h_m \circ \cdots \circ h_1$ of
functions $h_1,\dots,h_m$ that are close to the identity in
the sense that each $h_i - \Id$ is Lipschitz, and the
Lipschitz constant decreases inversely with the number $m$ of functions
composed.
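To make the shape of such a decomposition concrete, one heuristic construction (our illustration; the proof in the paper may proceed differently) linearly interpolates between the identity and $h$ and takes ratios of consecutive interpolants: with
\[
g_t = (1-t)\,\Id + t\,h, \qquad h_i = g_{i/m} \circ g_{(i-1)/m}^{-1}, \quad i = 1,\dots,m,
\]
the composition telescopes to $h_m \circ \cdots \circ h_1 = g_1 \circ g_0^{-1} = h$, and
\[
h_i - \Id = \bigl(g_{i/m} - g_{(i-1)/m}\bigr) \circ g_{(i-1)/m}^{-1}
= \tfrac{1}{m}\,(h - \Id) \circ g_{(i-1)/m}^{-1},
\]
so, whenever each $g_t$ is invertible and bi-Lipschitz (which is where the bi-Lipschitz assumption on $h$ enters), the Lipschitz constant of $h_i - \Id$ is at most $\frac{1}{m}\,\mathrm{Lip}(h - \Id)\,\mathrm{Lip}\bigl(g_{(i-1)/m}^{-1}\bigr) = O(1/m)$.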
This implies that $h$ can be represented to any accuracy by
a deep residual network whose nonlinear layers compute functions with
a small Lipschitz constant.
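Equivalently, in residual-network terms, setting $g_i = h_i - \Id$ rewrites the decomposition as a stack of identity-plus-perturbation layers,
\[
h = (\Id + g_m) \circ \cdots \circ (\Id + g_1), \qquad \mathrm{Lip}(g_i) = O(1/m),
\]
so each nonlinear layer only needs to realize a function with a small Lipschitz constant.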
Next, we consider a nonlinear regression
problem with a composition of near-identity nonlinear maps. We show
that any critical point of the quadratic criterion in this
near-identity region, with respect to the Fr\'echet derivatives in
the maps $h_1,\dots,h_m$, must be a global minimizer.
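A brief heuristic for this (our sketch, not the paper's proof): for the quadratic criterion
\[
Q(h_1,\dots,h_m) = \mathbb{E}\,\bigl\| h_m \circ \cdots \circ h_1(X) - Y \bigr\|^2,
\]
perturbing the outermost map in a direction $\Delta$ gives the Fr\'echet derivative
\[
D_{h_m} Q\,[\Delta] = 2\,\mathbb{E}\,\bigl\langle h_m \circ \cdots \circ h_1(X) - Y,\ \Delta\bigl(h_{m-1} \circ \cdots \circ h_1(X)\bigr) \bigr\rangle .
\]
In the near-identity region the inner composition $h_{m-1} \circ \cdots \circ h_1$ is bi-Lipschitz, so $\Delta$ can be chosen to track the residual at every input, making this derivative proportional to the excess risk; hence the functional gradient can only vanish when the risk is already minimal, and the well-conditioned Jacobians of the downstream layers let the same argument reach the inner maps.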
In contrast,
for residual networks with tanh activation functions, we show
that the quadratic criterion has critical points with respect to
the network parameters, at near-identity points,
that are {\em not} minimizers, even
in the realizable case.
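To illustrate the parametric phenomenon concretely (a toy instance we construct here, not the example from the paper): the one-dimensional residual unit $x \mapsto x + u\tanh(wx)$ has an exact critical point of the quadratic criterion at $(u,w) = (0,0)$, where the layer is exactly the identity, because the partial derivative in $u$ carries a factor of $\tanh(wx)$ and the partial derivative in $w$ carries a factor of $u$. If the target is realizable but differs from the identity, that critical point is not a minimizer. A minimal numerical check (the names and the target parameters $u^\ast = 0.5$, $w^\ast = 1$ are our choices):
\begin{verbatim}
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (10_000,))   # inputs X ~ N(0, 1)
y = x + 0.5 * jnp.tanh(1.0 * x)         # realizable target: u* = 0.5, w* = 1

def loss(params):
    u, w = params
    pred = x + u * jnp.tanh(w * x)      # one tanh residual unit
    return jnp.mean((pred - y) ** 2)

p0 = jnp.array([0.0, 0.0])              # the identity point u = w = 0
print(loss(p0))                          # strictly positive
print(jax.grad(loss)(p0))                # [0., 0.]: an exact critical point
print(loss(jnp.array([0.5, 1.0])))       # 0.0: the realizable optimum
\end{verbatim}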