Escaping the Gravitational Pull of Softmax
Abstract
The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions.
Unfortunately, the softmax poses serious drawbacks for gradient descent optimization. We establish two negative results for this transform: (1) optimizing any expectation with respect to the softmax must exhibit extreme sensitivity to parameter initialization
(``the softmax gravity well''), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence (``softmax damping''). Both findings are based on an analysis of convergence rates using the Łojasiewicz inequality. To circumvent these shortcomings, we investigate an alternative transformation, the escort (p-norm) mapping, which demonstrates better optimization properties. In addition to proving convergence-rate bounds that firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.
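For concreteness, the softmax maps a parameter vector $\theta \in \mathbb{R}^n$ to a categorical distribution via exponential normalization, while the escort (p-norm) mapping, sketched below in one common form with $p \ge 1$ (an assumed formulation, not a definition taken from this abstract), normalizes powers of the parameter magnitudes:
\[
\mathrm{softmax}(\theta)_i \;=\; \frac{\exp(\theta_i)}{\sum_{j=1}^{n} \exp(\theta_j)},
\qquad
\mathrm{escort}_p(\theta)_i \;=\; \frac{|\theta_i|^p}{\sum_{j=1}^{n} |\theta_j|^p}.
\]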