Phil Long
Research Areas
Authored Publications
Google Publications
Other Publications
Sort By
Preview abstract
van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the ``unhinged'' loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with
accuracy no better than random guessing.
View details
Preview abstract
We consider the problem of sampling from a strongly log-concave density in $\Re{d}$, and prove an information theoretic \emph{lower bound} on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information theoretic limit for all these algorithms.
We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least $\epsilon$ away from the target in total variation distance if the number of gradient queries is less than $\Omega(\var d/\epsilon^2)$, where $\var d$ is the variance of the stochastic gradient. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments along with standard information theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge our results provide the first nontrivial dimension-dependent lower bound for this problem.
View details
Preview abstract
We consider bounds on the generalization ability of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose —it can be bounded below by a constant when the true excess risk goes to zero.
View details
Superlinear integrality gaps for the minimum majority problem
siam journal on discrete mathematics, vol. 35(4) (2021), pp. 3004-3016 (to appear)
Preview abstract
The minimum majority problem is as follows:
given an m-by-n matrix A with {-1,1} entries, minimize x_1 + ... + x_n subject to A x > 0 for x = (x_1,...,x_n), where x_1,...,x_n are non-negative integers. An approximation algorithm
that finds a solution with value O(opt^2 log m) in poly(m,n,opt) is known, which can be obtained by
rounding a linear programming relaxation.
We establish integrality gaps that limit the prospects for improving
on this guarantee through improved rounding and/or the application of
Lov'asz-Schrijver or Sherali-Adams tightening of the relaxation.
These gaps show that these methods cannot improve on the
O(opt^2 log m) guarantee by more than a constant factor in
polynomial time.
View details
Preview abstract
We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the
training error is driven to zero. We analyze this algorithm applied to random
data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that,
with sufficient overparameterization, the maximum margin algorithm trained on noisy data can achieve
nearly optimal population risk.
View details
Preview abstract
We study the training of finite-width two-layer (smoothed) ReLU networks for
binary classification using the logistic loss. We show that gradient descent drives the training loss to zero, if the initial loss is small enough. When the data satisfies certain cluster and separation conditions, and the network is wide
enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies. (In contrast, all past analyses that we know of fixed-width networks do not guarantee that the training loss goes to zero.)
View details
Preview abstract
We prove
that gradient
descent applied to fixed-width deep networks with the logistic
loss converges, and prove bounds on the rate
of convergence. Our analysis applies for smoothed approximations to the ReLU proposed in previous applied work such as Swish and the Huberized ReLU. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
View details
New bounds on the price of bandit feedback for mistake-bounded online multiclass learning
Theoretical Computer Science, vol. 808 (2020), pp. 159-163
Preview abstract
We study two generalizations of the
mistake bound model to online multiclass classification.
In the standard model, the learner receives the
correct classification at the end of each round, and
in the bandit model, the learner only finds out whether
its prediction was correct or not. For a set F of multiclass
classifiers, let opts(F) and optb(F) be the optimal bounds
for learning F according to these two models. We show that an
optb(F) < (1 + o(1)) (|Y| ln |Y|) opts(F) bound is the best possible
up to the leading constant, closing a Theta(log |Y|) factor gap.
View details
Preview abstract
The phenomenon of benign overfitting is one of the key
mysteries uncovered by deep learning methodology: deep
neural networks seem to predict well, even with a perfect
fit to noisy training data. Motivated by this phenomenon,
we consider when a perfect fit to training data in
linear regression is compatible with accurate prediction.
We give a characterization of gaussian linear regression
problems for which the minimum norm interpolating prediction rule
has near-optimal prediction accuracy. The characterization is in
terms of two notions of the effective rank of the data covariance.
It shows that overparameterization is essential for
benign overfitting in this setting: the number of directions
in parameter space that are unimportant for prediction must
significantly exceed the sample size.
View details
Preview abstract
We prove bounds on the generalization error of convolutional networks.
The bounds are in terms of the training loss, the number of
parameters, the Lipschitz constant of the loss and the distance from
the weights to the initial weights. They are independent of the
number of pixels in the input, and the height and width of hidden
feature maps. We present experiments
with CIFAR-10 and a scaled-down variant, along with varying hyperparameters
of a deep convolutional network, comparing our bounds with practical
generalization gaps.
View details
Preview abstract
For proper distribution-free learning
of linear classifiers in d dimensions,
we prove a lower bound on the optimal
expected error of (d - o(1))/m,
improving on the best previous lower bound of
(d/sqrt(e) - o(1))/m and nearly matching a (d+1)/(m+1) upper
bound achieved by the maximum margin classifier.
View details
Preview abstract
We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training L-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training,
both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with a previous global convergence result of GD for training standard deep linear networks, our condition on the neural network width is sharper by a factor of kappa L, where kappa denotes the condition number of the covariance matrix of the training data. We further propose a modified identity input and output transformations, and show that a (d+k)-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where d and k are the input and output dimensions respectively.
View details
Preview abstract
We characterize the singular values of the linear transformation associated with
a standard 2D multi-channel convolutional layer, enabling their
efficient computation.
This characterization also leads to an algorithm for projecting a convolutional layer onto
an operator-norm ball.
We show that this is an effective regularizer;
for example, it improves the test error of a deep residual network
using batch normalization
on CIFAR-10 from 6.2% to 5.3%.
View details
Preview abstract
We analyze the joint probability distribution on the lengths of the vectors of hidden variables
in different layers of a fully connected deep network, when the weights and biases are chosen
randomly according to Gaussian distributions. We show that, if the activation function φ satisfies a minimal set of assumptions, satisfied by all activation functions that we know that are used in practice, then, as the width of the network gets large, the “length process” converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases, and the activation function φ. We also show that this convergence may fail for φ that violate our assumptions.
View details
Preview abstract
We study density estimation for classes of shift-invariant
distributions over R^d. A multidimensional distribution is
``shift-invariant'' if, roughly speaking, it is close in total
variation distance to a small shift of it in any direction. Shift-invariance relaxes smoothness
assumptions commonly used in non-parametric density estimation to
allow jump discontinuities. The different classes of distributions
that we consider correspond to different rates of tail decay.
For each such class we give an efficient algorithm that learns any
distribution in the class from independent samples with respect to
total variation distance. As a special case of our general result, we
show that multivariate log-concave distributions with a constant
number of variables can be learned in polynomial time, answering a
question of Diakonikolas et al. All of our results extend to a model
of noise-tolerant density estimation, in which the target distribution
to be learned is a (1-eps,eps) mixture of some unknown distribution in
the class with some other arbitrary and unknown distribution, and the
learning algorithm must output a hypothesis distribution with total
variation distance error O(eps) from the target distribution. We show
that our general results are close to best possible by proving a
simple information-theoretic lower bound on sample complexity even for
learning bounded distributions which are shift-invariant.
View details
Preview abstract
We show that any smooth bi-Lipschitz $h$ can be represented exactly
as a composition $h_m \circ ... \circ h_1$ of
functions $h_1,...,h_m$ that are close to the identity in
the sense that each $\left(h_i-\Id\right)$ is Lipschitz, and the
Lipschitz constant decreases inversely with the number $m$ of functions
composed.
This implies that $h$ can be represented to any accuracy by
a deep residual network whose nonlinear layers compute functions with
a small Lipschitz constant. Next, we consider a nonlinear regression
problem with a composition of near-identity nonlinear maps. We show
that any critical point of the quadratic criterion in this
near-identity region, with respect to Fr\'echet derivatives of
the respective $h_1,...,h_m$, must be a global minimizer. In contrast,
for residual networks with tanh activiation functions, we show
that there are critical points with respect to
gradient descent on the parameters at near-identity points
that are {\em not} minimizers, even
in the realizable case.
View details
Preview abstract
We analyze algorithms
that aim to approximate
a function
$f(x) = \Phi x$
mapping $\Re^d$ to $\Re^d$ using deep linear
neural networks, i.e.\ a function $h$ parameterized
by matrices $\Theta_1,...,\Theta_L$ defined by
$h(x) = \Theta_L \Theta_{L-1} ... \Theta_1 x$. We focus
on algorithms that learn through gradient descent on the population
quadratic loss in the case that the distribution over the inputs is
isotropic. We provide polynomial bounds on the number of
iterations for gradient descent to approximate the
optimum, in the case where
the initial hypothesis $\Theta_1 = ... = \Theta_L = I$
has loss bounded by a small enough
constant. On the other hand,
we show that gradient descent fails to converge for
$\Phi$ whose distance from the identity
is a larger constant, and we show that some forms
of regularization toward the identity in each layer do
not help.
If $\Phi$ is symmetric positive definite,
we show that an algorithm that initializes $\Theta_i = I$
learns an $\epsilon$-approximation of $f$
using a number of updates polynomial in $L$,
the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show
that if the $\Phi$ is symmetric and has a
negative eigenvalue, that all members of a class of algorithms
that perform gradient descent with identity initialization,
and optionally regularize toward the identity in each layer, fail to
converge. We analyze an algorithm for
the case that $\Phi$ satisfies $u^{\top} \Phi u > 0$ for all
$u$, but may not
be symmetric; this algorithm uses two regularizers
that maintain the invariants that
$u^{\top} \Theta_L \Theta_{L-1} ... \Theta_1 u > 0$ for all $u$,
and that ``balance'' $\Theta_1 ... \Theta_L$ so that they
have the same singular values.
View details
Preview abstract
We study the learnability of sums of independent integer random
variables given a bound on the size of the union of their
supports. For $\supportset \subset \Z_{+}$,
a {\em sum of independent random variables with collective
support $\supportset$} (called an $\supportset$-sum in this
paper) is a distribution $\bS = \bX_1 + \cdots + \bX_N$ where the
$\bX_i$'s are mutually independent (but not necessarily identically
distributed) integer random variables with $\cup_i \supp(\bX_i) \subseteq
\supportset.$
We give two main algorithmic results for learning such distributions:
\begin{enumerate}
\item
For the case $| \supportset | = 3$,
we give an algorithm for learning
$\supportset$-sums
to accuracy $\eps$ that uses $\poly(1/\eps)$ samples and runs in time $\poly(1/\eps)$, independent of $N$ and of the elements of $\supportset$.
\item For an arbitrary constant $k \geq 4$, if
$\supportset = \{ a_1,...,a_k\}$ with $0 \leq a_1 < ... < a_k$,
we give an algorithm that uses $\poly(1/\eps) \cdot \log \log a_k$ samples (independent of $N$) and runs in time
$\poly(1/\eps, \log a_k).$
\end{enumerate}
We prove an essentially matching lower bound: if $|\supportset| = 4$,
then any algorithm
must use
\[
\Omega(\log \log a_4)
\]
samples even for learning to constant accuracy. We also give similar-in-spirit (but quantitatively very different) algorithmic results, and essentially matching lower bounds, for the case in which $\supportset$ is not known to the learner.
Our learning algorithms employ new limit theorems which may be of independent interest. Our lower bounds rely on equidistribution type results from number theory. Our algorithms and lower bounds together settle the question of how the sample complexity of learning
sums of independent integer random variables
scales with the elements in the union of their supports,
both in the known-support and unknown-support settings. Finally, all our algorithms easily extend to the ``semi-agnostic''
learning model, in which training data is generated from a distribution that
is only $c \epsilon$-close to some
$\supportset$-sum for a constant $c>0$.
View details
Preview abstract
We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results
expose surprising differences between the behavior of dropout and more traditional regularizers like
weight decay. For example, on some simple data sets dropout training produces negative weights
even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that
dropout discourages co-adaptation of weights. We also show that the dropout penalty can grow
exponentially in the depth of the network while the weight-decay penalty remains essentially linear,
and that dropout is insensitive to various re-scalings of the input features, outputs, and network
weights. This last insensitivity implies that there are no isolated local minima of the dropout training
criterion. Our work uncovers new properties of dropout,
View details
No Results Found