Phil Long

Authored Publications
    Oracle lower bounds for sampling algorithms
    Niladri Chatterji
    Peter Bartlett
    Bernoulli, 28(2)(2022), pp. 1074-1092
     Abstract: We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information-theoretic lower bound on the number of stochastic gradient queries of the log density that are needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information-theoretic limit for all of these algorithms. We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least $\epsilon$ away from the target in total variation distance if the number of gradient queries is less than $\Omega(\sigma^2 d/\epsilon^2)$, where $\sigma^2$ is the variance of the stochastic gradient. Our lower bound follows by combining the idea of Le Cam deficiency, routinely used in the comparison of statistical experiments, with standard information-theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge, our results provide the first nontrivial dimension-dependent lower bound for this problem.
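
The lower bound applies to any method that interacts with the target only through noisy gradients of the log density. As one concrete member of that query class, here is a minimal sketch of stochastic gradient Langevin dynamics; the standard-Gaussian target and the additive-noise gradient oracle are illustrative assumptions, not the construction from the paper.

```python
import numpy as np

# A minimal sketch of one algorithm in the query class the lower bound covers:
# stochastic gradient Langevin dynamics (SGLD). The Gaussian target and the
# noise model of the gradient oracle are illustrative assumptions.

def noisy_grad_log_density(x, sigma, rng):
    # Target p = N(0, I), so grad log p(x) = -x; the oracle returns this
    # corrupted by zero-mean noise with variance sigma^2 per coordinate.
    return -x + sigma * rng.standard_normal(x.shape)

def sgld(d=10, steps=5000, eta=1e-2, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(steps):  # each iteration spends one stochastic gradient query
        g = noisy_grad_log_density(x, sigma, rng)
        x = x + eta * g + np.sqrt(2 * eta) * rng.standard_normal(d)
    return x

print(sgld()[:3])  # one (approximate) sample from the target
```
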
     Abstract: van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense. In this note we study the accuracy of binary classifiers obtained by minimizing the unhinged loss, and observe that even for simple linearly separable data distributions, minimizing the unhinged loss may only yield a binary classifier with accuracy no better than random guessing.
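
Because the unhinged loss 1 - ys is linear in the score s, its minimizer over norm-bounded linear classifiers is the normalized class-centroid direction. The sketch below is a construction in the spirit of this observation (not the note's own example, and the feature scales are arbitrary choices): the data are linearly separable, yet the centroid direction typically classifies at roughly chance level.

```python
import numpy as np

# Unhinged loss (van Rooyen et al.): ell(y, s) = 1 - y * s. For linear scores
# s = <w, x> with ||w|| <= 1, the empirical loss is 1 - <w, mean(y_i x_i)>,
# so the minimizer is the normalized centroid direction.

def unhinged_minimizer(X, y):
    z = (y[:, None] * X).mean(axis=0)
    return z / np.linalg.norm(z)

# Separable toy data (not the note's construction): the first feature separates
# the classes but has a tiny scale, while the second is large, label-independent
# noise that typically dominates the centroid direction.
rng = np.random.default_rng(0)
n = 10_000
y = rng.choice([-1.0, 1.0], size=n)
X = np.column_stack([0.01 * y, rng.normal(scale=5.0, size=n)])

w = unhinged_minimizer(X, y)
accuracy = np.mean(np.sign(X @ w) == y)   # typically close to 1/2
print(w, accuracy)
```
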
     Abstract: We prove that gradient descent applied to fixed-width deep networks with the logistic loss converges, and we prove bounds on the rate of convergence. Our analysis applies to smoothed approximations of the ReLU proposed in previous applied work, such as Swish and the Huberized ReLU. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
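
For concreteness, here are common parameterizations of the two smoothed ReLU approximations named above; the exact forms and constants used in the paper may differ.

```python
import numpy as np

# Common forms of the smoothed ReLU approximations mentioned in the abstract
# (assumed parameterizations, possibly not identical to the paper's).

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); smooth everywhere, approaches ReLU as beta grows.
    return x / (1.0 + np.exp(-beta * x))

def huberized_relu(x, delta=1.0):
    # Zero for x <= 0, quadratic on [0, delta], linear beyond: continuously
    # differentiable, and equal to ReLU(x) - delta/2 for x >= delta.
    return np.where(x <= 0, 0.0,
                    np.where(x <= delta, x ** 2 / (2 * delta), x - delta / 2))

xs = np.linspace(-2, 2, 9)
print(swish(xs))
print(huberized_relu(xs))
```
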
    Superlinear integrality gaps for the minimum majority problem
     SIAM Journal on Discrete Mathematics, 35(4)(2021), pp. 3004-3016
     Abstract: The minimum majority problem is as follows: given an m-by-n matrix A with {-1,1} entries, minimize x_1 + ... + x_n subject to Ax > 0, where x = (x_1,...,x_n) and x_1,...,x_n are non-negative integers. An approximation algorithm is known that finds a solution with value O(opt^2 log m) in poly(m,n,opt) time, obtained by rounding a linear programming relaxation. We establish integrality gaps that limit the prospects for improving on this guarantee through improved rounding and/or the application of Lovász-Schrijver or Sherali-Adams tightenings of the relaxation. These gaps show that these methods cannot improve on the O(opt^2 log m) guarantee by more than a constant factor in polynomial time.
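
The O(opt^2 log m) guarantee mentioned above comes from rounding an LP relaxation. The sketch below shows one natural relax-and-round scheme of that flavor, solving the relaxation and sampling columns in proportion to the fractional solution; the precise rounding procedure and sample size analyzed in the literature may differ.

```python
import numpy as np
from scipy.optimize import linprog

# One natural "relax and round" scheme for minimum majority (a sketch of the
# general approach, not necessarily the exact algorithm from the literature).
# A has {-1, 1} entries; we want nonnegative integers x with A x > 0 and small sum(x).

def relax_and_round(A, num_samples, seed=0):
    m, n = A.shape
    # LP relaxation: minimize sum(x) subject to A x >= 1, x >= 0.
    res = linprog(c=np.ones(n), A_ub=-A, b_ub=-np.ones(m), bounds=(0, None))
    p = res.x / res.x.sum()                       # sample columns in proportion
    rng = np.random.default_rng(seed)
    draws = rng.choice(n, size=num_samples, p=p)  # num_samples ~ O(opt^2 log m)
    x = np.bincount(draws, minlength=n)           # integer counts
    return x, bool(np.all(A @ x > 0))

# Tiny example: every row is satisfied by putting enough weight on column 0.
A = np.array([[1, -1, 1],
              [1, 1, -1],
              [1, -1, -1]])
x, feasible = relax_and_round(A, num_samples=50)
print(x, feasible)
```
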
     Abstract: We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient overparameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.
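
A small simulation in the spirit of this setting (my construction, with arbitrary sizes and noise level, not the paper's): Gaussian class conditionals in a heavily overparameterized space, random label flips standing in for the adversary, and a near-maximum-margin linear classifier fit to the noisy labels and evaluated on clean test data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d, flip_frac = 100, 2000, 0.1
mu = np.zeros(d)
mu[0] = 3.0                                   # class mean along one coordinate

# Clean labels and Gaussian class-conditional features, then random label flips
# standing in for the adversarial noise of the analysis.
y_clean = rng.choice([-1, 1], size=n)
X = y_clean[:, None] * mu + rng.standard_normal((n, d))
y_noisy = y_clean.copy()
flipped = rng.choice(n, size=int(flip_frac * n), replace=False)
y_noisy[flipped] *= -1

# With d >> n the noisy data remains linearly separable, and a very large C
# makes the soft-margin SVM approximate the maximum margin classifier.
clf = SVC(kernel="linear", C=1e6).fit(X, y_noisy)

# Estimate the population risk with respect to the clean labels.
m_test = 5000
y_test = rng.choice([-1, 1], size=m_test)
X_test = y_test[:, None] * mu + rng.standard_normal((m_test, d))
print("clean test accuracy:", clf.score(X_test, y_test))
```
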
     Abstract: We study the training of finite-width two-layer (smoothed) ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfy certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies. (In contrast, all past analyses of fixed-width networks that we know of do not guarantee that the training loss goes to zero.)
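
To make the objects concrete, here is a tiny gradient descent run on a fixed-width two-layer network with a smoothed ReLU and the logistic loss; the width, step size, activation, and clustered data are illustrative choices, not the conditions from the paper.

```python
import torch

torch.manual_seed(0)
n, d, width = 200, 10, 64

# Two well-separated clusters, one per class (a stand-in for the paper's
# cluster-and-separation conditions).
y = (torch.randint(0, 2, (n,)) * 2 - 1).float()
X = y[:, None] * torch.ones(d) + 0.1 * torch.randn(n, d)

# Fixed-width two-layer network with a Swish-style smoothed ReLU.
W = (torch.randn(width, d) / d ** 0.5).requires_grad_()
a = (torch.randn(width) / width ** 0.5).requires_grad_()

def smoothed_relu(z):
    return z * torch.sigmoid(4.0 * z)

def logistic_loss():
    scores = smoothed_relu(X @ W.T) @ a
    return torch.nn.functional.softplus(-y * scores).mean()

lr = 0.1
for step in range(500):                 # plain full-batch gradient descent
    loss = logistic_loss()
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        a -= lr * a.grad
        W.grad.zero_()
        a.grad.zero_()

print("final training loss:", logistic_loss().item())
```
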
     Abstract: We consider bounds on the generalization ability of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose: it can be bounded below by a constant even when the true excess risk goes to zero.
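
The learning rule in question is the least-norm (minimum-norm) interpolant, which in the overparameterized regime can be written with a pseudoinverse; the data-generating model below is an arbitrary illustration.

```python
import numpy as np

# The minimum-norm (least-norm) linear interpolant: the smallest-norm w with
# X w = y, available when d > n. The data distribution here is an arbitrary
# illustrative choice.

rng = np.random.default_rng(0)
n, d = 50, 500                      # d > n, so exact interpolation is possible
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[0] = 1.0
y = X @ w_star + 0.1 * rng.standard_normal(n)

w_hat = np.linalg.pinv(X) @ y       # minimum-norm solution of X w = y
print("training residual:", np.max(np.abs(X @ w_hat - y)))   # ~0: it interpolates
print("norms:", np.linalg.norm(w_hat), np.linalg.norm(w_star))
```
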
    Benign Overfitting in Linear Regression
    Alexander Tsigler
    Gabor Lugosi
    Peter Bartlett
     PNAS, 117(48)(2020), pp. 30063-30070
     Abstract: The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of Gaussian linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
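
The characterization is phrased in terms of two effective ranks of the covariance spectrum. The quantities computed below are the standard tail effective ranks used in this line of work; their exact definitions here are my assumption rather than a quotation of the paper's.

```python
import numpy as np

# Two tail effective ranks of a covariance spectrum (assumed definitions):
#   r_k = (sum_{i>k} lam_i) / lam_{k+1}
#   R_k = (sum_{i>k} lam_i)^2 / sum_{i>k} lam_i^2

def effective_ranks(lam, k):
    lam = np.sort(np.asarray(lam, dtype=float))[::-1]
    tail = lam[k:]
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / np.square(tail).sum()
    return r_k, R_k

# Example: a few large directions plus a long, flat tail; long tails of this
# kind are the regime in which interpolation can be benign.
lam = np.concatenate([np.array([100.0, 50.0, 10.0]), np.full(2000, 0.01)])
print(effective_ranks(lam, k=3))
```
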
     Abstract: For proper distribution-free learning of linear classifiers in d dimensions, we prove a lower bound on the optimal expected error of (d - o(1))/m, improving on the best previous lower bound of (d/sqrt(e) - o(1))/m and nearly matching a (d+1)/(m+1) upper bound achieved by the maximum margin classifier.
     Abstract: We study two generalizations of the mistake bound model to online multiclass classification. In the standard model, the learner receives the correct classification at the end of each round, and in the bandit model, the learner only finds out whether its prediction was correct or not. For a set F of multiclass classifiers, let opt_s(F) and opt_b(F) be the optimal bounds for learning F according to these two models. We show that an opt_b(F) < (1 + o(1)) (|Y| ln |Y|) opt_s(F) bound is the best possible up to the leading constant, closing a Theta(log |Y|) factor gap.
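
The two models differ only in what is revealed to the learner after each prediction. The sketch below spells out the two interaction loops with a deliberately trivial placeholder learner and target concept; it illustrates the protocols, not any particular learning algorithm.

```python
import random

LABELS = ["a", "b", "c"]

def run(feedback_model, rounds=5, seed=0):
    # feedback_model is "standard" (correct label revealed) or "bandit"
    # (only whether the prediction was correct). The target concept and the
    # learner here are trivial placeholders.
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(rounds):
        x = rng.random()                       # instance
        y_true = LABELS[int(3 * x) % 3]        # stand-in target concept
        y_hat = rng.choice(LABELS)             # stand-in learner's prediction
        mistakes += (y_hat != y_true)
        if feedback_model == "standard":
            feedback = y_true                  # full information
        else:
            feedback = (y_hat == y_true)       # bandit: right/wrong only
        # A real learner would update its state using `feedback` here.
    return mistakes

print(run("standard"), run("bandit"))
```
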