Praneeth Netrapalli

Praneeth Netrapalli

My research interests are broadly in designing reliable and robust machine learning (ML) algorithms, representation learning, black-box optimization, time series modeling, quantum optimization and minimax/game theoretic optimization. I am also extremely interested in applying ML techniques to help solve problems from Sciences as well as enable positive social outcomes.
Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Deep Neural Networks (DNNs) are known to be brittle to even minor distribution shifts compared to the training distribution . Simplicity Bias (SB) of DNNs – bias towards learning a small number of simplest features – has been demonstrated to be a key reason for this brittleness. Prior works have shown that the effect of Simplicity Bias is extreme – even when the features learned are diverse, training the classification head again selects only few of the simplest features, leading to similarly brittle models. In this work, we introduce Feature Reconstruction Regularizer (FRR) in the linear classification head, with the aim of reducing Simplicity Bias, thereby improving Out-Of-Distribution (OOD) robustness. The proposed regularizer when used during linear layer training, termed as FRR-L, enforces that the features can be reconstructed back from the logit layer, ensuring that diverse features participate in the classification task. We further propose to finetune the full network by freezing the weights of the linear layer trained using FRR-L. This approach, termed as FRR-FLFT or Fixed Linear FineTuning, improves the quality of the learned features, making them more suitable for the classification task. Using this simple solution, we demonstrate up to 12% gain in accuracy on the recently introduced synthetic datasets with extreme distribution shifts. Moreover, on the standard OOD benchmarks recommended on DomainBed, our technique can provide up to 5% gains over the existing SOTA methods . View details
    Preview abstract In this work, we consider the problem of collaborative multi-user reinforcement learning. In this setting there are multiple users with the same state-action space and transition probabilities but with different rewards. Under the assumption that the reward matrix of the N users has a low-rank structure -- a standard and practically successful assumption in the offline collaborative filtering setting-- the question is can we design algorithms with significantly lower sample complexity compared to the ones that learn the MDP individually for each user. Our main contribution is an algorithm which explores rewards collaboratively with N user-specific MDPs and can learn rewards efficiently in two key settings: tabular MDPs and linear MDPs. When N is large and the rank is constant, the sample complexity per MDP depends logarithmically over the size of the state-space, which represents an exponential reduction (in the state-space size) when compared to the standard ``non-collaborative'' algorithms. View details
    Preview abstract We consider regression where the noise distribution depends on the covariates (i.e, heteroscedastic noise), which captures popular settings such as linear regression with multiplicative noise occuring due to covariate uncertainty. In particular we consider linear regression where the noise variance is an unknown rank-1 quadratic function of the covariates. While an application of least squares regression can achieve an error rate of $\nicefrac{d}{n}$, this ignores the fact that the magnitude of the noise can be very small for certain values of the covariates, which can aid faster learning. Our algorithm \ouralg~runs a parameter estimation algorithm and a noise distribution model learning algorithm are run alternately, using each other's outputs to iteratively obtain better estimates of the parameter and the noise distribution model respectively. This achieves an error rate of $\nicefrac{1}{n} + \nicefrac{d^2}{n^2}$, which we show is minimax optimal up to logarithmic factors. A sub-routine for \ouralg~performs phase estimation with multiplicative noise maybe of independent interest. View details
    Preview abstract Q-learning is a popular Reinforcement Learning (RL) algorithm which is widely used in practice with function approximation \citep{mnih2015human}. In contrast, existing theoretical results are pessimistic about Q-learning. For example, \citep{baird1995residual} shows that Q-learning does not converge even with linear function approximation for linear MDPs. Furthermore, even for tabular MDPs with synchronous updates, Q-learning was shown to have sub-optimal sample complexity \citep{li2021q,azar2013minimax}. The goal of this work is to bridge the gap between practical success of Q-learning and the relatively pessimistic theoretical results. The starting point of our work is the observation that in practice, Q-learning is used with two important modifications: (i) training with two networks, called online network and target network simultaneously (online target learning, or OTL) , and (ii) experience replay (ER) \citep{mnih2015human}. While they have been observed to play a significant role in the practical success of Q-learning, a thorough theoretical understanding of how these two modifications improve the convergence behavior of Q-learning has been missing in literature. By carefully combining the Q-learning with OTL and \emph{reverse} experience replay (RER) (a form of experience replay), we present novel methods \qrex~and \qrexdr~(\qrex + data reuse). We show that \qrex~efficiently finds the optimal policy for linear MDPs and provide non-asymptotic bounds on sample complexity. Furthermore, we demonstrate that \qrexdr~in fact achieves near optimal sample complexity in the tabular setting, improving upon the existing results for vanilla Q-learning. View details
    Minimax optimization with Smooth Algorithmic Adversaries
    Chi Jin
    Lillian Ratliff
    Tanner Fiez
    International Conference on Learning Representations (ICLR) 2022
    Preview abstract We consider two-player zero sum games or minimax optimization $\min_x \max_y f(x,y)$ in the Stackelberg/sequential setting where the $x$-player goes first and $y$-player goes second, and where $f$ is nonconvex in $x$ and nonconcave in $y$. While such problems are encountered in several popular machine learning paradigms such as in training generative adversarial networks (GANs) and adversarially robust models, there is very little theoretical understanding regarding efficiently computable notions of optimality in this setting. This paper sprouts from the intuition that since nonconcave maximization is NP-hard in general, the $x$-player should aim to play against the \emph{algorithm} employed by $y$-player for maximization, rather than against full maximization over $y$. The key \emph{conceptual} contribution of this paper is that, under the assumption that the $x$-player is aware of the algorithm employed by $y$-player for maximization, we can reformulate the given problem as a nonconvex-\textbf{concave} minimax optimization problem. The key \emph{technical} contributions of this paper are: 1. If the $y$-player employs some popular algorithms such as (stochastic) gradient ascent and Nesterov's accelerated gradient ascent, the resulting nonconvex-\emph{concave} minimax optimization problem is \emph{smooth}, and 2. We leverage recent results for smooth, nonconvex-concave minimax optimization problems to propose new algorithms that find an appropriate local optimum for these problems. Experiments confirm our theoretical findings and demonstrate that the proposed approach works well in practice. View details
    Preview abstract We study the problem of learning vector valued non-linear dynamical systems from a single trajectory of {\em dependent or correlated} points. We assume a known link function $\phi : \mathbb{R} \to \mathbb{R}$ that satisfy a certain {\em expansivity property}. While the problem is well-studied in the linear case with strong learning guarantees even for non-mixing systems, the results in non-linear case hold only for mixing systems and even then the error rates are significantly sub-optimal. In this work, we bridge this gap in a variety of settings: a) we provide first optimal offline algorithm that can learn non-linear dynamical systems without mixing assumption, b) in the much harder one-pass, streaming setting we study a SGD with Reverse Experience Replay ($\sgdber$) method, and demonstrate that for mixing systems, it achieves nearly optimal performance even for heavy-tailed noise, c) we justify the expansivity assumption by showing that when the link function is ReLU --- a non-expansive but easy to learn link function with i.i.d. samples --- any method would require exponentially many samples (with respect to dimension $d$) from the dynamical system. We then compare various algorithms via. simulations and demonstrate that a naive application of SGD can be very sub-optimal. Indeed, our work demonstrates that learning with dependent data efficiently requires specialized algorithm design which is based on the knowledge the dependency structure present. View details
    Preview abstract We consider the problem of estimating a stochastic linear time-invariant (LTI) dynamical system from a single trajectory via streaming algorithms. The problem is equivalent to estimating the parameters of vector auto-regressive ($\var$) models encountered in time series analysis (\cite{hamilton2020time}). A recent sequence of papers \citep{faradonbeh2018finite,simchowitz2018learning,sarkar2019near} show that ordinary least squares (OLS) regression can be used to provide optimal finite time estimator for the problem. However, such techniques apply for {\em offline} setting where the optimal solution of OLS is available {\em apriori}. But, in many problems of interest as encountered in reinforcement learning (RL), it is important to estimate the parameters on the go using gradient oracle. This task is challenging since standard methods like SGD might not perform well when using stochastic gradients from correlated data points \citep{gyorfi1996averaged,nagaraj2020least}. In this work, we propose a novel algorithm, SGD with Reverse Experience Replay ($\sgdber$), that is inspired by the experience replay (ER) technique popular in the RL literature \citep{lin1992self}. $\sgdber$ divides data into small buffers and runs SGD backwards on the data stored in the individual buffers. We show that this algorithm exactly deconstructs the dependency structure and obtains information theoretically optimal guarantees for both parameter error and prediction error for standard problem settings. Thus, we provide the first -- to the best of our knowledge -- optimal SGD-style algorithm for the classical problem of linear system identification aka $\var$ model estimation. Our work demonstrates that knowledge of dependency structure can aid us in designing algorithms which can deconstruct the dependencies between samples optimally in an online fashion. View details
    Preview abstract Meta-learning algorithms synthesizes and leverages the knowledge from a given set of tasks to rapidly learn new tasks using very little data. While methods like ANIL \cite{raghu2019rapid} have been demonstrated to be effective in practical meta-learning problems, their statistical and computational properties are ill-understood. Recent theoretical studies of meta-learning problem in a simple linear/non-linear regression setting still do not explain practical success of the meta-learning approach. For example, existing results either guarantee highly suboptimal estimation errors \cite{tripuraneni2020provable} or require relatively large number of samples per task \cite{}--$\Omega(d)$ samples where $d$ is the data dimensionality--which runs counter to practical settings. Additionally, the prescribed algorithms are inefficient and typically are not used in practice. %to achieve these sample complexity are high inefficient. Similar to the existing studies, we consider the meta-learning problem in linear regression setting, where the regressors lie in a low-dimensional subspace \cite{tripuraneni2020provable}. We analyze two methods -- alternating minimization (MLLAM) and alternating gradient-descent minimization (MLLAM-GD) -- inspired by the popular ANIL~\cite{raghu2019rapid} method. For a constant subspace dimension both these methods obtain nearly-optimal estimation error, despite requiring only $\Omega(\mathrm{polylog}\,d)$ samples per task, which is similar to practical settings where each task has a small number of samples. But our analyses for the methods require the samples per task to grow logarithmically with number of tasks. We remedy this in the low-noise regime by augmenting the algorithms with a novel task subset selection scheme, which guarantees nearly optimal error rates even if the number of samples per task is constant with respect to (wrt) the number of tasks. View details