Vineet Gupta

Vineet Gupta

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract In this work, we study the large-scale pretraining of BERT-Large~\citep{devlin2018bert} with differentially private SGD (DP-SGD). We show that combined with a careful implementation, scaling up the batch size to millions (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance the training efficiency by using an increasing batch size schedule. Our implementation builds on the recent work of \citet{subramani20}, who demonstrated that the overhead of a DP-SGD step is minimized with effective use of JAX \cite{jax2018github, frostig2018compiling} primitives in conjunction with the XLA compiler \cite{xladocs}. Our implementation achieves a masked language model accuracy of 60.5\% at a batch size of 2M, for $\eps = 5$, which is a reasonable privacy setting. To put this number in perspective, non-private BERT models achieve an accuracy of $\sim$70\%. View details
    Preview abstract Adaptive gradient-based optimizers such as AdaGrad and Adam are among the methods of choice in machine learning. These methods maintain second-moment statistics of each entry in the gradient, thus doubling the optimizer's memory footprint. In behemoth-size applications, this memory overhead restricts the size of the model being used as well as the number of examples in a mini-batch. We describe a novel, simple, and flexible adaptive optimization method with a sublinear memory cost that retains the benefits of per-parameter adaptivity while allowing for larger models and mini-batches. We give convergence guarantees for our method and demonstrate its effectiveness in training very large deep architectures. View details
    Preview abstract We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2% to 5.3%. View details
    Preview abstract Preconditioned gradient methods are among the most general and powerful toolsin optimization. However, preconditioning requires storing and manipulatingprohibitively large matrices. We describe and analyze a new structure-awarepreconditioning algorithm, called Shampoo, for stochastic optimization overtensor spaces. Shampoo maintains a set of preconditioning matrices, each ofwhich operates on a single dimension, contracting over the remainingdimensions. We establish convergence guarantees in the stochastic convexsetting, the proof of which builds upon matrix trace inequalities. Ourexperiments with state-of-the-art deep learning models show that Shampoo iscapable of converging considerably faster than commonly used optimizers.Surprisingly, although it involves a more complex update rule, Shampoo's runtime per step is comparable in practice to that of simple gradientmethods such as SGD, AdaGrad, and Adam. View details
    Preview abstract We describe a framework for deriving and analyzing online optimization algorithms that incorporate adaptive, data-dependent regularization, also termed preconditioning. Such algorithms have been proven useful in stochastic optimization by reshaping the gradients according to the geometry of the data. Our framework captures and unifies much of the existing literature on adaptive online methods, including the AdaGrad and Online Newton Step algorithms as well as their diagonal versions. As a result, we obtain new convergence proofs for these algorithms that are substantially simpler than previous analyses. Our framework also exposes the rationale for the different preconditioned updates used in common stochastic optimization methods. View details
    Preview abstract We describe and analyze a simple random feature scheme (RFS) from prescribed compositional kernels. The compositional kernels we use are inspired by the structure of convolutional neural networks and kernels. The resulting scheme yields sparse and efficiently computable features. Each random feature can be represented as an algebraic expression over a small number of (random) paths in a composition tree. Thus, compositional random features can be stored compactly. The discrete nature of the generation process enables de-duplication of repeated features, further compacting the representation and increasing the diversity of the embeddings. Our approach complements and can be combined with previous random feature schemes. View details
    Preview abstract We describe a unified adaptive optimization algorithm, which uses a potential function to construct a series of preconditioning matrices that shape the gradient based directions to update the learning parameters. We prove a regret bound for this, and show that various well known algorithms, such as AdaGrad and Online Newton Stepping can be obtained from this by choosing appropriate potential functions. Besides the uniform derivation of regret bounds for all these algorithms, this approach also enables us to vary the potential function and come up with new algorithms. View details
    Preview abstract Messages often refer to entities such as people, places and events. Correct identification of the intended reference is an essential part of communication. Lack of shared unique names often complicates entity reference. Shared knowledge can be used to construct uniquely identifying descriptive references for entities with ambiguous names. We introduce a mathematical model for `Reference by Description', derive results on the conditions under which, with high probability, programs can construct unambiguous references to most entities in the domain of discourse and provide empirical validation of these results. View details
    Approximate reasoning for real-time probabilistic processes
    Radha Jagadeesan
    Prakash Panangaden
    Logical Methods in Computer Science, 2 (2006)
    Preview
    Metrics for labelled Markov processes
    Josee Desharnais
    Radha Jagadeesan
    Prakash Panangaden
    Theor. Comput. Sci., 318 (2004), pp. 323-354
    Preview