Yoel Drori

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Recommender systems play an important role in YouTube, one of the largest online video platforms across the world. In this paper, we focus on a real-world multitask ranking model for YouTube recommendations. While most of the recommendation research is dedicated to designing better models to improve user engagement and satisfaction, we found that research on stabilizing the training for such models is severely under-explored. As the recommendation models become larger and more sophisticated, they are more vulnerable to training instability issues, \emph{i.e.}, the loss diverges (instead of converging) which can make the model unusable, wasting significant resources and blocking model iterations. In this paper, we share our understanding and best practices we learned for improving the training stability of a multitask ranking model used in production. We show some properties of the model that lead to unstable training and speculate on the cause. Furthermore, we propose an effective solution to improve training stability based on our observations of training dynamics when model training starts to become unstable. Our experiments on a proprietary dataset show the effectiveness of the proposed method over several commonly used baseline methods. View details
    An optimal gradient method for smooth strongly convex minimization
    Adrien B. Taylor
    Mathematical Programming, 199 (2023), 557–594
    Preview abstract We present an optimal gradient method for smooth (possibly strongly) convex optimization. The method is optimal in the sense that its worst-case bound exactly match the lower bound on the oracle complexity for the class of problems, meaning that no black-box first-order method can have a better worst-case guarantees without further assumptions on the class of problems at hand. The method is in some sense a generalization of the Optimized Gradient Method of~\citet{kim2015optimized}, and asymptotically corresponds to the Triple Momentum Method~\citep{van2017fastest}, in the presence of strong convexity. Furthermore, the method is numerically stable to arbitrarily large condition numbers and admits a conceptually very simple proof, which involves a Lyapunov argument and a sum of two inequalities. Finally, we provide a numerical recipe for obtaining the algorithmic parameters of the new method, using semidefinite programming, and illustrate that it can be used for developing other methods as well. View details
    On the oracle complexity of smooth strongly convex minimization
    Adrien B. Taylor
    Journal of Complexity, 68 (2021), pp. 101590 (to appear)
    Preview abstract We construct a family of functions suitable for establishing lower bounds on the oracle complexity of first-order minimization of smooth strongly-convex functions. Based on this construction, we derive new lower bounds on the complexity of strongly-convex minimization under various accuracy criteria. The new bounds match the known upper bounds up to a constant factor, and when the accuracy of a solution is measured by its distance to the solution set, the new lower bound \emph{exactly} matches the upper bound attained by the recent Doubly Optimal Method, thereby establishing the exact complexity for this class of problems. View details
    Preview abstract We consider stochastic optimization with delayed gradients where, at each time step~$t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( \sigma^2/\epsilon^4 + \tau/\epsilon^2 )$ steps for finding an $\epsilon$-stationary point $x$, where $\tau$ is the \emph{average} delay $\smash{\frac{1}{T}\sum_{t=1}^T d_t}$ and $\sigma^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient decent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, that can be significantly larger than the average delay especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed. View details
    On the Properties of Convex Functions over Open Sets
    Journal of Convex Analysis, 27 (2020), pp. 1303-1314
    Preview abstract We consider the class of smooth convex functions defined over an open convex set. We show that this class is essentially different than the class of smooth convex functions defined over the entire linear space by exhibiting a function that belongs to the former class but cannot be extended to the entire linear space while keeping its properties. We proceed by deriving new properties of the class under consideration, including an inequality that is strictly stronger than the classical Descent Lemma. View details
    Preview abstract We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. We provide several results, implying that the classical $\mathcal{O}(\epsilon^{-4})$ upper bound (for making the average gradient norm less than $\epsilon$) cannot be improved upon, unless a combination of additional assumptions is made. Notably, this holds even if we limit ourselves to convex quadratic functions. We also show that for nonconvex functions, the feasibility of minimizing gradients with SGD is surprisingly sensitive to the choice of optimality criteria. View details
    Preview abstract We study a variant of domain adaptation for named-entity recognition where multiple, heterogeneously tagged training sets are available. Furthermore, the test tag-set is not identical to any individual training tag-set. Yet, the relations between all tags are provided in a tag hierarchy, covering the test tags as a combination of training tags. This setting occurs when various datasets are created using different annotation schemes. This is also the case of extending a tag-set with a new tag by annotating only the new tag in a new dataset. We propose to use the given tag hierarchy to jointly learn a neural network that shares its tagging layer among all tag-sets. We compare this model to combining independent models and to a model based on the multitasking approach. Our experiments show the benefit of the tag-hierarchy model, especially when facing non-trivial consolidation of tag-sets. View details
    Preview abstract We describe a novel constructive technique for devising efficient first-order methods for a wide range of large-scale convex minimization settings, including smooth, nonsmooth, and strongly convex minimization. This novel design technique takes a method performing a series of subspace-searches and constructs a fixed-step method that has the same worst-case guarantees. We show that this technique yields optimal algorithms in the smooth and nonsmooth cases, and use it to devise a new method for smooth strongly convex minimization with an improved convergence rate as compared to Nesterov's celebrated fast gradient methods. View details
    Preview abstract We obtain a new lower bound on the information-based complexity of first-order minimization of smooth and convex functions. We show that the bound matches the worst-case performance of the recently introduced Optimized Gradient Method (Drori and Teboulle, 2013; Kim and Fessler, 2015), thereby establishing that the bound is tight and can be realized by an efficient algorithm. The proof is based on a novel construction technique of smooth and convex functions. View details