Krzysztof Choromanski

Krzysztof Choromanski works on several aspects of machine learning and robotics. His current research interests include reinforcement learning and randomized methods such as nonlinear embeddings based on structured random feature maps and quasi-Monte-Carlo methods. He has also worked on online nonparametric clustering for massive high-dimensional streams. Krzysztof is an author of several nonlinear embedding mechanisms based on structured matrices that can be used to speed up neural network computations, kernel methods applying random feature maps, convex optimization solvers, and quantization and soft clustering methods, as well as several LSH-based algorithms. With his background in structural graph theory, he is also interested in applying graph theory and other combinatorial methods in machine learning.
Authored Publications
    Robotic Table Tennis: A Case Study into a High Speed Learning System
    Jon Abelian
    Saminda Abeyruwan
    Michael Ahn
    Justin Boyd
    Erwin Johan Coumans
    Omar Escareno
    Wenbo Gao
    Navdeep Jaitly
    Juhana Kangaspunta
    Satoshi Kataoka
    Gus Kouretas
    Yuheng Kuang
    Corey Lynch
    Thinh Nguyen
    Ken Oslund
    Barney J. Reed
    Anish Shankar
    Avi Singh
    Grace Vesom
    Peng Xu
    Robotics: Science and Systems (2023)
    Preview abstract We present a deep-dive into a learning robotic system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized and novel perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description including numerous design decisions that are typically not widely disseminated, with a collection of ablation studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, and sensitivity to policy hyper-parameters and choice of action space. A video demonstrating the components of our system and details of experimental results is included in the supplementary material. View details
    Preview abstract We address a benchmark task in agile robotics: catching objects thrown at high-speed. This is a challenging task that involves tracking, intercepting, and cradling a thrown object with access only to visual observations of the object and the proprioceptive state of the robot, all within a fraction of a second. We present the relative merits of two fundamentally different solution strategies: (i) Model Predictive Control using accelerated constrained trajectory optimization, and (ii) Reinforcement Learning using zeroth-order optimization. We provide insights into various performance tradeoffs including sample efficiency, sim-to-real transfer, robustness to distribution shifts, and wholebody multimodality via extensive on-hardware experiments. We conclude with proposals on fusing “classical” and “learning-based” techniques for agile robot control. Videos of our experiments may be found here: https://sites.google.com/view/agile-catching. View details
    Chefs’ Random Tables: Non-Trigonometric Random Features
    Valerii Likhosherstov
    Avinava Dubey
    Frederick Liu
    Adrian Weller
    NeurIPS 2022 (2022) (to appear)
    Preview abstract We present chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive -- a critical requirement for prominent applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the FAVOR++ mechanism, is to our knowledge the first RF method for unbiased softmax kernel estimation with positive & bounded RFs, resulting in strong concentration results characterized by exponentially small tails, and much lower variance. As we show, orthogonal random features applied in FAVOR++ provide additional variance reduction for any dimensionality d (not only asymptotically for sufficiently large d, as for RKS). We exhaustively test CRTs and FAVOR++ on tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new SOTA for low-rank text Transformers, while providing linear space & time complexity. View details
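    The positive-feature requirement mentioned above can be illustrated with the earlier FAVOR+-style positive random features that CRTs refine. The NumPy sketch below (illustrative names, not the FAVOR++ estimator itself) shows an unbiased, strictly positive feature map for the softmax kernel exp(x·y).

```python
import numpy as np

def positive_random_features(x, W):
    """FAVOR+-style positive features: E[phi(x) . phi(y)] = exp(x . y).

    x: (d,) input vector, W: (m, d) matrix with i.i.d. N(0, 1) rows.
    """
    m = W.shape[0]
    return np.exp(W @ x - np.dot(x, x) / 2.0) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 20000
x, y = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)
W = rng.standard_normal((m, d))

approx = positive_random_features(x, W) @ positive_random_features(y, W)
exact = np.exp(np.dot(x, y))
print(approx, exact)  # the two values should be close for large m
```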
    Preview abstract Sim-to-real transfer is a powerful paradigm for robotic reinforcement learning. The ability to train policies in simulation enables safe exploration and large-scale data collection quickly at low cost. However, prior works in sim-to-real transfer of robotic policies typically do not involve any human-robot interaction because accurately simulating human behavior is an open problem. In this work, our goal is to leverage the power of simulation to train robotic policies that are proficient at interacting with humans upon deployment. This presents a chicken-and-egg problem --- how to gather examples of a human interacting with a physical robot so as to model human behavior in simulation without already having a robot that is able to interact with a human? Our proposed method, Iterative-Sim-to-Real (i-S2R), attempts to address this. i-S2R bootstraps from a simple model of human behavior and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are improved. We evaluate our method on a real world robotic table tennis setting, where the objective for the robot is to play cooperatively with a human player for as long as possible. Table tennis is a high-speed, dynamic task that requires the two players to react quickly to each other’s moves, making for a challenging test bed for research on human-robot interaction. We present results on a physical industrial robotic arm that is able to cooperatively play table tennis against human players, achieving rallies of 22 successive hits on average and 150 at best. Further, for 80% of players, rally lengths are 70% to 175% longer compared to the sim-to-real (S2R) baseline. View details
    Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation
    Anthony G. Francis
    Dmitry Kalashnikov
    Edward Lee
    Jake Varley
    Leila Takayama
    Mikael Persson
    Peng Xu
    Stephen Tu
    Xuesu Xiao
    Conference on Robot Learning (2022) (to appear)
    Preview abstract Despite decades of research, existing navigation systems still face real-world challenges when being deployed in the wild, e.g., in cluttered home environments or in human-occupied public spaces. To address this, we present a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints of Model Predictive Control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by vision context embeddings provided by Performers---a low-rank implicit-attention Transformer. We jointly train the cost function and construct the controller relying on it, effectively solving end-to-end the corresponding bi-level optimization problem. We show that the resulting policy improves standard MPC performance by leveraging a few expert demonstrations of the desired navigation behavior in different challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves 40% better goal reached in cluttered environments and 65% better sociability when navigating around humans. View details
    Hybrid Random Features
    Haoxian Chen
    Han Lin
    Yuanzhe Ma
    Arijit Sehanobish
    Michael Ryoo
    Jake Varley
    Andy Zeng
    Valerii Likhosherstov
    Dmitry Kalashnikov
    Adrian Weller
    International Conference on Learning Representations (ICLR) (2022)
    Preview abstract We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi & Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021b). By generalizing Bochner’s Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees - unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct exhaustive empirical evaluation of HRF ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating its quality in a wide spectrum of machine learning problems. View details
    Preview abstract Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning. Prototypes are available at socraticmodels.github.io View details
    Sub-Linear Memory: How to Make Performers SLiM
    Valerii Likhosherstov
    Jared Davis
    Adrian Weller
    NeurIPS 2021
    Preview abstract The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring O(L^2) in serial time and memory as functions of input length L. Recent works proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sub-linear memory as a function of L (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only O(1) memory during training, and still requires O(L) time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning. View details
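    The time-memory trade-off described above stems from the fact that causal linear attention can be evaluated as a left-to-right recurrence whose state does not grow with the sequence length. The toy NumPy sketch below illustrates that recurrence under simplifying assumptions (a generic non-negative feature map already applied to queries and keys, and none of the chunking or recomputation used in the paper's actual algorithm).

```python
import numpy as np

def causal_linear_attention_streaming(Qf, Kf, V):
    """Causal linear attention as a left-to-right recurrence.

    Qf, Kf: (L, m) feature-mapped queries/keys (non-negative),
    V: (L, d) values. Only an (m, d) matrix and an (m,) vector of
    running state are kept, so memory does not grow with L.
    """
    m, d = Qf.shape[1], V.shape[1]
    S = np.zeros((m, d))   # running sum of outer(k_t, v_t)
    z = np.zeros(m)        # running sum of k_t (for normalization)
    out = np.empty_like(V)
    for t in range(V.shape[0]):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-9)
    return out

rng = np.random.default_rng(0)
L, m, d = 5, 7, 3
out = causal_linear_attention_streaming(np.abs(rng.standard_normal((L, m))),
                                        np.abs(rng.standard_normal((L, m))),
                                        rng.standard_normal((L, d)))
print(out.shape)  # (5, 3)
```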
    Rethinking Attention with Performers
    Valerii Likhosherstov
    David Martin Dohan
    Peter Hawkins
    Jared Quincy Davis
    Lukasz Kaiser
    Adrian Weller
    ICLR 2021 (oral presentation) (to appear)
    Preview abstract We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers. View details
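    A minimal NumPy sketch of the FAVOR+ idea follows: positive random features are applied to queries and keys, and the matrix products are associated so that the L x L attention matrix is never materialized, giving O(L·m·d) cost. The names and the omission of the usual 1/sqrt(d) scaling are illustrative simplifications, not the reference implementation.

```python
import numpy as np

def favor_plus_features(X, W):
    # Positive random features for the softmax kernel (one row per token).
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X * X, axis=1, keepdims=True)) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Approximate softmax attention without forming the L x L matrix.

    Q, K: (L, d) queries/keys, V: (L, d_v) values, W: (m, d) random projections.
    """
    Qf, Kf = favor_plus_features(Q, W), favor_plus_features(K, W)
    KV = Kf.T @ V                     # (m, d_v)
    normalizer = Qf @ Kf.sum(axis=0)  # (L,)
    return (Qf @ KV) / normalizer[:, None]

rng = np.random.default_rng(0)
L, d, m = 6, 8, 4096
Q, K, V = (0.2 * rng.standard_normal((L, d)) for _ in range(3))
W = rng.standard_normal((m, d))
A = np.exp(Q @ K.T)
exact = (A / A.sum(axis=1, keepdims=True)) @ V
print(np.max(np.abs(linear_attention(Q, K, V, W) - exact)))  # small for large m
```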
    Debiasing a First-order Heuristic for Approximate Bi-level Optimization
    Valerii Likhosherstov
    Jared Davis
    Adrian Weller
    Thirty-eighth International Conference on Machine Learning (ICML 2021)
    Preview abstract Approximate bi-level optimization (ABLO) consists of (outer-level) optimization problems, involving numerical (inner-level) optimization loops. While ABLO has many applications across deep learning, it suffers from time and memory complexity proportional to the length r of its inner optimization loop. To address this complexity, an earlier first-order method (FOM) was proposed as a heuristic that omits second derivative terms, yielding significant speed gains and requiring only constant memory. Despite FOM's popularity, there is a lack of theoretical understanding of its convergence properties. We contribute by theoretically characterizing FOM's gradient bias under mild assumptions. We further demonstrate a rich family of examples where FOM-based SGD does not converge to a stationary point of the ABLO objective. We address this concern by proposing an unbiased FOM (UFOM) enjoying constant memory complexity as a function of r. We characterize the introduced time-variance tradeoff, demonstrate convergence bounds, and find an optimal UFOM for a given ABLO problem. Finally, we propose an efficient adaptive UFOM scheme. View details
    Preview abstract Leveraging machine-learning (ML) techniques for compiler optimizations has been widely studied and explored in academia. However, the adoption of ML in general-purpose, industry-strength compilers has yet to happen. We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler -- LLVM. As a case study, we present the details and results of replacing the heuristics-based inlining-for-size optimization in LLVM with machine learned models. To the best of our knowledge, this work is the first full integration of ML in a complex compiler pass in a real-world setting. It is available in the main LLVM repository. We use two different ML algorithms, Policy Gradient and Evolution Strategies, to train the inlining-for-size model, and achieve up to 7% size reduction compared to the state-of-the-art LLVM -Oz. The same model, trained on one corpus, generalizes well to a diversity of real-world targets, as well as to the same set of targets after months of active development. This property of the trained models is beneficial for deploying ML techniques in real-world settings. View details
    Tournaments and the Strong Erdos-Hajnal Property
    Eli Berger
    Maria Chudnovsky
    Shira Zerbib
    European Journal of Combinatorics (2021)
    Preview abstract A conjecture of Alon, Pach and Solymosi, which is equivalent to the celebrated Erdős-Hajnal Conjecture, states that for every tournament S there exists epsilon(S)>0 such that if T is an n-vertex tournament that does not contain S as a subtournament, then T contains a transitive subtournament on at least n^{epsilon(S)} vertices. Let C_5 be the unique five-vertex tournament where every vertex has two inneighbors and two outneighbors. The Alon-Pach-Solymosi conjecture is known to be true for the case when S=C_5. Here we prove a strengthening of this result, showing that in every tournament T with no subtournament isomorphic to C_5 there exist disjoint vertex subsets A and B, each containing a linear proportion of the vertices of T, and such that every vertex of A is adjacent to every vertex of B. View details
    Catformer: Designing Stable Transformers via Sensitivity Analysis
    Jared Quincy Davis
    Albert Gu
    Tri Dao
    Christopher Re
    Chelsea Finn
    Percy Liang
    Thirty-eighth International Conference on Machine Learning (ICML 2021)
    Preview abstract Transformer architectures are widely used, but training them is non-trivial, requiring custom learning rate schedules, scaling terms, residual connections, careful placement of submodules such as normalization, and so on. In this paper, we improve upon recent analysis of Transformers and formalize a notion of sensitivity to capture the difficulty of training. Sensitivity characterizes how the variance of activation and gradient norms change in expectation when parameters are randomly perturbed. We analyze the sensitivity of previous Transformer architectures and design a new architecture, the Catformer, which replaces residual connections or RNN-based gating mechanisms with concatenation. We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other Transformers, including Gated Transformer-XL—the state-of-the-art architecture designed to address stability—by 13%. View details
    Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning
    Yuxiang Yang
    Wenbo Gao
    Chelsea Finn
    International Conference on Intelligent Robots and Systems (IROS) (2020) (to appear)
    Preview abstract Learning adaptable policies is crucial for robots to operate autonomously in our complex and quickly changing world. In this work, we present a new meta-learning method that allows robots to quickly adapt to changes in dynamics. In contrast to gradient-based meta-learning algorithms that rely on second-order gradient estimation, we introduce a more noise-tolerant Batch Hill-Climbing adaptation operator and combine it with meta-learning based on evolutionary strategies. Our method significantly improves adaptation to changes in dynamics in high noise settings, which are common in robotics applications. We validate our approach on a quadruped robot that learns to walk while subject to changes in dynamics. We observe that our method significantly outperforms prior gradient-based approaches, enabling the robot to adapt its policy to changes based on less than 3 minutes of real data. View details
    Preview abstract We propose a model-free algorithm for learning efficient policies capable of returning table tennis balls by controlling robot joints at a rate of 100Hz. We demonstrate that evolutionary search (ES) methods acting on CNN-based policy architectures for non-visual inputs and convolving across time learn compact controllers leading to smooth motions. Furthermore, we show that with appropriately tuned curriculum learning on the task and rewards, policies are capable of developing multi-modal styles, specifically forehand and backhand strokes, whilst achieving 80% return rate on a wide range of ball throws. We observe that multi-modality does not require any architectural priors, such as multi-head architectures or hierarchical policies. View details
    Unsupervised Anomaly Detection for Self-flying Delivery Drones
    Hakim Sidahmed
    Brandon Jones
    International Conference on Robotics and Automation (2020)
    Preview abstract We propose a novel anomaly detection framework for a fleet of hybrid aerial vehicles executing high-speed package pickup and delivery missions. The detection is based on machine learning models of normal flight profiles, trained on millions of flight log measurements of control inputs and sensor readings. We develop a new scalable algorithm for robust regression which can simultaneously fit predictive flight dynamics models while identifying and discarding abnormal flight missions from the training set. The resulting unsupervised estimator has a very high breakdown point and can withstand massive contamination of training data to uncover what normal flight patterns look like, without requiring any form of prior knowledge of aircraft aerodynamics or manual labeling of anomalies upfront. Across many different anomaly types, spanning simple 3-sigma statistical thresholds to turbulence and other equipment anomalies, our models achieve high detection rates across the board. Our method consistently outperforms alternative robust detection methods on synthetic benchmark problems. To the best of our knowledge, dynamics modeling of hybrid delivery drones for anomaly detection at the scale of 100 million measurements from 5000 real flight missions in variable flight conditions is unprecedented. View details
    Variance Reduction for Evolution Strategies via Structured Control Variates
    Yunhao Tang
    Alp Kucukelbir
    The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020) (to appear)
    Preview abstract Evolution Strategies (ES) are a powerful class of blackbox optimization techniques that recently became a competitive alternative to state-of-the-art policy gradient (PG) algorithms for reinforcement learning (RL). We propose a new method for improving accuracy of the ES algorithms, that as opposed to recent approaches utilizing only Monte Carlo structure of the gradient estimator, takes advantage of the underlying Markov Decision Process (MDP) structure to reduce the variance. We observe that the gradient estimator of the ES objective can be alternatively computed using reparametrization and PG estimators, which leads to new control variate techniques for gradient estimation in ES optimization. We provide theoretical insights and show through extensive experiments that this RL-specific variance reduction approach outperforms general purpose variance reduction methods. View details
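    For context, the baseline the paper improves on is the Monte Carlo ES gradient estimator; a minimal antithetic version is sketched below in NumPy (the paper's MDP-structured control variates are not shown, and all names are illustrative).

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, num_pairs=50, rng=None):
    """Vanilla antithetic evolution-strategies estimate of the gradient of
    the Gaussian smoothing of a blackbox function f at theta."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(num_pairs):
        eps = rng.standard_normal(theta.shape)
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) * eps
    return grad / (2.0 * sigma * num_pairs)

# Example: estimate the gradient of a simple quadratic.
f = lambda x: -np.sum(x ** 2)
theta = np.ones(5)
print(es_gradient(f, theta, num_pairs=2000))  # approximately -2 * theta
```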
    Ready Policy One: World Building Through Active Learning
    Philip Ball
    Jack Parker-Holder
    Aldo Pacchiano
    Stephen Roberts
    Thirty-seventh International Conference on Machine Learning (ICML 2020) (to appear)
    Preview abstract Model-Based Reinforcement Learning (MBRL) offers a promising direction for sample efficient learning, often achieving state of the art results for continuous control tasks. However, many existing MBRL methods rely on combining greedy policies with exploration heuristics, and even those which utilize principled exploration bonuses construct dual objectives in an ad hoc fashion. In this paper we introduce Ready Policy One (RP1), a framework that views MBRL as an active learning problem, where we aim to improve the world model in the fewest samples possible. RP1 achieves this by utilizing a hybrid objective function, which crucially adapts during optimization, allowing the algorithm to trade off reward v.s. exploration at different stages of learning. In addition, we introduce a principled mechanism to terminate sample collection once we have a rich enough trajectory batch to improve the model. We rigorously evaluate our method on a variety of continuous control tasks, and demonstrate statistically significant gains over existing approaches. View details
    Practical Nonisotropic Monte Carlo Sampling in High Dimensions via Determinantal Point Processes
    Aldo Pacchiano
    Jack Parker-Holder
    Yunhao Tang
    The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020) (to appear)
    Preview abstract We propose a new class of practical structured methods for nonisotropic Monte Carlo (MC) sampling, called DPPMC, designed for high-dimensional nonisotropic distributions where samples are correlated to reduce the variance of the estimator via determinantal point processes. We successfully apply DPPMCs to high-dimensional problems involving nonisotropic distributions arising in guided evolution strategy (GES) methods for reinforcement learning (RL), CMA-ES techniques and trust region algorithms for blackbox optimization, improving state-of-the-art in all these settings. In particular, we show that DPPMCs drastically improve exploration profiles of the existing evolution strategy algorithms. We further confirm our results, analyzing random feature map estimators for Gaussian mixture kernels. We provide theoretical justification of our empirical results, showing a connection between DPPMCs and recently introduced structured orthogonal MC methods for isotropic distributions. View details
    Demystifying Orthogonal Monte Carlo and Beyond
    Han Lin
    Haoxian Chen
    Tianyi Zhang
    Clement Laroche
    NeurIPS 2020 (2020)
    Preview abstract Orthogonal Monte Carlo (OMC) is a very effective sampling algorithm imposing structural geometric conditions (orthogonality) on samples for variance reduction. Due to its simplicity and superior performance as compared to its Quasi Monte Carlo counterparts, OMC is used in a wide spectrum of challenging machine learning applications ranging from scalable kernel methods to predictive recurrent neural networks, generative models and reinforcement learning. However, theoretical understanding of the method remains very limited. In this paper we shed new light on the theoretical principles behind OMC, applying the theory of negatively dependent random variables to obtain several new concentration results. We also propose a novel extension of the method leveraging number theory techniques and particle algorithms, called Near-Orthogonal Monte Carlo (NOMC). We show that NOMC is the first algorithm consistently outperforming OMC in applications ranging from kernel methods to approximating distances in probabilistic metric spaces. View details
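    A common way to construct the orthogonal samples that OMC relies on is to orthogonalize a Gaussian matrix and rescale its rows to Gaussian-like norms; a minimal sketch is below. This is the standard OMC baseline, not the NOMC extension proposed in the paper, and the function name is illustrative.

```python
import numpy as np

def orthogonal_gaussian_samples(d, rng):
    """d samples in R^d whose directions are exactly orthogonal and whose
    row norms match those of isotropic Gaussian vectors."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    Q = Q * np.sign(np.diag(R))                  # Haar-distributed orthogonal matrix
    norms = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
    return Q * norms[:, None]                    # rescale rows to chi_d-distributed norms

rng = np.random.default_rng(0)
S = orthogonal_gaussian_samples(4, rng)
print(np.round(S @ S.T, 3))  # off-diagonal entries are exactly zero
```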
    Effective Diversity in Population-Based Reinforcement Learning
    Aldo Pacchiano
    Jack Parker-Holder
    Stephen Roberts
    NeurIPS 2020 (spotlight) (2020)
    Preview abstract Exploration is a key problem in reinforcement learning, since agents can only learn from data they acquire in the environment. With that in mind, maintaining a population of agents is an attractive method, as it allows data to be collected with a diverse set of behaviors. This behavioral diversity is often boosted via multi-objective loss functions. However, those approaches typically leverage mean field updates based on pairwise distances, which makes them susceptible to cycling behaviors and increased redundancy. In addition, explicitly boosting diversity often has a detrimental impact on optimizing already fruitful behaviors for rewards. As such, the reward-diversity trade-off typically relies on heuristics. Finally, such methods require behavioral representations, often handcrafted and domain specific. In this paper, we introduce an approach to optimize all members of a population simultaneously. Rather than using pairwise distance, we measure the volume of the entire population in a behavioral manifold, defined by task-agnostic behavioral embeddings. In addition, our algorithm, Diversity via Determinants (DvD), adapts the degree of diversity during training using online learning techniques. We introduce both evolutionary and gradient-based instantiations of DvD and show they effectively improve exploration without reducing performance when better exploration is not required. View details
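    The "volume of the entire population in a behavioral manifold" can be made concrete with the determinant of a kernel matrix over behavioral embeddings. The sketch below is only a toy illustration of that score under an assumed RBF kernel, not the DvD training procedure.

```python
import numpy as np

def diversity_score(embeddings, lengthscale=1.0):
    """Volume-style diversity of a population: determinant of the RBF kernel
    matrix of its behavioral embeddings (rows). Redundant members make rows
    nearly identical and drive the determinant toward zero."""
    sq = np.sum(embeddings ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    K = np.exp(-d2 / (2.0 * lengthscale ** 2))
    return np.linalg.det(K)

rng = np.random.default_rng(0)
diverse = rng.standard_normal((5, 3))
redundant = np.tile(diverse[:1], (5, 1)) + 1e-3 * rng.standard_normal((5, 3))
print(diversity_score(diverse), diversity_score(redundant))  # first >> second
```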
    Learning to Score Behaviors for Guided Policy Optimization
    Aldo Pacchiano
    Jack Parker-Holder
    Yunhao Tang
    Anna Choromanska
    Michael I. Jordan
    Thirty-seventh International Conference on Machine Learning (ICML 2020) (to appear)
    Preview abstract We introduce a new approach for comparing reinforcement learning policies, using Wasserstein distances (WDs) in a newly defined latent behavioral space. We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors. Combined with smoothed WDs, the dual formulation allows us to devise efficient algorithms that take stochastic gradient descent steps through WD regularizers. We incorporate these regularizers into two novel on-policy algorithms, Behavior-Guided Policy Gradient and Behavior-Guided Evolution Strategies, which we demonstrate can outperform existing methods in a variety of challenging environments. We also provide an open source demo. View details
    Stochastic Flows and Geometric Optimization on the Orthogonal Group
    David Cheikhi
    Jared Davis
    Valerii Likhosherstov
    Achille Nazaret
    Achraf Bahamou
    Xingyou Song
    Mrugank Akarte
    Jack Parker-Holder
    Jacob Bergquist
    Yuan Gao
    Aldo Pacchiano
    Adrian Weller
    Thirty-seventh International Conference on Machine Learning (ICML 2020) (to appear)
    Preview abstract We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group O(d) and naturally reductive homogeneous manifolds obtained from the action of the rotation group SO(d). We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult Humanoid agent from OpenAI Gym and improving convolutional neural networks. View details
    An Ode to an ODE
    Jared Quincy Davis
    Valerii Likhosherstov
    Jean-Jacques Slotine
    Jake Varley
    Honglak Lee
    Adrian Weller
    NeurIPS 2020 (2020)
    Preview abstract We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d). This nested system of two flows, where the parameter-flow is constrained to lie on the compact manifold, provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem which is intrinsically related to training deep neural network architectures such as Neural ODEs. Consequently, it leads to better downstream models, as we show on the example of training reinforcement learning policies with evolution strategies, and in the supervised learning setting, by comparing with previous SOTA baselines. We provide strong convergence results for our proposed mechanism that are independent of the depth of the network, supporting our empirical studies. Our results show an intriguing connection between the theory of deep neural networks and the field of matrix flows on compact manifolds. View details
    From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization
    Aldo Pacchiano
    Jack Parker-Holder
    Yunhao Tang
    Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019)
    Preview abstract We present a new algorithm ASEBO for optimizing high-dimensional blackbox functions. ASEBO adapts to the geometry of the function and learns optimal sets of sensing directions, which are used to probe it, on-the-fly. It addresses the exploration-exploitation trade-off of blackbox optimization with expensive blackbox queries by continuously learning the bias of the lower-dimensional model used to approximate gradients of smoothings of the function via compressed sensing and contextual bandits methods. To obtain this model, it leverages techniques from the emerging theory of active subspaces in the novel ES blackbox optimization context. As a result, ASEBO learns the dynamically changing intrinsic dimensionality of the gradient space and adapts to the hardness of different stages of the optimization without external supervision. Consequently, it leads to more sample-efficient blackbox optimization than state-of-the-art algorithms. We provide theoretical results and test ASEBO advantages over other methods empirically by evaluating it on the set of reinforcement learning policy optimization tasks as well as functions from the recently open-sourced Nevergrad library. View details
    KAMA-NNs: Low-dimensional Rotation Based Neural Networks
    Aldo Pacchiano
    Jeffrey Pennington
    Yunhao Tang
    The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019)
    Preview abstract We present new architectures for feedforward neural networks built from products of learned or random low-dimensional rotations that offer substantial space compression and computational speedups in comparison to the unstructured baselines. Models using them are also competitive with the baselines and often, due to imposed orthogonal structure, outperform baselines accuracy-wise. We propose to use our architectures in two settings. We show that in the non-adaptive scenario (random neural networks) they lead to asymptotically more accurate, space-efficient and faster estimators of the so-called PNG-kernels (for any activation function defining the PNG). This generalizes several recent theoretical results about orthogonal estimators (e.g. orthogonal JLTs, orthogonal estimators of angular kernels and more). In the adaptive setting we propose efficient algorithms for learning products of low-dimensional rotations and show how our architectures can be used to improve space and time complexity of state of the art reinforcement learning (RL) algorithms (e.g. PPO, TRPO). Here they offer up to 7x compression of the network in comparison to the unstructured baselines and outperform reward-wise state of the art structured neural networks offering similar computational gains and based on low displacement rank matrices. View details
    Preview abstract Interest in derivative-free optimization (DFO) and "evolutionary strategies" (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they can match state of the art methods for policy optimization problems in Robotics. However, it is well known that DFO methods suffer from prohibitively high sampling complexity. They can also be very sensitive to noisy rewards and stochastic dynamics. In this paper, we propose a new class of algorithms, called Robust Blackbox Optimization (RBO). Remarkably, even if up to 23% of all the measurements are arbitrarily corrupted, RBO can provably recover gradients to high accuracy. RBO relies on learning gradient flows using robust regression methods to enable off-policy updates. On several MuJoCo robot control tasks, when all other RL approaches collapse in the presence of adversarial noise, RBO is able to train policies effectively. We also show that RBO can be applied to legged locomotion tasks including path tracking for quadruped robots. View details
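    The gradient-recovery step can be viewed as a robust regression from finite-difference measurements, a fraction of which may be arbitrarily corrupted. The sketch below uses a simple iteratively reweighted least-squares approximation of L1 regression as a stand-in for the paper's robust-regression machinery; all names are illustrative.

```python
import numpy as np

def robust_gradient(E, y, iters=20):
    """Approximate-L1 regression (IRLS) recovering g with y ~ E @ g,
    tolerating a fraction of arbitrarily corrupted measurements."""
    g = np.linalg.lstsq(E, y, rcond=None)[0]
    for _ in range(iters):
        w = np.sqrt(1.0 / np.maximum(np.abs(y - E @ g), 1e-6))
        g = np.linalg.lstsq(E * w[:, None], y * w, rcond=None)[0]
    return g

# Toy demo: finite differences of f(x) = sum(sin(x)) at 0, with 20% corrupted.
rng = np.random.default_rng(0)
f, theta, sigma = (lambda x: np.sin(x).sum()), np.zeros(4), 0.05
E = rng.standard_normal((100, 4))
y = np.array([(f(theta + sigma * e) - f(theta)) / sigma for e in E])
y[:20] += 50.0 * rng.standard_normal(20)   # arbitrarily corrupted measurements
print(robust_gradient(E, y))               # close to cos(0) = [1 1 1 1]
```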
    Unifying Orthogonal Monte Carlo Methods
    Mark Rowland
    Wenyu Chen
    Adrian Weller
    Thirty-sixth International Conference on Machine Learning (ICML 2019)
    Preview abstract Many machine learning methods making use of Monte Carlo sampling in vector spaces have been shown to be improved by conditioning samples to be mutually orthogonal. Exact orthogonal coupling of samples is computationally intensive, hence approximate methods have been of great interest. In this paper, we present a unifying perspective of many approximate methods by considering Givens transformations, propose new approximate methods based on this framework, and demonstrate the first statistical guarantees for families of approximate methods in kernel approximation. We provide extensive empirical evaluations with guidance for practitioners. View details
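    Many of the approximate coupling schemes unified in the paper are built from products of random Givens (two-coordinate) rotations; a minimal sketch of applying such a product is below, purely to illustrate the building block rather than any specific estimator from the paper.

```python
import numpy as np

def random_givens_product(X, num_rotations, rng):
    """Apply a product of random two-coordinate Givens rotations to the rows
    of X. Each rotation is orthogonal, so the product is orthogonal, but the
    cost is O(num_rotations) per sample instead of a dense d x d multiply."""
    X = X.copy()
    d = X.shape[1]
    for _ in range(num_rotations):
        i, j = rng.choice(d, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        xi, xj = X[:, i].copy(), X[:, j].copy()
        X[:, i], X[:, j] = c * xi - s * xj, s * xi + c * xj
    return X

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))
Y = random_givens_product(X, num_rotations=32, rng=rng)
print(np.linalg.norm(X, axis=1), np.linalg.norm(Y, axis=1))  # norms are preserved
```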
    Orthogonal Estimation of Wasserstein Distances
    Mark Rowland
    Jiri Hron
    Yunhao Tang
    Adrian Weller
    The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019)
    Preview abstract Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and draw connections with stratified sampling, and evaluate our approaches experimentally in a range of large-scale experiments in generative modelling and reinforcement learning. View details
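    A minimal sketch of the estimator family discussed above: sliced Wasserstein-1 computed from one-dimensional sorting, with the option of drawing the projection directions as an orthonormal set instead of i.i.d. Gaussians. This is an illustrative simplification (equal sample sizes, W1 only), not the paper's exact variant.

```python
import numpy as np

def sliced_wasserstein_1(X, Y, num_dirs, rng, orthogonal=True):
    """Average 1-D Wasserstein-1 distance over random projection directions.
    X, Y: (n, d) samples from the two distributions (equal sample sizes)."""
    d = X.shape[1]
    G = rng.standard_normal((num_dirs, d))
    if orthogonal and num_dirs <= d:
        Q, _ = np.linalg.qr(G.T)      # (d, num_dirs) with orthonormal columns
        dirs = Q.T                    # rows are orthonormal directions
    else:
        dirs = G / np.linalg.norm(G, axis=1, keepdims=True)
    total = 0.0
    for g in dirs:
        # 1-D W1 between equal-size empirical measures = mean gap of sorted projections.
        total += np.mean(np.abs(np.sort(X @ g) - np.sort(Y @ g)))
    return total / len(dirs)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))
Y = rng.standard_normal((256, 8)) + 1.0
print(sliced_wasserstein_1(X, Y, num_dirs=8, rng=rng))
```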
    ES-MAML: Simple Hessian-Free Meta Learning
    Wenbo Gao
    Yuxiang Yang
    Aldo Pacchiano
    Yunhao Tang
    Eighth International Conference on Learning Representations (ICLR 2020) (2019)
    Preview abstract We introduce ES-MAML, a new framework for solving the model agnostic meta learning (MAML) problem based on Evolution Strategies (ES). Existing algorithms for MAML are based on policy gradients, and incur significant difficulties when attempting to estimate second derivatives using backpropagation on stochastic policies. We show how ES can be applied to MAML to obtain an algorithm which avoids the problem of estimating second derivatives, and is also conceptually simple and easy to implement. Moreover, ES-MAML can handle new types of nonsmooth adaptation operators, and other techniques for improving performance and estimation of ES methods become applicable. We show empirically that ES-MAML is competitive with existing methods and often yields better adaptation with fewer queries. View details
    On the Erdos-Hajnal conjecture for six-vertex tournaments
    Eli Berger
    Maria Chudnovsky
    European Journal of Combinatorics (2018) (to appear)
    Preview abstract A celebrated unresolved conjecture of Erdos and Hajnal states that for every undirected graph H there exists epsilon(H) > 0 such that every undirected graph on n vertices that does not contain H as an induced subgraph contains a clique or stable set of size at least n^{epsilon(H)}. The conjecture has a directed equivalent version stating that for every tournament H there exists epsilon(H) > 0 such that every H-free n-vertex tournament T contains a transitive subtournament of order at least n^{epsilon(H)}. We say that a tournament is prime if it does not have nontrivial homogeneous sets. So far the conjecture was proved only for some specific families of prime tournaments and tournaments constructed according to the so-called substitution procedure. In particular, recently the conjecture was proved for all five-vertex tournaments, but the question about the correctness of the conjecture for all six-vertex tournaments remained open. In this paper we prove that all but at most one six-vertex tournament satisfy the Erdos-Hajnal conjecture. That reduces the six-vertex case to a single tournament. View details
    Adaptive Anonymization of Data with b-Edge Covers
    Ariful Khan
    Alex Pothen
    Mahantesh Halappanavar
    Antonino Tumeo
    A M Ferdous
    SC (2018)
    Preview abstract We explore the problem of sharing data that pertains to individuals with anonymity guarantees, where each user requires a desired level of privacy. We propose the first shared-memory as well as distributed memory parallel algorithms for the adaptive anonymity problem that achieves this goal, and produces high quality anonymized datasets. The new algorithm is based on an optimization procedure that iteratively computes weights on the edges of a dissimilarity matrix, and at each iteration computes a minimum weighted b-Edge Cover in the graph. We describe how a 2-approximation algorithm for computing the b-Edge Cover can be used to solve the adaptive anonymity problem in parallel. We are able to solve adaptive anonymity problems with hundreds of thousands of instances and hundreds of features on a supercomputer in under five minutes. Our algorithm scales up to 8K cores on a distributed memory supercomputer, while also providing good speedups on shared memory multiprocessors. On smaller problems where an algorithm based on Belief Propagation is feasible, our algorithm is two orders of magnitude faster. View details
    Preview abstract We present a new method of blackbox optimization via gradient approximation with the use of structured random orthogonal matrices, providing more accurate estimators than baselines and with provable theoretical guarantees. We show that this algorithm can be successfully applied to learn better quality compact policies than those using standard gradient estimation techniques. The compact policies we learn have several advantages over unstructured ones, including faster training algorithms and faster inference. These benefits are important when the policy is deployed on real hardware with limited resources. Further, compact policies provide more scalable architectures for derivative-free optimization (DFO) in high-dimensional spaces. We show that most robotics tasks from the OpenAI Gym can be solved using neural networks with less than 300 parameters, with almost linear time complexity of the inference phase, with up to 13x fewer parameters relative to the Evolution Strategies (ES) algorithm introduced by Salimans et al. (2017). We do not need heuristics such as fitness shaping to learn good quality policies, resulting in a simple and theoretically motivated training mechanism. View details
    Preview abstract The Erdos-Hajnal conjecture states that for every given undirected graph H there exists a constant c(H)>0 such that every graph G that does not contain H as an induced subgraph contains a clique or a stable set of size at least |V(G)|^{c(H)}. The conjecture is still open. Its equivalent directed version states that for every given tournament H there exists a constant c(H)>0 such that every H-free tournament T contains a transitive subtournament of order at least |V(T)|^{c(H)}. We prove in this paper that {H1,H2}-free tournaments T contain transitive subtournaments of size at least |V(T)|^{c(H1,H2)} for some c(H1,H2)>0 and several pairs of tournaments H1, H2. In particular we prove that {H,Hc}-freeness implies the existence of polynomial-size transitive subtournaments for several tournaments H for which the conjecture is still open (Hc stands for the complement of H). To the best of our knowledge, these are the first nontrivial results of this type. View details
    Excluding hooks and their complements
    Dvir Falik
    Anita Liebenau
    Viresh Patel
    Marcin Pilipczuk
    Electr. J. Comb. (2018)
    Preview abstract The celebrated Erdos-Hajnal conjecture states that for every n-vertex undirected graph H there exists epsilon(H)>0 such that every graph G that does not contain H as an induced subgraph contains a clique or an independent set of size at least n^{epsilon(H)}. A weaker version of the conjecture states that the polynomial-size clique/independent set phenomenon occurs if one excludes both H and its complement H^c. We show that the weaker conjecture holds if H is any path with a pendant edge at its third vertex; thus we give a new infinite family of graphs for which the conjecture holds. View details
    Learning-based Air Data System for Safe and Efficient Control of Fixed-Wing Aerial Vehicles
    Brandon Jones
    Damien Jourdan
    Maciej Chociej
    Byron Boots
    THE 16TH IEEE INTERNATIONAL SYMPOSIUM ON SAFETY, SECURITY, AND RESCUE ROBOTICS (SSRR-2018) (to appear)
    Preview abstract We develop an air data system for aerial robots executing high-speed outdoor missions subject to significant aerodynamic forces on their bodies. The system is based on a combination of Extended Kalman Filtering (EKF) and autoregressive feedforward Neural Networks, relying only on IMU sensors and GPS. This eliminates the need to instrument the vehicle with Pitot tubes and mechanical vanes, reducing associated cost, weight, maintenance requirements and likelihood of catastrophic mechanical failures. The system is trained to clone the behaviour of Pitot-tube measurements on thousands of instrumented simulated and real flights, and does not require a vehicle aerodynamics model. We demonstrate that safe guidance and navigation is possible in executing complex maneuvers in the presence of wind gusts without relying on airspeed sensors. We also demonstrate accuracy enhancements from successful “simulation-to-reality” transfer and dataset aggregation techniques to correct for training-test distribution mismatches when the air-data system and the control stack operate in closed loop. View details
    Preview abstract Learning to predict complex time-series data is a fundamental challenge in a range of disciplines including Machine Learning, Robotics, and Natural Language Processing. Predictive State Recurrent Neural Networks (PSRNNs) Downey et al. (2017) are a state-of-the-art approach for modeling time-series data which combine the benefits of probabilistic filters and Recurrent Neural Networks in a single model. PSRNNs leverage the concept of Hilbert Space Embeddings of distributions Smola et al. (2007) to embed predictive states into a Reproducing Kernel Hilbert Space, then estimate, predict, and update these embedded states using Kernel Bayes Rule. Practical implementations of PSRNNs are made possible by the machinery of Random Features, where input features are mapped into a new space where dot products approximate the kernel well. Unfortunately it turns out that PSRNNs often require a large number of RFs to obtain good results, resulting in large models which are slow to execute and slow to train. Orthogonal Random Features (ORFs)Yu et al. (2016) is an improvement on RFs which has been shown to decrease the number of RFs required in a number of applications. Unfortunately it is not clear that ORFs can be applied to PSRNNs, as PSRNNs rely on Kernel Ridge Regression as a core component of their learning algorithm, and the theoretical guarantees of ORF do not apply in this setting. In this paper we extend the theory of ORFs to Kernel Ridge Regression and show that ORFs can be used to obtain Orthogonal PSRNNs (OPSRNNs), which are smaller and faster than PSRNNs. View details
    Geometrically Coupled Monte Carlo Sampling
    Mark Rowland
    François Chalus
    Aldo Pacchiano
    Richard E. Turner
    Adrian Weller
    Advances in Neural Information Processing Systems 31 (NIPS 2018)
    Preview abstract Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including QMC, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling. View details
    Preview abstract We propose a simple drop-in noise-tolerant replacement for the standard finite difference procedure used ubiquitously in blackbox optimization. In our approach, parameter perturbation directions are defined by a family of deterministic or randomized structured matrices. We show that at the small cost of computing a Fast Fourier Transform (FFT), such structured finite differences consistently give higher quality approximation of gradients and Jacobians in comparison to vanilla approaches that use coordinate directions or random Gaussian perturbations. We show that linearization of noisy, blackbox dynamics using our methods leads to improved performance of trajectory optimizers like Iterative LQR and Differential Dynamic Programming on several classic continuous control tasks. By embedding structured exploration in implicit filtering methods, we are able to learn agile walking and turning policies for quadruped locomotion, that successfully transfer from simulation to actual hardware. We give a theoretical justification of our methods in terms of bounds on the quality of gradient reconstruction in the presence of noise. View details
    VisualBackProp: efficient visualization of CNNs
    Mariusz Bojarski
    Anna Choromanska
    Bernard Firner
    Larry Jackel
    Urs Muller
    Karol Zieba
    ICRA (2018)
    Preview abstract This paper proposes a new method, that we call VisualBackProp, for visualizing which sets of pixels of the input image contribute most to the predictions made by the convolutional neural network (CNN). The method hinges on the intuition that the feature maps contain less and less information irrelevant to the prediction decision when moving deeper into the network. The technique we propose was developed as a debugging tool for CNN-based systems for steering self-driving cars and is therefore required to run in real-time, i.e. the proposed method was designed to require fewer computations than a single forward propagation per image. This makes the presented visualization method a valuable debugging tool which can be easily used during training or inference. We furthermore justify our approach with theoretical arguments and theoretically confirm that the proposed method identifies sets of input pixels, rather than individual pixels, that collaboratively contribute to the prediction. Our theoretical findings stand in agreement with experimental results. The empirical evaluation shows the plausibility of the proposed approach on road data. View details
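    A rough NumPy sketch of the kind of backward pass the method describes is below: channel-averaged feature maps are propagated from the deepest layer back to the input resolution by upsampling and pointwise multiplication. It is a heavily simplified stand-in (nearest-neighbor upsampling instead of deconvolutions, made-up layer shapes), not the authors' implementation.

```python
import numpy as np

def nearest_upsample(mask, out_shape):
    """Nearest-neighbor upsampling of a 2-D mask to out_shape."""
    rows = (np.arange(out_shape[0]) * mask.shape[0]) // out_shape[0]
    cols = (np.arange(out_shape[1]) * mask.shape[1]) // out_shape[1]
    return mask[np.ix_(rows, cols)]

def visualbackprop_mask(feature_maps):
    """feature_maps: list of (C, H, W) activations, ordered shallow -> deep.
    Returns an (H0, W0) relevance mask at the resolution of the first layer."""
    averaged = [fm.mean(axis=0) for fm in feature_maps]    # channel-averaged maps
    mask = averaged[-1]
    for avg in reversed(averaged[:-1]):
        mask = nearest_upsample(mask, avg.shape) * avg      # upsample, then gate
    return mask / (mask.max() + 1e-9)

rng = np.random.default_rng(0)
maps = [np.abs(rng.standard_normal((8, 32, 32))),
        np.abs(rng.standard_normal((16, 16, 16))),
        np.abs(rng.standard_normal((32, 8, 8)))]
print(visualbackprop_mask(maps).shape)  # (32, 32)
```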
    The Geometry of Random Features
    Mark Rowland
    Richard Turner
    Adrian Weller
    International Conference on Artificial Intelligence and Statistics (AISTATS) (2018)
    Preview abstract We present an in-depth examination of the effectiveness of radial basis function kernel (beyond Gaussian) estimators based on orthogonal random feature maps. We show that orthogonal estimators outperform state-of-the-art mechanisms that use iid sampling under weak conditions for tails of the associated Fourier distributions. We prove that for the case of many dimensions, the superiority of the orthogonal transform over iid methods can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal kernel estimators. Furthermore, we provide the first theoretical results which explain why orthogonal random features outperform unstructured on downstream tasks such as kernel ridge regression by showing that orthogonal random features provide kernel algorithms with better spectral properties than the previous state-of-the-art. Our results enable practitioners more generally to estimate the benefits from applying orthogonal transforms. View details
    Graph sketching-based Space-efficient Data Clustering
    Anne Morvan
    Cedric Gouy-Pailler
    Jamal Atif
    SIAM International Conference on DATA MINING (SDM) (2018)
    Preview abstract In this paper, we address the problem of recovering arbitrary-shaped data clusters from datasets while facing high space constraints, as is for instance the case in many real-world applications when analysis algorithms are directly deployed on resource-limited mobile devices collecting the data. We present DBMSTClu, a new space-efficient density-based non-parametric method working on a Minimum Spanning Tree (MST) recovered from a limited number of linear measurements, i.e. a sketched version of the dissimilarity graph G between the N objects to cluster. Unlike k-means, k-medians or k-medoids algorithms, it does not fail at distinguishing clusters with particular forms thanks to the property of the MST for expressing the underlying structure of a graph. No input parameter is needed, in contrast to DBSCAN or the Spectral Clustering method. An approximate MST is retrieved by following the dynamic semi-streaming model in handling the dissimilarity graph G as a stream of edge weight updates which is sketched in one pass over the data into a compact structure requiring O(N polylog(N)) space, far better than the theoretical memory cost O(N^{2}) of G. With the recovered approximate MST T as input, DBMSTClu then successfully detects the right number of non-convex clusters by performing relevant cuts on T in a time linear in N. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state-of-the-art on several datasets. View details
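    For intuition on why an MST suffices for non-convex clusters, the sketch below performs only the generic "cluster by cutting MST edges" step, on an exact MST from SciPy and with the naive rule of removing the heaviest edges; DBMSTClu's sketched MST recovery and validity-index-based cut selection are not shown, and all names are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_cut_clustering(points, num_clusters):
    """Cluster by removing the (num_clusters - 1) heaviest edges of the
    Euclidean MST (num_clusters >= 2); connected components of what
    remains are the clusters."""
    mst = minimum_spanning_tree(squareform(pdist(points))).toarray()
    cutoff = np.sort(mst[mst > 0])[-(num_clusters - 1)]
    mst[mst >= cutoff] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
print(np.bincount(mst_cut_clustering(pts, 2)))  # two clusters of ~50 points each
```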
    Preview abstract We examine a class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction and kernel approximation. For both the Johnson-Lindenstrauss transform and the angular kernel, we show that we can select matrices yielding guaranteed improved performance in accuracy and/or speed compared to earlier methods. We introduce matrices with complex entries which give significant further accuracy improvement. We provide geometric and Markov chain-based perspectives to help understand the benefits, and empirical results which suggest that the approach is helpful in a wider range of applications. View details
    Preview abstract From a small number of calls to a given "blackbox" on random input perturbations, we show how to efficiently recover its unknown Jacobian, or estimate the left action of its Jacobian on a given vector. Our methods are based on a novel combination of compressed sensing and graph coloring techniques, and provably exploit structural prior knowledge about the Jacobian such as sparsity and symmetry while being noise robust. We demonstrate efficient backpropagation through noisy blackbox layers in a deep neural net, improved data-efficiency in the task of linearizing the dynamics of a rigid body system, and the generic ability to handle a rich class of input-output dependency structures in Jacobian estimation problems. View details
    Structured adaptive and random spinners for fast machine learning computations
    Mariusz Bojarski
    Anna Choromanska
    Francois Fagan
    Cedric Gouy-Pailler
    Anne Morvan
    Nourhan Sakr
    Jamal Atif
    AISTATS (2017)
    Preview abstract We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consideration can either be fully-randomized or learned, ii) our structured family contains as special cases all previously considered structured schemes, iii) the setting extends to the non-linear case where the projections are followed by non-linear functions, and iv) the method finds numerous applications including kernel approximations via random feature maps, dimensionality reduction algorithms, new fast cross-polytope LSH techniques, deep learning, convex optimization algorithms via Newton sketches, quantization with random projection trees, and more. The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. As a consequence of our theoretical analysis, we provide the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the HD3HD2HD1 structured matrix [Andoni et al., 2015]. The exhaustive experimental evaluation confirms the accuracy and efficiency of structured spinners for a variety of different applications. View details
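    The HD-blocks underlying structured spinners (and the HD3HD2HD1 matrix cited above) can be sketched in a few lines: three rounds of a random sign diagonal followed by a normalized Hadamard transform. The dense Hadamard matrix below is only for clarity; a fast Walsh-Hadamard transform would give the O(d log d) cost, the dimension must be a power of two, and the function name is illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

def hd3_hd2_hd1(x, rng):
    """Apply the structured map H D3 H D2 H D1 to x, where H is the
    normalized Hadamard matrix and each D is a random +/-1 diagonal.
    The input dimension must be a power of two."""
    d = x.size
    H = hadamard(d) / np.sqrt(d)
    for _ in range(3):
        x = H @ (rng.choice([-1.0, 1.0], size=d) * x)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
y = hd3_hd2_hd1(x, rng)
print(np.linalg.norm(x), np.linalg.norm(y))  # the map is an isometry
```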
    Manifold Regularization for Kernelized LSTD
    Xinyan Yan
    Byron Boots
    1st Annual Conference on Robot Learning (CoRL 2017) (to appear)
    Preview abstract Policy evaluation or value function approximation is a key procedure in reinforcement learning (RL). It is a necessary component of policy iteration and can be used for variance reduction in policy gradient methods. Therefore, its quality has a significant impact on most RL algorithms. Motivated by manifold regularized learning, we propose a novel kernelized policy evaluation method that takes advantage of the intrinsic geometry of the state space learned from data, in order to achieve better sample efficiency and higher accuracy in action-state value function approximation. Applying the proposed method in the Least-Squares Policy Iteration (LSPI) framework, we observe superior performance compared to widely used parametric basis functions on two standard benchmarks in terms of policy quality. View details
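    For orientation, a bare-bones LSTD(0) solver with a fixed feature map, i.e. the classical estimator this work builds on; the kernelized, manifold-regularized version proposed in the paper is not reproduced here.

```python
# Plain LSTD(0): given features phi(s), phi(s') and rewards r over sampled
# transitions, solve Phi^T (Phi - gamma * Phi') w = Phi^T r for value weights w.
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.99, ridge=1e-6):
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)
```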
    Quantization based Fast Inner Product Search
    David Simcha
    Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, JMLR.org, pp. 482-490
    Preview abstract We propose a quantization based approach for fast approximate Maximum Inner Product Search (MIPS). Each database vector is quantized in multiple subspaces via a set of codebooks, learned directly by minimizing the inner product quantization error. Then, the inner product of a query to a database vector is approximated as the sum of inner products with the subspace quantizers. Different from recently proposed LSH approaches to MIPS, the database vectors and queries do not need to be augmented in a higher dimensional feature space. We also provide a theoretical analysis of the proposed approach, consisting of concentration results under mild assumptions. Furthermore, if a small sample of example queries is given at training time, we propose a modified codebook learning procedure which further improves the accuracy. Experimental results on a variety of datasets including those arising from deep neural networks show that the proposed approach significantly outperforms the existing state-of-the-art. View details
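    A hedged, PQ-style sketch of the quantized inner-product idea: vanilla k-means codebooks per subspace and lookup-table scoring. The paper instead learns codebooks by directly minimizing inner-product quantization error, which this toy version does not do.

```python
# Product-quantization-style approximate inner products: quantize each subspace
# of the database vectors with k-means, then score a query by summing its inner
# products with the assigned subspace codewords. Assumes the dimension divides
# evenly into num_subspaces.
import numpy as np
from scipy.cluster.vq import kmeans2

def train_pq(X, num_subspaces=4, num_codes=256):
    d = X.shape[1] // num_subspaces
    codebooks, codes = [], []
    for s in range(num_subspaces):
        sub = X[:, s * d:(s + 1) * d]
        cb, assign = kmeans2(sub, num_codes, minit='points')
        codebooks.append(cb)
        codes.append(assign)
    return codebooks, np.stack(codes, axis=1)       # per-vector code indices

def approx_inner_products(q, codebooks, codes):
    d = len(q) // len(codebooks)
    # One lookup table of <q_s, codeword> per subspace, then sum over subspaces.
    tables = [cb @ q[s * d:(s + 1) * d] for s, cb in enumerate(codebooks)]
    return sum(tables[s][codes[:, s]] for s in range(len(codebooks)))
```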
    Binary embeddings with structured hashed projections
    Anna Choromanska
    Mariusz Bojarski
    Tony Jebara
    Yann LeCun
    Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR.org, pp. 344-353
    Preview abstract We consider the hashing mechanism for constructing binary embeddings that involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix in which not all entries are independent random variables; instead, a fixed "budget of randomness" is distributed across the matrix. Such matrices can be stored efficiently in sub-quadratic or even linear space, reduce randomness usage (i.e. the number of required random values), and very often lead to computational speed-ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first to give theoretical grounding for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and establish the plausibility of an approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings also improve the understanding of certain deep architectures that contain randomly weighted, untrained layers and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show how learning via structured hashed projections affects the performance of neural network and nearest neighbor classifiers. View details
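    A minimal sketch of one member of the structured family analyzed here: a circulant projection with random sign flips followed by the sign nonlinearity, computed via FFT. Hamming distances between the resulting codes approximate angular distances between inputs, up to the usual sign-random-projection constant.

```python
# Circulant sign hashing: (C x)_i = sum_j c_{(i-j) mod d} x_j computed by FFT,
# after a random diagonal sign flip; codes are the signs of the projection.
import numpy as np

def circulant_sign_hash(x, c, signs):
    y = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(signs * x), n=len(x))
    return np.sign(y).astype(np.int8)

d = 256
rng = np.random.default_rng(0)
c = rng.standard_normal(d)                 # first column of the circulant matrix
signs = rng.choice([-1.0, 1.0], size=d)    # random diagonal sign matrix
code = circulant_sign_hash(rng.standard_normal(d), c, signs)
```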
    Preview abstract We present an intriguing discovery related to Random Fourier Features: in Gaussian kernel approximation, replacing the random Gaussian matrix by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for this behavior. Motivated by this discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from O(d^2) to O(d log d), where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of applications. View details
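    A small sketch of the ORF construction described above, assuming a single square block of features: orthogonalize a Gaussian matrix by QR, restore chi-distributed row norms, and use cos/sin features. SORF's Hadamard-diagonal speedup is not shown.

```python
# Orthogonal Random Features for the Gaussian kernel exp(-||x-y||^2 / (2 sigma^2)):
# the feature map phi satisfies phi(x).phi(y) ~= k(x, y).
import numpy as np

def orf_features(X, sigma=1.0, rng=None):
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    G = rng.standard_normal((d, d))
    Q, _ = np.linalg.qr(G)                        # orthogonal directions
    norms = np.sqrt(rng.chisquare(d, size=d))     # restore Gaussian-like row norms
    W = (norms[:, None] * Q) / sigma
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(d)
```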
    Recycling Randomness with Structure for Sublinear time Kernel Expansions
    Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR.org, pp. 2502-2510
    Preview abstract We propose a scheme for recycling Gaussian random vectors into structured matrices to approximate various kernel functions in sublinear time via random embeddings. Our framework includes the Fastfood construction of Le et al. (2013) as a special case, but also extends to Circulant, Toeplitz and Hankel matrices, and the broader family of structured matrices that are characterized by the concept of low displacement rank. We introduce notions of coherence and graph-theoretic structural constants that control the approximation quality, and prove unbiasedness and low-variance properties of random feature maps that arise within our framework. For the case of low-displacement matrices, we show how the degree of structure and randomness can be controlled to reduce statistical variance at the cost of increased computation and storage requirements. Empirical results strongly support our theory and justify the use of a broader family of structured matrices for scaling up kernel methods using random features. View details
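    A tiny sketch of the recycling idea for the circulant special case: a single Gaussian vector supplies all rows of the projection matrix used by a random Fourier feature map. The broader low-displacement-rank family and the variance analysis from the paper are not covered.

```python
# Every row of circulant(g) is a rotation of the same Gaussian vector g, so one
# vector of randomness yields d (dependent) projection directions for RFF.
import numpy as np
from scipy.linalg import circulant

def circulant_rff(X, sigma=1.0, rng=None):
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    g = rng.standard_normal(d)                # the single recycled Gaussian vector
    W = circulant(g) / sigma                  # d directions sharing g's entries
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(d)
```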
    Preview abstract We give the first O(1/sqrt{T})-error online algorithm for reconstructing noisy statistical databases, where T is the number of (online) sample queries received. The algorithm is optimal up to a poly(log(T)) factor in terms of the error and requires only O(log T) memory. It aims to learn a hidden database-vector w* in R^d in order to accurately answer a stream of queries regarding the hidden database, which arrive in an online fashion from some unknown distribution D. We assume the distribution D is defined on the neighborhood of a low-dimensional manifold. The presented algorithm runs in O(dD)-time per query, where d is the dimensionality of the query-space. Contrary to the classical setting, there is no separate training set used by the algorithm to learn the database — the stream on which the algorithm is evaluated must also be used to learn the database-vector. The algorithm only has access to a binary oracle O that answers whether a particular linear function of the database-vector plus random noise is larger than a threshold specified by the algorithm. We note that we allow a significant O(D) amount of noise to be added, while other works focused on the low-noise o(sqrt{D}) setting. For a stream of T queries, our algorithm achieves an average error O(1/sqrt{T}) by filtering out random noise, adapting the threshold values given to the oracle based on its previous answers and, as a consequence, recovering with high precision a projection of the database-vector w* onto the manifold defining the query-space. Our algorithm may also be applied in the adversarial machine learning context to compromise machine learning engines by exploiting the vulnerabilities of systems that output only a binary signal in the presence of significant noise. View details
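    A toy illustration of the binary threshold-oracle access model described above (not the paper's O(1/sqrt{T}) online algorithm): recover a single query answer by noisy binary search over thresholds, with majority voting to filter the additive noise. The interval bounds and vote counts are arbitrary assumptions for the example.

```python
# Illustration of the oracle model: O(q, t) reports whether <w*, q> + noise > t.
# We estimate <w*, q> by bisecting over t, voting to suppress the noise.
import numpy as np

def noisy_oracle(w_star, q, threshold, noise_scale, rng):
    return float(np.dot(w_star, q) + noise_scale * rng.standard_normal() > threshold)

def estimate_answer(w_star, q, lo=-10.0, hi=10.0, votes=25, iters=30,
                    noise_scale=1.0, rng=None):
    rng = np.random.default_rng(rng)
    for _ in range(iters):                      # assumes <w*, q> lies in [lo, hi]
        mid = 0.5 * (lo + hi)
        above = sum(noisy_oracle(w_star, q, mid, noise_scale, rng)
                    for _ in range(votes)) > votes / 2
        lo, hi = (mid, hi) if above else (lo, mid)
    return 0.5 * (lo + hi)
```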
    Fast nonlinear embeddings via structured matrices
    Francois Fagan
    CoRR, vol. abs/1604.07356 (2016)
    Differentially-private learning of low dimensional manifolds
    Anna Choromanska
    Geetha Jagannathan
    Claire Monteleoni
    Theor. Comput. Sci., vol. 620 (2016), pp. 91-104
    Binary embeddings with structured hashed projections
    Anna Choromanska
    Mariusz Bojarski
    Tony Jebara
    Yann LeCun
    CoRR, vol. abs/1511.05212 (2015)
    Efficient data hashing with structured binary embeddings
    CoRR, vol. abs/1505.03190 (2015)
    An $\tilde{O}(\frac{1}{\sqrt{T}})$-error online algorithm for retrieving heavily perturbated statistical databases in the low-dimensional querying mode
    EH-suprema of tournaments with no nontrivial homogeneous sets
    J. Comb. Theory, Ser. B, vol. 114 (2015), pp. 97-123
    Quantization based Fast Inner Product Search
    David Simcha
    CoRR, vol. abs/1509.01469 (2015)
    Coloring tournaments with forbidden substructures
    Tony Jebara
    CoRR, vol. abs/1504.01119 (2015)
    Learning how to rank from heavily perturbed statistics - digraph clustering approach
    CoRR, vol. abs/1504.01118 (2015)
    Fast Online Clustering with Randomized Skeleton Sets
    Xiaofeng Liu
    CoRR, vol. abs/1506.03425 (2015)
    Forcing large transitive subtournaments
    Eli Berger
    Maria Chudnovsky
    J. Comb. Theory, Ser. B, vol. 112 (2015), pp. 1-17
    Notes on using Determinantal Point Processes for Clustering with Applications to Text Clustering
    Apoorv Agarwal
    Anna Choromanska
    CoRR, vol. abs/1410.6975 (2014)
    Differentially- and non-differentially-private random decision trees
    Mariusz Bojarski
    Anna Choromanska
    Yann LeCun
    CoRR, vol. abs/1410.6973 (2014)
    Tournaments with near-linear transitive subsets
    Maria Chudnovsky
    Paul D. Seymour
    J. Comb. Theory, Ser. B, vol. 109 (2014), pp. 228-249
    Upper Bounds for Erdős-Hajnal Coefficients of Tournaments
    Journal of Graph Theory, vol. 74 (2013), pp. 122-132
    Adaptive Anonymity via b-Matching
    Tony Jebara
    Kui Tang
    NIPS (2013), pp. 3192-3200
    Differentially-Private Learning of Low Dimensional Manifolds
    Anna Choromanska
    Geetha Jagannathan
    Claire Monteleoni
    ALT (2013), pp. 249-263
    Tournaments and colouring
    Eli Berger
    Maria Chudnovsky
    Jacob Fox
    Martin Loebl
    Alex Scott
    Paul D. Seymour
    Stéphan Thomassé
    J. Comb. Theory, Ser. B, vol. 103 (2013), pp. 1-20
    The power of the Dinur-Nissim algorithm: breaking privacy of statistical and graph databases
    Tal Malkin
    PODS (2012), pp. 65-76