Jump to Content

Manzil Zaheer

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training data by a nonparametric component. Remarkably, retrieval-based methods have enjoyed success on a wide range of problems, ranging from standard natural language processing and vision tasks to protein folding, as demonstrated by many recent efforts, including WebGPT and AlphaFold. Despite a growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. In this paper, we present a formal treatment of retrieval-based models to characterize their generalization ability. In particular, we focus on two classes of retrieval-based classification approaches: First, we analyze a local learning framework that employs an explicit local empirical risk minimization based on retrieved examples for each input instance. Interestingly, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy. The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task. View details
    Teacher Guided Training: An Efficient Framework for Knowledge Transfer
    Chong You
    Himanshu Jain
    Rob Fergus
    International Conference on Learning Representations (2023) (to appear)
    Preview abstract The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we devise teacher-guided training (TGT) framework for training a high-quality compact model that leverages the knowledge acquired by pre-trained \emph{generative} models while obviating the need to go through a large volume of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower dimensional manifold than the ambient space. Furthermore, we can use the teacher to explore the instance space more efficiently through sampling or gradient-based methods; thus, making TGT especially attractive for limited data or long-tail settings. We formally capture this benefit of proposed data-domain exploration in our generalization bounds. Among our empirical evaluations, we find that TGT can improve accuracy on ImageNet-LT by 10% compared to natural baseline and match accuracy on sentiment analysis on Amazon reviews without the need for pretraining. View details
    Preview abstract When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, what we can measure is whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize. We first characterize several different axes along which program synthesis methods would be desired to generalize, e.g., length generalization, or the ability to combine known subroutines in new ways that do not occur in the training data. Based on this characterization, we introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets, SCAN and RobustFill. Finally, we make first attempts to improve the compositional generalization ability of Transformer models along these axes through novel attention mechanisms that draw inspiration from a human-like decomposition strategy. Empirically, we find our modified Transformer models generally perform better than natural baselines, but the tasks remain challenging. View details
    A Fourier Approach to Mixture Learning
    Mingda Qiao
    Guru Prashanth Guruganesh
    Avinava Dubey
    Conference on Neural Information Processing Systems (2022) (to appear)
    Preview abstract We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\N(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \R^d$ up to a small error. The hardness of this learning problem can be measured by the \emph{separation} $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\poly(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$. In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. This was previously open even in one dimension. More generally, our algorithm runs in time $\poly(k)\cdot f(d, \Delta, \eps)$, and is thus fixed-parameter tractable in parameters $d$, $\Delta$ and $\eps$. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution. View details
    A Context Integrated Transformer-based Neural Network for Auction Design
    Zhijian Duan
    Jingwu Tang
    Yutong Yin
    Xiang Yan
    Xiaotie Deng
    The Thirty-ninth International Conference on Machine Learning (ICML'22) (2022)
    Preview abstract One of the central problems in auction design is to develop an incentive compatible mechanism that maximizes the expected revenue. While theoretical approaches have encountered bottlenecks for multi-item auctions, recently there are many progresses of finding optimal auction through deep learning. However, such works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome this limitation by factoring \emph{public} contextual information of bidders and items into deep learning framework. We propose $\mathtt{CITransNet}$, a context integrated transformer-based neural network for contextual auction design, which maintains permutation-equivariance over bids while being able to handle asymmetric contextual information in auctions. We show by extensive experiments that $\mathtt{CITransNet}$ can recover the known optimal analytical solutions, obtain novel mechanisms for complex multi-item auctions, and generalize to settings different from training set. View details
    Thompson Sampling with a Mixture Prior
    Joey Hong
    Branislav Kveton
    Mohammad Ghavamzadeh
    Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AI-Stats-22) (2022), pp. 7565-7586
    Preview abstract We consider posterior sampling in online decision-making problems where the uncertain environment is sampled from a mixture distribution. We incorporate this structure in a natural way by initializing a Thompson sampling algorithm with a mixture prior. We provide a novel, general outline for analyzing the regret of Thompson sampling with a mixture prior. We also use this to derive Bayes regret bounds in both a linear bandit and tabular MDP settings. The regret bounds depend on the confidence widths of each component of the mixture prior, and converge to solely identifying the correct component when confidence widths are small. Finally, we demonstrate the empirical effectiveness of our proposed algorithm in a synthetic and real-world bandit problem involving multi-task image classification. View details
    A Field Guide to Federated Optimization
    Jianyu Wang
    Gauri Joshi
    Maruan Al-Shedivat
    Galen Andrew
    A. Salman Avestimehr
    Katharine Daly
    Deepesh Data
    Suhas Diggavi
    Hubert Eichner
    Advait Gadhikar
    Antonious M. Girgis
    Filip Hanzely
    Chaoyang He
    Samuel Horvath
    Martin Jaggi
    Tara Javidi
    Sai Praneeth Karimireddy
    Jakub Konečný
    Sanmi Koyejo
    Tian Li
    Peter Richtarik
    Virginia Smith
    Mahdi Soltanolkotabi
    Weikang Song
    Sebastian Stich
    Ameet Talwalkar
    Hongyi Wang
    Blake Woodworth
    Honglin Yuan
    Mi Zhang
    Tong Zhang
    Chunxiang (Jake) Zheng
    Chen Zhu
    arxiv (2021)
    Preview abstract Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications. View details
    Scalable Hierarchical Agglomerative Clustering
    Nick Monath
    Avinava Dubey
    Guru Prashanth Guruganesh
    Andrew McCallum
    Gokhan Mergen
    Mert Terzihan
    Bryon Tjanaka
    Yuchen Wu
    Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2021), 1245–1255
    Preview abstract The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method’s scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art. View details
    Meta-Thompson Sampling
    Branislav Kveton
    Michael Konobeev
    Martin Mladenov
    Proceedings of the 38th International Conference on Machine Learning (ICML 2021), pp. 5884-5893
    Preview abstract Efficient exploration in multi-armed bandits is a fundamental online learning problem. In this work, we propose a variant of Thompson sampling that learns to explore over time by interacting with problem instances sampled from an unknown prior distribution. This algorithm meta-learns the prior and therefore we call it Meta-TS. We propose efficient implementations of Meta-TS and analyze it in Gaussian bandits. Our analysis captures the improvement due to learning the prior and is of a broader interest, because we derive the first prior-dependent upper bound on the Bayes regret. Our regret bound is complemented by empirical evaluation, which shows that Meta-TS quickly adapts to the unknown prior. View details
    Non-Stationary Off-policy Optimization
    Joey Hong
    Branislav Kveton
    International Conference on Artificial Intelligence and Statistics (AISTATS) (2021)
    Preview abstract Off-policy learning is a framework for estimating the value of and optimizing policies offline from logged data without deploying them. Real-world environments are nonstationary, and the optimized policies should be able to adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary environments. Our key idea is to use a change-point detector to partition the logged data into categorical latent states, then find a near-optimal policy conditioned on latent state. We derive high-probability bounds on our off-policy estimates and optimization. Furthermore, we also propose a practical approach to deploy our policy online and evaluate our approach comprehensively on a real-world clickstream dataset. View details
    DAG-structured Clustering by Nearest-Neighbors
    Nicholas Monath
    Avinava Dubey
    Andrew McCallum
    International Conference on Artificial Intelligence and Statistics (2021)
    Preview abstract Hierarchical clusterings compactly encode multiple granularities of clusters within a tree structure. Hierarchies, by definition, fail to capture different flat partitions that are not subsumed in one another. In this paper, we advocate for an alternative structure for representing multiple alternative clusterings, a directed acyclic graph (DAG). By allowing nodes to have multiple parents, DAG structure is not only more flexible than a tree but also allows for points to be members of multiple clusters. We describe a large-scale, map-reduce-based algorithm to infer these DAGs. Our algorithm works by simply merging nearest neighbor substructures to form a DAG structure. Our algorithm is supported with theoretical guarantees showing its representational capacity over tree-based algorithms. Further, we provide comprehensive empirical experiments on large-scale clustering benchmarks and entity resolution datasets. Our results show that our method is as scalable as and more accurate than state-of-the-art tree-based techniques. View details
    Exact and Approximate Hierarchical Clustering Using A*
    Craig Greenberg
    Sebastian Macaluso
    Nicholas Monath
    Avinava Dubey
    Patrick Flaherty
    Kyle Cranmer
    Andrew McCallum
    Uncertainty in Artificial Intelligence (2021)
    Preview abstract Hierarchical clustering is known to be broadly applicable in myriad domains. Despite its extensive use, existing approximate inference methods are insufficient for applications that require either exact or high-quality approximate solutions.For example, in high energy physics, we are interested in discovering high-quality jet structures, which are hierarchical clusterings of particles.In this paper we view inference as a search problem and focus on inferring high-quality hierarchies for a given cost function, rather than using ad hoc (e.g., greedy or beam) methods. This leads naturally to the use of A*, which has seldom been used for clustering (with the notable exception of \citep{daume2007fast}). Unlike ad hoc search methods, A* carries with it optimality guarantees. However, applying A* search naively leads to a large space and time complexity. To address this challenge, we develop a novel augmented trellis data structure and dynamic programming algorithm for A* that result in substantially improved time and space complexity bounds while still computing the globally optimal hierarchical clustering. We demonstrate that our proposed method increases the number of points for which an exact solution can be found by 25$\%$ compared with previous work \cite{greenberg2020compact}. Furthermore, our approach yields a natural approximation that scales to larger datasets and achieves substantially higher quality results than ad hoc search baselines, motivating its use in applications demanding exact or high-quality approximations. View details
    Latent Programmer: Discrete Latent Codes for Program Synthesis
    Joey Hong
    David Martin Dohan
    Rishabh Singh
    International Conference on Machine Learning (ICML) (2021)
    Preview abstract In many sequence learning tasks, such as program synthesis and document summarization, a key problem is searching over a large space of possible output sequences. We propose to learn representations of the outputs that is specifically meant for search: rich enough to specify the desired output but compact enough to make search more efficient. An appealing realization of such representation are discrete latent codes, as this naturally allows sophisticated combinatorial search strategies. The latent codes are learned using a self-supervised learning principle, in which first a discrete autoencoder is trained on the output sequences, and then the resulting latent codes are used as intermediate targets for the end-to-end sequence prediction task. Based on these insights, we introduce the Latent Programmer, a program synthesis method that first predicts a discrete latent codes from input/output examples, and then generates the program in the target language. We evaluate the Latent Programmer on two domains: synthesis of string transformation programs, and generation of programs from natural language descriptions. We demonstrate that the discrete latent representation significantly improves synthesis accuracy. View details
    Preview abstract Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Yogi and Adam, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can improve the performance of federated learning. View details
    Big Bird: Transformers for Longer Sequences
    Guru Prashanth Guruganesh
    Avinava Dubey
    Joshua Ainslie
    Santiago Ontanon
    Anirudh Ravula
    Qifan Wang
    NeurIPS (2020)
    Preview abstract Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (in terms of memory mainly) on the sequence length due to their full attention mechanism. To remedy this, we propose, \emph{BigBird}, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that \emph{BigBird} is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis demonstrates the need for having an O(1) global tokens, such as CLS, that attend to the entire sequence as part of the sparse attentions. We show that the proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, \emph{BigBird} drastically improves performance on various NLP tasks such as question answering. View details
    Differentiable Meta-Learning of Bandit Policies
    Branislav Kveton
    Martin Mladenov
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 2122-2134
    Preview abstract Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is pleasantly general and easy to implement. We derive effective gradient estimators and propose novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances. View details
    Randomized Exploration in Generalized Linear Bandits
    Branislav Kveton
    Lihong Li
    Mohammad Ghavamzadeh
    23rd International Conference on Artificial Intelligence and Statistics (2020)
    Preview abstract We study two randomized algorithms for generalized linear bandits. The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards. We analyze both algorithms and derive $\tilde{O}(d \sqrt{n \log K})$ upper bounds on their $n$-round regret, where $d$ is the number of features and $K$ is the number of arms. The former improves on prior work while the latter is the first for Gaussian noise perturbations in non-linear models. We empirically evaluate both GLM-TSL and GLM-FPL in logistic bandits, and apply GLM-FPL to neural network bandits. Our work showcases the role of randomization, beyond posterior sampling, in exploration. View details
    Latent Bandits Revisited
    Joey Hong
    Branislav Kveton
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 13423-13433
    Preview abstract A latent bandit is a bandit problem where the learning agent knows reward distributions of arms conditioned on an unknown discrete latent state. The goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning, where complex models can be learned offline and the agent identifies the latent state online. This is of high practical relevance, for instance in recommender systems. In this work, we propose general algorithms for latent bandits, based on both upper confidence bounds and Thompson sampling. The algorithms are contextual, and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than actions. A comprehensive empirical study showcases the advantages of our approach. View details
    Preview abstract Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transformers forget specific old facts and memorize new ones. In this paper, we propose a new task of \emph{explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts}. This task is useful in many scenarios, such as updating stale knowledge, protecting privacy, and eliminating unintended biases stored in the models. We benchmarked several approaches that provide natural baseline performances on this task. This leads to the discovery of key components of a Transformer model that are especially effective for knowledge modifications. The work also provides insights into the role that different training phases (such as pretraining and fine-tuning) play towards memorization and knowledge modification. View details
    DIFFERENTIABLE MULTI-HOP REASONING OVER A VIRTUAL KNOWLEDGE BASE
    Bhuwan Dhingra
    Graham Neubig
    Ruslan Salakhutdinov
    Vidhisha Balachandran
    ICLR (2020) (to appear)
    Preview abstract We wish to put forward an approach for accessing text as a knowledge base which is useful for question-answering (QA). This approach relies centrally on development of a differentiable operator which allows us to traverse textual data like a ``virtual'' KB. The core of the approach is a neural module that inputs and outputs sets of entities: in particular, this module uses maximum inner product search (MIPS) on a special index to map a set of entities $X$ to all entities $Y$ related to something in $X$ (by some specified relations), as witnessed by some text in the corpus. For multi-hop questions, the set of output entities $Y$ can be again used recursively as the input to a second copy of the module, enabling us to answer complex questions. This module is differentiable, so the full system can be trained completely end-to-end using gradient based methods. Thus, we name it DrKIT: Differentiable Reasoning over a virtual Knowledge base of Indexed Text. We describe a pretraining scheme for the index mention encoder by generating hard negative examples using existing knowledge bases, and we show that DrKIT improves accuracy by $9$ points on 3-hop questions in the MetaQA dataset, cutting the gap between text-based and KB-based methods by $70\%$. DrKIT is also very efficient, processing 10x more queries per second than existing state-of-the-art multi-hop QA systems. View details
    Preview abstract Learning continuous representations of discrete objects such as text, sentences, users, and movies lies at the heart of many applications including involving text and user modeling. Unfortunately, traditional methods that embed all objects do not scale to large vocabulary sizes and embedding dimensions. In this paper, we propose a general method, Anchor & Transform (ANT) that learns sparse representations of discrete objects by jointly learning a small set of \textit{anchor embeddings} and a \textit{sparse transformation} from anchor objects to all objects. ANT is scalable, flexible, end-to-end trainable, and allows the user to easily incorporate domain knowledge about object relationships. ANT also recovers several task-specific baselines under certain structural assumptions on the anchor embeddings and transformation matrices. On several benchmarks involving text and user modeling, ANT demonstrates strong performance with respect to accuracy and sparsity. View details
    Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space
    Nick Monath
    Daniel Silva
    Andrew McCallum
    The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19) (2019)
    Preview abstract Hierarchical clustering is typically performed using algorithmic-based optimization searching over the discrete space of trees. While these optimization methods are often effective, their discreteness restricts them from many of the benefits of their continuous counterparts, such as scalable stochastic optimization and the joint optimization of multiple objectives or components of a model (e.g.end-to-end training). In this paper, we present an approach for hierarchical clustering that searches over continuous representations of trees in hyperbolic space by running gradient descent. We compactly represent uncertainty over tree structures with vectors in the Poincaré ball. We show how the vectors can be optimized using an objective related to recently proposed cost functions for hierarchical clustering. Using our method with a mini-batch stochastic gradient descent inference procedure, we are able to outperform prior work on clustering millions of ImageNet images by 15 points of dendrogram purity. Further, our continuous tree representation can be jointly optimized in multi-task learning applications offering a 9 point improvement over baseline methods View details
    Towards Gradient Free and Projection Free Stochastic Optimization
    Anit Kumar Sahu
    Soummya Kar
    AISTATS in Proceedings of Machine Learning Research (2019)
    Preview abstract This paper focuses on the problem of constrained stochastic optimization. A zeroth order Frank-Wolfe algorithm is proposed, which in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm makes it gradient free. Under convexity and smoothness assumption, we show that the proposed algorithm converges to the optimal objective function at a rate O(1/T^1/3), where T denotes the iteration count. In particular, the primal sub-optimality gap is shown to have a dimension dependence of O(d^1/3), which is the best known dimension dependence among all zeroth order optimization algorithms with one directional derivative per iteration. For non-convex functions, we obtain the Frank-Wolfe gap to be O(d^1/3 / T^−1/4). Experiments on black-box optimization setups demonstrate the efficacy of the proposed algorithm. View details
    Uncovering Hidden Structure in Sequence Data via Threading Recurrent Models
    Daniel Silva
    Yuchen Wu
    Shibani Sanan
    Surojit Chatterjee
    Proceedings of the 12 ACM International Conference on Web Search and Data Mining (2019), pp. 186-194
    Preview abstract Long Short-Term Memory (LSTM) is one of the most powerful sequence models for user browsing history [17, 22] or natural language text [19]. Despite the strong performance, it has not gained popularity for user-facing applications, mainly owing to a large number of parameters and lack of interpretability. Recently Zaheer et al. [25] introduced latent LSTM Allocation (LLA) to address these problems by incorporating topic models with LSTM, where the topic model maps observed words in each sequence to topics that evolve using an LSTM model. In our experiments, we found the resulting model, although powerful and interpretable, to show shortcomings when applied to sequence data that exhibit multi-modes of behaviors with abrupt dynamic changes. To address this problem we introduce thLLA: a threading LLA model. thLLA has the ability to break each sequence into a set of segments and then model the dynamic in each segment using an LSTM mixture. In that way, thLLA can model abrupt changes in sequence dynamics and provides a better fit for sequence data while still being interpretable and requiring fewer parameters. In addition, thLLA uncovers hidden themes in the data via its dynamic mixture components. However, such generalization and interpretability come at a cost of complex dependence structure, for which inference would be extremely non-trivial. To remedy this, we present an efficient sampler based on particle MCMC method for inference that can draw from the joint posterior directly. Experimental results confirm the superiority of thLLA and the stability of the new inference algorithm on a variety of domains. View details
    Preview abstract This paper introduces a new framework for open-domain question answering in which the retriever and the reader iteratively interact with each other. The framework is agnostic to the architecture of the machine reading model provided it has access to the token-level hidden representations of the reader. The retriever uses fast nearest neighbor search algorithms that allow it to scale to corpora containing millions of paragraphs. A gated recurrent unit updates the query at each step conditioned on the “state” of the reader and the “reformulated” query is used to re-rank the paragraphs by the retriever. We show the efficacy of our architecture by achieving state-of-the-art results (9.5% relative increase) on TriviaQA-unfiltered and we achieve competitive performance on other large open domain datasets such as QUASAR-T, SEARCHQA, and SQUAD-open. We conduct analysis and show that iterative interaction helps in retrieving useful paragraphs from the corpus. Finally, we show that our multi-step-reasoning framework brings uniform improvements when applied to two widely used reader architectures – Dr.QA and BiDAF. View details
    Preview abstract Adaptive gradient methods that rely on scaling gradients down by the square root of exponential moving averages of past squared gradients, such RMSPROP, ADAM, ADADELTA have found wide application in optimizing the non-convex problems that arise in deep learning. However, it has been recently demonstrated that such methods can fail to converge even in simple convex optimization settings. In this work, we provide a new analysis of such methods applied to nonconvex stochastic optimization problems, characterizing the effect of increasing minibatch size. Our analysis shows that under this scenario such methods do converge to stationarity up to the statistical limit of variance in the stochastic gradients (scaled by a constant factor). In particular, our result implies that increasing minibatch sizes enables convergence, thus providing a way to circumvent the non-convergence issues. Furthermore, we provide a new adaptive optimization algorithm, YOGI, which controls the increase in effective learning rate, leading to even better performance with similar theoretical guarantees on convergence. Extensive experiments show that YOGI with very little hyperparameter tuning outperforms methods such as ADAM in several challenging machine learning tasks View details
    State Space LSTM Models with Particle MCMC Inference
    Xun Zheng
    Alex J. Smola
    Eric Xing
    The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) workshop on Bayesian Deep Learning. (2017)
    Preview abstract Long Short-Term Memory (LSTM) is one of the most powerful sequence models. Despite the strong performance, however, it lacks the nice interpretability as in state space models. In this paper, we present a way to combine the best of both worlds by introducing State Space LSTM (SSL) models that generalizes the earlier work Zaheer et al. (2017) of combining topic models with LSTM. However, unlike Zaheer et al. (2017), we do not make any factorization assumptions in our inference algorithm. We present an efficient sampler based on sequential Monte Carlo (SMC) method that draws from the joint posterior directly. Experimental results confirms the superiority and stability of this SMC inference algorithm on a variety of domains. View details
    Preview abstract Recurrent neural network, such as Long-short term memory (LSTM), are powerful tools for modeling sequential data, however, they lack interpretability and requires large num- ber of parameters. On the other hand, topic models, such as Latent Dirichlet Allocation (LDA), are powerful tools for uncovering the hidden structure in a document collection, however, they lack the same strong predictive power as deep models. In this paper we bridge the gap between such mod- els and propose Latent LSTM Allocation (LLA). In LLA each document is modeled as a sequence of words, and the model jointly groups words into topics and learns the tempo- ral dynamics over the sequence. Our model is interpretable, concise and can capture intricate dynamics. We give an ef- ficient MCMC-EM inference algorithm for our model that scales to millions of documents. Our experimental evalu- ations shows that the proposed model compares favorably with several state-of-the-art baselines. View details
    Canopy --- Fast Sampling with Cover Trees
    Satwik Kottur
    Jose Moura
    Alex J. Smola
    ICML 2017 (2017)
    Preview abstract Hierarchical Bayesian models often capture distributions over a very large number of distinct atoms. The need for this arises when organizing huge amount of unsupervised data, for instance, features extracted using deep convnets can be exploited to organize abundant unlabeled images. Inference for hierarchical Bayesian models in such cases can be rather nontrivial, leading to approximate approaches. In this work, we propose a sampler based on Cover Trees that is exact and that has guaranteed runtime logarithmic in the number of atoms and is polynomial in the inherent dimensionality of the underlying parameter space. In other words, the algorithm is as fast as search over a hierarchical data structure and we demonstrate the effectiveness on both synthetic and real datasets, consisting of over 100 million images. View details
    No Results Found