Bo Dai

My research interests lie in designing principled machine learning methods. Currently, I mainly focus on three major themes:
  • Reinforcement learning: design effective algorithms by exploiting the intrinsic structures in the uncertain dynamics for automatic decision making.
  • Learning to design algorithms: improve the algorithms, e.g., sampling, searching and planning, by leveraging empirical experiences.
  • Structured input and output: build effective models for capturing the structural information in input and output, e.g., binaries, sequences, programs, trees, and graphs.
More information can be found in Google Scholar and my personal homepage.
Authored Publications
    Can Small Heads Help? Understanding and Improving Multi-Task Generalization
    Christopher Fifty
    Dong Lin
    Li Wei
    Lichan Hong
    Yuyan Wang
    the WebConf 2022 (2022)
    Preview abstract A goal for multi-task learning from a multi-objective optimization perspective is to find the Pareto solutions that are not dominated by others. In this paper, we provide some insights on understanding the trade-off between Pareto efficiency and generalization, as a result of parameterization in deep learning: as a multi-objective optimization problem, enough parameterization is needed for handling task conflicts in a constrained solution space; however, from a multi-task generalization perspective, over-parameterization undermines the benefit of learning a shared representation which helps harder tasks or tasks with limited training examples. A delicate balance between multi-task generalization and multi-objective optimization is therefore needed for finding a better trade-off between efficiency and generalization. To this end, we propose a method of under-parameterized self-auxiliaries for multi-task models to achieve the best of both worlds. It is model-agnostic, task-agnostic and works with other multi-task learning algorithms. Empirical results show our method improves Pareto efficiency over existing popular algorithms on several multi-task applications. View details
    Preview abstract Stochastic dual dynamic programming (SDDP) is one of the state-of-the-art algorithms for multi-stage stochastic optimization, yet its cost increases exponentially with the number of decision variables, so it quickly becomes inapplicable to high-dimensional problems. We introduce a neuralized component into SDDP, which outputs a piece-wise linear function in a low-dimensional space to approximate the value function, based on the context of the problem instances. The neuralized component continually evolves to abstract an effective low-dimensional action space and to improve the quality of value function approximation for each problem based on prior successful experiences. It is seamlessly integrated with SDDP to form our neural-enhanced solver, which achieves optimality without loss of accuracy and at faster speed for high-dimensional and long-horizon multi-stage stochastic optimization. We conduct thorough empirical experiments to demonstrate the benefits of the solver, from transferability to scalability. It significantly outperforms the competitors, including SDDP and variants of RL algorithms, in terms of solution quality, feasibility, and computational speed. View details
    Preview abstract Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only show a very weak correlation with the actual human evaluation in practice. To bridge such a gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances of off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores. View details
    Learning to Defend by Learning to Attack
    Haoming Jiang
    Zhehui Chen
    Yuyang Shi
    Tuo Zhao
    AISTATS 2021 (2021)
    Preview abstract Adversarial training provides a principled approach for training robust neural networks. From an optimization perspective, adversarial training is essentially solving a bilevel optimization problem. The leader problem is trying to learn a robust classifier, while the follower problem is trying to generate adversarial samples. Unfortunately, such a bilevel problem is difficult to solve due to its highly complicated structure. This work proposes a new adversarial training method based on a generic learning-to-learn (L2L) framework. Specifically, instead of applying existing hand-designed algorithms for the inner problem, we learn an optimizer, which is parametrized as a convolutional neural network. At the same time, a robust classifier is learned to defend against the adversarial attacks generated by the learned optimizer. Experiments over CIFAR-10 and CIFAR-100 datasets demonstrate that L2L outperforms existing adversarial training methods in both classification accuracy and computational efficiency. Moreover, our L2L framework can be extended to generative adversarial imitation learning and stabilize the training. View details
    Preview abstract Answering complex natural language questions on knowledge graphs (KGQA) is a challenging task. It requires reasoning with the input natural language questions as well as a massive, incomplete heterogeneous KG. Prior methods obtain an abstract structured query graph/tree from the input question and traverse the KG for answers following the query tree. However, they inherently cannot deal with missing links in the KG. Here we present LEGO, a Latent Execution-Guided reasOning framework to handle this challenge in KGQA. LEGO works in an iterative way, which alternates between (1) a Query Synthesizer, which synthesizes a reasoning action and grows the query tree step-by-step, and (2) a Latent Space Executor that executes the reasoning action in the latent embedding space to combat against the missing information in KG. To learn the synthesizer without step-wise supervision, we design a generic latent execution guided bottom-up search procedure to find good execution traces efficiently in the vast query space. Experimental results on several KGQA benchmarks demonstrate the effectiveness of our framework compared with previous state of the art. View details
    Preview abstract We study stochastic policy optimization in the on-policy case and make the following four contributions. First, we show that the ordering of optimization algorithms by their efficiency gets reversed depending on whether or not they have access to the true gradient information. In particular, this finding implies that, unlike in the true gradient scenario, geometric information cannot be easily exploited without detrimental consequences in stochastic policy optimization. Second, to explain these findings we introduce the concept of committal rate for stochastic policy optimization, and show that this can serve as a criterion for determining almost sure convergence to global optimality. Third, we show that if there is no external mechanism that allows an algorithm to determine the difference between optimal and sub-optimal actions using only on-policy samples, then there must be an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely. That is, an algorithm either converges to a globally optimal policy with probability 1 but at a rate no better than $O(1/t)$, or it achieves a faster than $O(1/t)$ convergence rate but then must fail to converge to the globally optimal deterministic policy with some positive probability. Finally, we use our committal rate theory to explain why practical policy optimization methods are sensitive to random initialization, and how an ensemble method with parallelism can be guaranteed to achieve near-optimal solutions with high probability. View details
    Preview abstract Classical global convergence results for first-order methods rely on uniform smoothness and the Łojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to Non-uniform Smoothness (NS) and Non-uniform Łojasiewicz inequality (NŁ). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to $O(e^{-t})$ while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings. View details
    On the Optimality of Batch Policy Optimization Algorithms
    Chenjun Xiao
    Yifan Wu
    Tor Lattimore
    Jincheng Mei
    Lihong Li
    ICML 2021 (2021)
    Preview abstract Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization. View details
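    The idea of a confidence-adjusted index described in the abstract above can be illustrated with a minimal sketch. It is not the paper's exact algorithm: the Hoeffding-style width, the function name `confidence_adjusted_choice`, and the parameters `beta` and `delta` are assumptions made for illustration. A positive `beta` gives an optimistic rule, a negative `beta` a pessimistic one, and `beta = 0` a neutral (greedy) one.

```python
import math

def confidence_adjusted_choice(means, counts, beta, delta=0.1):
    """Pick the arm maximizing (empirical mean + beta * confidence width).

    means  -- empirical mean reward of each arm in the batch data
    counts -- number of samples observed for each arm (all > 0)
    beta   -- > 0 optimistic, < 0 pessimistic, 0 neutral (greedy)
    delta  -- assumed confidence parameter for the Hoeffding-style width
    """
    best_arm, best_index = None, -math.inf
    for arm, (mean, n) in enumerate(zip(means, counts)):
        # Assumed width: shrinks as the arm is observed more often.
        width = math.sqrt(2.0 * math.log(1.0 / delta) / n)
        index = mean + beta * width
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm

# Arm 0 looks better on average but was sampled only twice.
means, counts = [0.6, 0.5], [2, 200]
optimistic = confidence_adjusted_choice(means, counts, beta=1.0)
neutral = confidence_adjusted_choice(means, counts, beta=0.0)
pessimistic = confidence_adjusted_choice(means, counts, beta=-1.0)
```

    In this toy batch, the optimistic and neutral rules select the rarely sampled arm, while the pessimistic rule prefers the well-covered one.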
    Preview abstract Retrosynthesis is the process of identifying a set of reactants to synthesize a target molecule. It is critical to material design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. However, the inner connections of these models are rarely discussed, and rigorous evaluations of these models are much needed. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified view establishes connections and reveals the differences between models, thereby enhancing our understanding of model design. We also provide a comprehensive assessment of performance to the community. Additionally, we present a novel dual variant within the framework that performs consistent training to induce the agreement between forward- and backward-prediction. This model improves the state-of-the-art of template-free methods with or without reaction types. View details
    Energy-Based Processes for Exchangeable Data
    Sherry Yang*
    International Conference on Machine Learning (2020)
    Preview abstract Recently there has been growing interest in modeling sets with exchangeability such as point clouds. A shortcoming of current approaches is that they restrict the cardinality of the sets considered or can only express limited forms of distribution over unobserved data. To overcome these limitations, we introduce Energy-Based Processes (EBPs), which extend energy based models to exchangeable data while allowing neural network parameterizations of the energy function. A key advantage of these models is the ability to express more flexible distributions over sets without restricting their cardinality. We develop an efficient training procedure for EBPs that demonstrates state-of-the-art performance on a variety of tasks such as point cloud generation, classification, denoising, and image completion. View details
    Differentiable Top-K Operator with Optimal Transport
    Yujia Xie
    Minshuo Chen
    Tuo Zhao
    Hongyuan Zha
    Wei Wei
    NeurIPS 2020
    Preview abstract Finding the k largest or smallest elements from a collection of scores, i.e., the top-k operation, is an important model component widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented in an algorithmic way, e.g., using the bubble algorithm, the resulting model cannot be trained in an end-to-end way using prevalent gradient descent algorithms. This is because these implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether each element belongs to the top-k set is essentially discontinuous. To address the issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of the EOT problem. We then apply the proposed operator to the k-nearest neighbors algorithm and the beam search algorithm. Numerical experiments demonstrate that they achieve improved performance. View details
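    A rough sketch of how a top-k indicator can be relaxed into an entropic optimal transport problem is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `soft_top_k`, the two anchor points 0 and 1, the min-max normalization, the squared-distance cost, and the values of `eps` and `n_iters` are all choices made here. It uses plain Sinkhorn iterations for the forward pass only and does not include the paper's gradient approximation.

```python
import numpy as np

def soft_top_k(scores, k, eps=0.05, n_iters=200):
    """Smoothed indicator of the k largest scores via entropic OT.

    Each score carries mass 1/n and is transported to one of two anchors:
    0 ("not selected", total mass (n-k)/n) or 1 ("selected", mass k/n).
    """
    s = np.asarray(scores, dtype=float)
    n = s.size
    # Normalize scores to [0, 1] so the cost scale matches eps (assumption).
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)
    anchors = np.array([0.0, 1.0])
    mu = np.full(n, 1.0 / n)                  # source marginal
    nu = np.array([(n - k) / n, k / n])       # target marginal
    cost = (s[:, None] - anchors[None, :]) ** 2
    kernel = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(2)
    for _ in range(n_iters):                  # Sinkhorn scaling updates
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]   # transport plan
    # Mass sent to the "selected" anchor, rescaled so the entries sum to k.
    return n * plan[:, 1]

indicator = soft_top_k([0.1, 0.9, 0.5, 0.8, 0.2], k=2)
```

    Smaller values of `eps` push the output closer to a hard 0/1 indicator, at the cost of slower and less numerically stable Sinkhorn iterations.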
    Preview abstract An important problem that arises in reinforcement learning and Monte Carlo methods is estimating quantities defined by the stationary distribution of a Markov chain. In many real-world applications, access to the underlying transition operator is limited to a fixed set of data that has already been collected, without additional interaction with the environment being available. We show that consistent estimation remains possible in this scenario, and that effective estimation can still be achieved in important applications. Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions, derived from fundamental properties of the stationary distribution, and exploiting constraint reformulations based on variational divergence minimization. The resulting algorithm, GenDICE, is straightforward and effective. We prove the consistency of the method under general conditions, provide a detailed error analysis, and demonstrate strong empirical performance on benchmark tasks, including off-line PageRank and off-policy policy evaluation. View details
    Escaping the Gravitational Pull of Softmax
    Jincheng Mei
    Chenjun Xiao
    Lihong Li
    Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
    Preview abstract The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, the softmax poses serious drawbacks for gradient descent optimization. We establish two negative results for this transform: (1) optimizing any expectation with respect to the softmax must exhibit extreme sensitivity to parameter initialization ("the softmax gravity well"), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence ("softmax damping"). Both findings are based on an analysis of convergence rates using the Łojasiewicz inequality. To circumvent these shortcomings we investigate an alternative transformation, the escort (p-norm) mapping, that demonstrates better optimization properties. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation. View details
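    The two mappings contrasted in the abstract above can be written side by side as a small sketch. The function names and the default `p = 2` are choices made here for illustration; the escort form used is the p-norm mapping, in which each probability is proportional to the absolute value of the parameter raised to the power p.

```python
import numpy as np

def softmax(theta):
    """Standard softmax: probabilities proportional to exp(theta)."""
    z = theta - np.max(theta)        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def escort(theta, p=2.0):
    """Escort (p-norm) mapping: probabilities proportional to |theta|^p.

    Undefined when every entry of theta is zero.
    """
    w = np.abs(theta) ** p
    return w / w.sum()

theta = np.array([1.0, 2.0, 2.0])
softmax_probs = softmax(theta)
escort_probs = escort(theta, p=2.0)
```

    Both functions return a valid categorical distribution. Unlike the softmax, the escort mapping can assign exactly zero probability to a component whose parameter is zero.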
    Preview abstract We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy’s value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods. View details
    Preview abstract Learning graph generative models is a challenging task for deep learning and has wide applicability to a range of domains like chemistry, biology and social science. However, current deep neural methods suffer from limited scalability: for a graph with $n$ nodes and $m$ edges, existing deep neural methods require $\Omega(n^2)$ complexity by building up the adjacency matrix. On the other hand, many real world graphs are actually sparse in the sense that $m \ll n^2$. Based on this, we develop a novel autoregressive model, named BiGG, that utilizes this sparsity to avoid generating the full adjacency matrix, and importantly reduces the graph generation time complexity to $O((n+m)\log n)$. Furthermore, during training this autoregressive model can be parallelized with $O(\log n)$ synchronization stages, which makes it much more efficient than other autoregressive models that require $\Omega(n)$. Experiments on several benchmarks show that the proposed approach not only scales to orders of magnitude larger graphs than previously possible with deep autoregressive graph generative models, but also yields better graph generation quality. View details
    Preview abstract We consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions. Classical simulation-based approaches assume access to the underlying process so that trajectories of sufficient length can be gathered to approximate stationary sampling. Instead, we consider an alternative setting where a fixed set of transitions has been collected beforehand, by a separate, possibly unknown procedure. The goal is still to estimate properties of the stationary distribution, but without additional access to the underlying system. We propose a consistent estimator that is based on recovering a correction ratio function over the given data. In particular, we develop a variational power method (VPM) that provides provably consistent estimates under general conditions. In addition to unifying a number of existing approaches from different subfields, we also find that VPM yields significantly better estimates across a range of problems, including queueing, stochastic differential equations, post-processing MCMC, and off-policy evaluation. View details
    Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration
    Rishabh Singh
    Advances in Neural Information Processing Systems (NeurIPS) (2020) (to appear)
    Preview abstract Discrete structures play an important role in applications like program language modeling, and software engineering. Current approaches to predicting complex structures typically consider autoregressive models for their tractability, with some sacrifice of the flexibility. Energy-based models (EBMs) on the other hand offer a more flexible and thus more powerful approach to modeling such distributions. Learning and inference with EBMs requires the partition function estimation which is intractable in general. For discrete structured data this is even more challenging due to the absence of gradients. In this paper we propose ALOE, a new algorithm for learning conditional and unconditional EBMs for discrete, structured data, where parameter gradients are estimated using a learned sampler mimicking random local search, and thus, achieving a better trade-off between flexibility and tractability. We show that the sampler can still be trained efficiently using a bias reduction principle that alternates between importance reweighted maximum likelihood estimation and Gibbs sampling. Experimentally, we show that learning local search leads to significant improvements on two different domains in program synthesis and software engineering. Most notably, our energy model guided fuzzer for software testing can achieve comparable performance to well engineered fuzzing engines like libfuzzer on some targets. View details
    Preview abstract We present an efficient algorithm for maximum likelihood estimation (MLE) of the general exponential family, even in cases when the energy function is represented by a deep neural network. We consider the primal-dual view of the MLE for the kinetics augmented model, which naturally introduces an adversarial dual sampler. The sampler is represented by a novel neural network architecture, dynamics embeddings, mimicking dynamics-based samplers, e.g., Hamiltonian Monte-Carlo (HMC) and its variants. The dynamics embedding parametrization inherits the flexibility of HMC, and provides tractable entropy estimation of the augmented model. Meanwhile, it couples the adversarial dual samplers with the primal model, reducing memory and sample complexity. We further show that several existing estimators, including contrastive divergence (Hinton, 2002), score matching (Hyvärinen, 2005), pseudo-likelihood (Besag, 1975), noise-contrastive estimation (Gutmann and Hyvärinen, 2010), non-local contrastive objectives (Vickrey et al., 2010), and minimum probability flow (Sohl-Dickstein et al., 2011), can be recast as special cases of the proposed method with different prefixed dual samplers. Finally, we empirically demonstrate the superiority of the proposed estimator against existing state-of-the-art methods on synthetic and real-world benchmarks. View details
    Preview abstract Retrosynthesis is one of the fundamental problems in organic chemistry. The task is to identify reactants that can be used to synthesize a specified product molecule. Recently, computer-aided retrosynthesis is finding renewed interest from both chemistry and computer science communities. Most existing approaches rely on template-based models that define subgraph matching rules, but whether or not a chemical reaction can proceed is not defined by hard decision rules. In this work, we propose a new approach to this task using the Conditional Graph Logic Network, a conditional graphical model built upon graph neural networks that learns when rules from reaction templates should be applied, implicitly considering whether the resulting reaction would be both chemically feasible and strategic. We also propose an efficient hierarchical sampling to alleviate the computation cost. While achieving a significant improvement of 8.2% over current state-of-the-art methods on the benchmark dataset, our model also offers interpretations for the prediction. View details
    Preview abstract In this paper, we revisit the penalized MLE for learning the exponential family distribution whose natural parameter belongs to a reproducing kernel Hilbert space. We introduce the doubly dual embedding technique, by which the computation of the partition function is avoided. It also paves the way to learning a flexible sampler simultaneously, thereby amortizing the cost of Monte-Carlo sampling in the inference stage. The estimator can be easily generalized to the kernel conditional exponential family. Meanwhile, as a byproduct, we establish the connection between Wasserstein GAN and infinite-dimensional exponential family estimation, revealing a new perspective for understanding GANs. Compared to the existing score matching based estimator initiated by Sriperumbudur et al. (2017), our method is not only more efficient in terms of both memory and computational cost, but also achieves a better statistical convergence rate. The proposed estimator outperforms the current state-of-the-art methods empirically on both kernel conditional and unconditional exponential family estimation. View details
    Preview abstract In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, our algorithm eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic of previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques. View details
    Bayesian Meta-network Architecture Learning
    Albert Shaw
    Weiyang Liu
    Le Song
    NIPS 2018 Workshop on Bayesian Deep Learning (2018)
    Preview abstract For deep neural networks, the particular structure often plays a vital role in achieving state-of-the-art performance in many practical applications. There is much recent work focusing on designing novel structures for neural networks. However, due to the combinatorial nature of the design space, hand-designing architectures is expensive and potentially sub-optimal. Developing techniques to automatically search this space has become a large focus of many recent efforts, and methods such as genetic and reinforcement learning based algorithms have been quite successful in achieving state-of-the-art performance on several tasks. However, the neural network structures in the existing methods are searched through strongly task-dependent methods, so the architecture search must be repeated for each new task. In this paper, we first propose a Bayesian view for differential architecture search, by which we can easily generalize the structure searching to the few-shot meta-learning setting. Following the optimization embedding technique (Dai et al., 2018) for variational inference, we propose an efficient method for meta-network architecture searching. We test the algorithm on the few-shot learning benchmark, demonstrating the superiority of the proposed algorithm. View details
    Boosting the actor with dual critic
    Albert Shaw
    Niao He
    Lihong Li
    Le Song
    ICLR 2018
    Preview abstract This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic or Dual-AC. It is derived in a principled way from the Lagrangian dual form of the Bellman optimality equation, which can be viewed as a two-player game between the actor and a critic-like function, which is named as dual critic. Compared to its actor-critic relatives, Dual-AC has the desired property that the actor and dual critic are updated cooperatively to optimize the same objective function, providing a more transparent way for learning the critic that is directly related to the objective function of the actor. We then provide a concrete algorithm that can effectively solve the minimax optimization problem, using techniques of multistep bootstrapping, path regularization, and stochastic dual ascent algorithm. We demonstrate that the proposed algorithm achieves state-of-the-art performance across several benchmarks. View details
    SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation
    Albert Shaw
    Lihong Li
    Lin Xiao
    Niao He
    Zhen Liu
    Jianshu Chen
    Le Song
    ICML 2018
    Preview abstract When function approximation is used, solving the Bellman optimality equation with stability guarantees has remained a major open problem in reinforcement learning for decades. The fundamental difficulty is that the Bellman operator may become an expansion in general, resulting in oscillating and even divergent behavior of popular algorithms like Q-learning. In this paper, we revisit the Bellman equation, and reformulate it into a novel primal-dual optimization problem using Nesterov's smoothing technique and the Legendre-Fenchel transformation. We then develop a new algorithm, called Smoothed Bellman Error Embedding, to solve this optimization problem where any differentiable function class may be used. We provide what we believe to be the first convergence guarantee for general nonlinear function approximation, and analyze the algorithm's sample complexity. Empirically, our algorithm compares favorably to state-of-the-art baselines in several benchmark control problems. View details
    Coupled Variational Bayes via Optimization Embedding
    Hanjun Dai
    Niao He
    Weiyang Liu
    Zhen Liu
    Jianshu Chen
    Lin Xiao
    Le Song
    NIPS 2018 (2018)
    Preview abstract Variational inference plays a vital role in learning graphical models, especially on large-scale datasets. Much of its success depends on a proper choice of auxiliary distribution class for posterior approximation. However, how to pursue an auxiliary distribution class that achieves both good approximation ability and computation efficiency remains a core challenge. In this paper, we propose coupled variational Bayes, which exploits the primal-dual view of the ELBO with the variational distribution class generated by an optimization procedure, which is termed optimization embedding. This flexible function class couples the variational distribution with the original parameters in the graphical models, allowing end-to-end learning of the graphical models by back-propagation through the variational distribution. Theoretically, we establish an interesting connection to gradient flow and demonstrate the extreme flexibility of this implicit distribution family in the limit sense. Empirically, we demonstrate the effectiveness of the proposed method on multiple graphical models with either continuous or discrete latent variables compared to state-of-the-art methods. View details
    Preview abstract Learning-based binary hashing has become a powerful paradigm for fast search and retrieval in massive databases. However, due to the requirement of discrete outputs for the hash functions, learning such functions is known to be very challenging. In addition, the objective functions adopted by existing hashing techniques are mostly chosen heuristically. In this paper, we propose a novel generative approach to learn hash functions through Minimum Description Length principle such that the learned hash codes maximally compress the dataset and can also be used to regenerate the inputs. We also develop an efficient learning algorithm based on the stochastic distributional gradient, which avoids the notorious difficulty caused by binary output constraints, to jointly optimize the parameters of the hash function and the associated generative model. Extensive experiments on a variety of large-scale datasets show that the proposed method achieves better retrieval results than the existing state-of-the-art methods. View details