Dale Schuurmans
Dale Schuurmans is a Research Scientist, Professor of Computing Science at the University of Alberta, Canada CIFAR AI Chair, Amii Fellow, and Fellow of the Association for the Advancement of Artificial Intelligence.
More information can be found at www.cs.ualberta.ca/~daes and on Google Scholar.
Authored Publications
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Jason Wei
Sharan Narang
Aakanksha Chowdhery
ICLR 2023 (to appear)
Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
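As a rough illustration of the decoding strategy described above, the sketch below samples several chain-of-thought completions and takes a majority vote over the parsed final answers. The `sample_completion` and `extract_answer` helpers are hypothetical placeholders for whatever model API and answer parser are available; this is not the paper's reference implementation.

```python
from collections import Counter

def self_consistency_answer(prompt, sample_completion, extract_answer,
                            n_paths=40, temperature=0.7):
    """Sample diverse reasoning paths and return the most consistent final answer."""
    answers = []
    for _ in range(n_paths):
        completion = sample_completion(prompt, temperature=temperature)  # one sampled chain of thought
        answers.append(extract_answer(completion))                       # parse its final answer
    # Marginalizing out the reasoning paths reduces to a majority vote over final answers.
    return Counter(answers).most_common(1)[0][0]
```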
Careful prompt design is critical to the use of large language models in zero-shot or few-shot learning. As a consequence, there is growing interest in automated methods for designing optimal prompts. In this work, we propose Test-time Prompt Editing using Reinforcement Learning (TEMPERA). In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge, is adaptive to different queries, and provides an interpretable prompt for every query. To achieve this, we design a novel action space that allows flexible editing of the initial prompts, covering a wide set of commonly used components such as instructions, few-shot exemplars, and verbalizers. The proposed method achieves significant gains over recent state-of-the-art approaches such as prompt tuning, AutoPrompt, and RLPrompt, across a variety of tasks including sentiment analysis, topic classification, natural language inference, and reading comprehension. Our method achieves a 5.33x improvement in sample efficiency on average compared to traditional fine-tuning methods.
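A minimal sketch of the kind of discrete edit action space the abstract describes, swapping few-shot exemplars or rewriting the instruction and verbalizer of an initial prompt. The data structures and function names are assumptions for illustration, not the authors' implementation; in TEMPERA such edits would be selected per query by a learned RL policy.

```python
# Illustrative prompt-editing actions in the spirit of TEMPERA (names are assumptions).

def swap_exemplars(prompt, i, j):
    exemplars = list(prompt["exemplars"])
    exemplars[i], exemplars[j] = exemplars[j], exemplars[i]
    return {**prompt, "exemplars": exemplars}

def set_instruction(prompt, instruction):
    return {**prompt, "instruction": instruction}

def set_verbalizer(prompt, verbalizer):
    return {**prompt, "verbalizer": verbalizer}

def render(prompt, query):
    shots = "\n".join(prompt["exemplars"])
    return f'{prompt["instruction"]}\n{shots}\nInput: {query}\nAnswer ({prompt["verbalizer"]}):'

# An RL policy would choose among such edits per query; here one edit is applied by hand.
prompt = {"instruction": "Classify the sentiment of the input.",
          "exemplars": ["great acting -> positive", "dull plot -> negative"],
          "verbalizer": "positive / negative"}
print(render(swap_exemplars(prompt, 0, 1), "a surprisingly moving film"))
```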
Approaches to policy optimization have been motivated from diverse principles, based on how the parametric model is interpreted or how the learning objective is formulated, yet they share a common goal of maximizing expected return. To better capture the commonalities and identify the key differences between alternative policy optimization methods, we develop a unified perspective that re-expresses the underlying update rules in terms of a limited choice of gradient form and a scaling function. In particular, we identify a unified space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO. The primary benefit is that the framework also reveals novel but still well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality. An experimental investigation demonstrates that the additional degrees of freedom identified in the unification can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.
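To make the "gradient form times scaling function" view concrete, here is a toy sketch for a softmax policy over discrete actions: the same gradient form is reused while different scaling functions yield different, but related, updates. The parameterization and the example scaling functions are illustrative assumptions, not the exact family studied in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_update(theta, action, advantage, scaling_fn, lr=0.1):
    """One approximate update: step along d log pi(action)/d theta, scaled by scaling_fn."""
    pi = softmax(theta)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0                       # gradient form for softmax logits
    return theta + lr * scaling_fn(advantage) * grad_log_pi

identity_scaling = lambda adv: adv                   # recovers the classical policy gradient
sign_scaling = lambda adv: np.sign(adv)              # a different, still well-motivated scaling

theta = np.zeros(3)
for _ in range(10):
    theta = policy_update(theta, action=1, advantage=2.0, scaling_fn=identity_scaling)
print(softmax(theta))
```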
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt the score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data, and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
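For intuition, the snippet below simulates a simple forward corruption process of the kind the abstract refers to: a continuous-time Markov chain that jumps each categorical coordinate to a uniformly random category at a fixed rate, so the transition kernel has a closed form. The specific process and rate schedule are assumptions for illustration and may differ from the paper's.

```python
import numpy as np

def corrupt(x0, t, num_classes, lam=1.0, seed=0):
    """Sample x_t given x_0 under a uniform-jump CTMC: keep x_0 with probability
    exp(-lam * t), otherwise resample uniformly (which may coincide with x_0)."""
    rng = np.random.default_rng(seed)
    jumped = rng.random(x0.shape) > np.exp(-lam * t)
    random_categories = rng.integers(0, num_classes, size=x0.shape)
    return np.where(jumped, random_categories, x0)

x0 = np.array([0, 3, 3, 1, 2])
print(corrupt(x0, t=0.25, num_classes=4))  # lightly corrupted
print(corrupt(x0, t=5.0, num_classes=4))   # nearly uniform noise
```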
Emergent abilities of large language models
Barret Zoph
Colin Raffel
Dani Yogatama
Jason Wei
Liam B. Fedus
Maarten Paul Bosma
Percy Liang
Sebastian Borgeaud
Tatsunori B. Hashimoto
Yi Tay
TMLR (2022)
Scaling up language models has been shown to predictably confer a range of benefits such as improved performance and sample efficiency. This paper discusses an unpredictable phenomenon that we call emergent abilities of large language models. On tasks requiring such abilities, performance remains close to random until the model reaches a sufficiently large scale, and hence their emergence cannot be predicted by extrapolating a scaling law fit to small-scale models. The emergence of such abilities suggests that additional scaling could further expand the range of tasks that language models can perform. We discuss the implications of these phenomena and suggest directions for future research.
Stochastic dual dynamic programming (SDDP) is one of the state-of-the-art algorithms for multi-stage stochastic optimization, yet its cost grows exponentially with the dimension of the decision variables, so it quickly becomes inapplicable to high-dimensional problems. We introduce a neuralized component into SDDP that outputs a piece-wise linear function in a low-dimensional space to approximate the value function, based on the context of the problem instance. The neuralized component continually evolves to abstract an effective low-dimensional action space and to improve the quality of the value function approximation for each problem, building on prior successful experience. It is seamlessly integrated with SDDP to form our neural enhanced solver, \AlgName (\algshort), which attains optimality without loss of accuracy at faster speed for high-dimensional, long-horizon multi-stage stochastic optimization. Thorough empirical experiments demonstrate the benefits of \algshort in terms of transferability and scalability: \algshort significantly outperforms the competitors, including SDDP and variants of RL algorithms, in terms of solution quality, feasibility, and computational speed.
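For context, SDDP-style methods represent the cost-to-go with a piece-wise linear value function built as a maximum over affine cuts; a minimal sketch of that data structure is given below. The class and method names are illustrative assumptions; the paper's neuralized component would generate such cuts in a learned low-dimensional space conditioned on the problem instance.

```python
import numpy as np

class PiecewiseLinearValue:
    """Lower-bounding cost-to-go as a max over affine cuts (SDDP-style)."""

    def __init__(self):
        self.slopes = []      # one cut slope (subgradient) per cut
        self.intercepts = []  # one intercept per cut

    def add_cut(self, slope, intercept):
        self.slopes.append(np.asarray(slope, dtype=float))
        self.intercepts.append(float(intercept))

    def __call__(self, x):
        # Evaluate the piece-wise linear lower bound at state x.
        return max(s @ x + b for s, b in zip(self.slopes, self.intercepts))

V = PiecewiseLinearValue()
V.add_cut([1.0, -0.5], 2.0)
V.add_cut([-0.2, 0.3], 1.0)
print(V(np.array([0.4, 0.6])))
```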
On the Optimality of Batch Policy Optimization Algorithms
Chenjun Xiao
Yifan Wu
Tor Lattimore
Jincheng Mei
Lihong Li
ICML 2021
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization.
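A minimal sketch of a confidence-adjusted index rule of the kind described above, assuming a Hoeffding-style confidence width: the index is the empirical mean plus beta times the width, with beta > 0 optimistic (UCB-like), beta = 0 neutral, and beta < 0 pessimistic (LCB-like). The constants and width formula are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def confidence_adjusted_index(means, counts, beta, delta=0.1):
    """Empirical mean per arm plus beta times a Hoeffding-style confidence width."""
    width = np.sqrt(np.log(1.0 / delta) / (2.0 * np.maximum(counts, 1)))
    return means + beta * width

means = np.array([0.6, 0.5, 0.7])    # empirical means from the batch data
counts = np.array([100, 10, 3])      # number of samples per arm

for beta, name in [(1.0, "optimistic (UCB)"), (0.0, "neutral"), (-1.0, "pessimistic (LCB)")]:
    idx = confidence_adjusted_index(means, counts, beta)
    print(name, "-> chooses arm", int(np.argmax(idx)))
```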
Classical global convergence results for first-order methods rely on uniform smoothness and the Łojasiewicz inequality. Motivated by properties of objective functions that arise in machine learning, we propose a non-uniform refinement of these notions, leading to Non-uniform Smoothness (NS) and Non-uniform Łojasiewicz inequality (NŁ). The new definitions inspire new geometry-aware first-order methods that are able to converge to global optimality faster than the classical Ω(1/t^2) lower bounds. To illustrate the power of these geometry-aware methods and their corresponding non-uniform analysis, we consider two important problems in machine learning: policy gradient optimization in reinforcement learning (PG), and generalized linear model training in supervised learning (GLM). For PG, we find that normalizing the gradient ascent method can accelerate convergence to O(e^{-t}) while incurring less overhead than existing algorithms. For GLM, we show that geometry-aware normalized gradient descent can also achieve a linear convergence rate, which significantly improves the best known results. We additionally show that the proposed geometry-aware descent methods escape landscape plateaus faster than standard gradient descent. Experimental results are used to illustrate and complement the theoretical findings.
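As a toy illustration of the normalized (geometry-aware) gradient method mentioned for GLMs, the sketch below runs fixed-length gradient steps on a logistic-regression objective. The step size, iteration count, and normalization are assumptions for illustration, not the constants analyzed in the paper.

```python
import numpy as np

def logistic_loss_and_grad(w, X, y):
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-z))
    loss = np.mean(np.logaddexp(0.0, -(2.0 * y - 1.0) * z))  # numerically stable log-loss
    grad = X.T @ (p - y) / len(y)
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

w = np.zeros(5)
for _ in range(300):
    loss, g = logistic_loss_and_grad(w, X, y)
    w -= 0.2 * g / (np.linalg.norm(g) + 1e-12)   # normalized step: fixed-length move along -g
print("final training loss:", loss)
```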
LEGO: Latent Execution-Guided Reasoning for Multi-Hop Question Answering on Knowledge Graphs
Hongyu Ren
Michihiro Yasunaga
Haitian Sun
Jure Leskovec
ICML 2021
Answering complex natural language questions on knowledge graphs (KGQA) is a challenging task. It requires reasoning over the input natural language question as well as a massive, incomplete, heterogeneous KG. Prior methods obtain an abstract structured query graph/tree from the input question and traverse the KG for answers following the query tree. However, they inherently cannot deal with missing links in the KG. Here we present LEGO, a Latent Execution-Guided reasOning framework to handle this challenge in KGQA. LEGO works in an iterative way, alternating between (1) a Query Synthesizer, which synthesizes a reasoning action and grows the query tree step by step, and (2) a Latent Space Executor, which executes the reasoning action in the latent embedding space to cope with the missing information in the KG. To learn the synthesizer without step-wise supervision, we design a generic latent-execution-guided bottom-up search procedure to find good execution traces efficiently in the vast query space. Experimental results on several KGQA benchmarks demonstrate the effectiveness of our framework compared with previous state of the art.
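A purely structural sketch of the iterative loop the abstract describes, alternating between a query synthesizer and a latent-space executor. The interfaces (`propose`, `execute`, `nearest_entities`) are hypothetical placeholders, since the actual components are learned neural modules operating on embeddings.

```python
def lego_answer(question, synthesizer, executor, max_steps=4):
    """Alternate between synthesizing a reasoning action and executing it in latent space."""
    query_tree = executor.init(question)                       # latent embedding of the initial query
    for _ in range(max_steps):
        action = synthesizer.propose(question, query_tree)     # e.g. relation projection or intersection
        if action is None:                                     # synthesizer decides the tree is complete
            break
        query_tree = executor.execute(query_tree, action)      # grow the query tree in embedding space
    return executor.nearest_entities(query_tree)               # answers: entities closest to the final embedding
```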
We study stochastic policy optimization in the on-policy case and make the following four contributions. First, we show that the ordering of optimization algorithms by their efficiency gets reversed depending on whether they have access to the true gradient or only to stochastic gradient estimates. In particular, this finding implies that, unlike in the true gradient scenario, geometric information cannot be easily exploited without detrimental consequences in stochastic policy optimization. Second, to explain these findings we introduce the concept of committal rate for stochastic policy optimization, and show that this can serve as a criterion for determining almost sure convergence to global optimality. Third, we show that if there is no external mechanism that allows an algorithm to determine the difference between optimal and sub-optimal actions using only on-policy samples, then there must be an inherent trade-off between exploiting geometry to accelerate convergence and achieving optimality almost surely. That is, an algorithm either converges to a globally optimal policy with probability 1 but at a rate no better than O(1/t), or it achieves a faster than O(1/t) convergence rate but then must fail to converge to the globally optimal deterministic policy with some positive probability. Finally, we use our committal rate theory to explain why practical policy optimization methods are sensitive to random initialization, and how an ensemble method with parallelism can be guaranteed to achieve near-optimal solutions with high probability.
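The toy simulation below illustrates the committal phenomenon on a two-armed bandit with a softmax policy updated only from on-policy samples: a fast-committing rule that adds the sampled reward directly to the chosen action's logit locks onto the sub-optimal arm in a nontrivial fraction of runs when the step size is large. All constants and the specific update rule are illustrative assumptions.

```python
import numpy as np

def run(lr, seed, steps=2000, rewards=(1.0, 0.7)):
    """On-policy softmax bandit with a fast-committing update on the sampled action's logit."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(steps):
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()
        a = rng.choice(2, p=pi)            # only on-policy samples are available
        theta[a] += lr * rewards[a]        # fast-committing update: add reward to chosen logit
    return int(np.argmax(theta))

for lr in (0.05, 2.0):
    failures = sum(run(lr, seed) != 0 for seed in range(100))
    print(f"step size {lr}: sub-optimal arm chosen in {failures}/100 runs")
```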