Qiuyi (Richard) Zhang
I am currently at Google Brain/AI in Pittsburgh, working on hyperparameter optimization, Bayesian methods, and theoretical deep learning. My personal website is here.
Authored Publications
Abstract
What are the dimensions of human intent, and how do writing tools shape and augment their expression? From papyrus to auto-complete, a major turning point came when Alan Turing famously asked, “Can Machines Think?” If so, should we offload aspects of our thinking to machines, and what impact do they have on enabling our intentions? This paper adapts the Authorial Leverage framework from the Intelligent Narrative Technologies literature to the evaluation of recent generative-model advances. As access to Large Language Models (LLMs) widens, our evaluative frameworks must evolve in step. To this end, we discuss previous expert studies of deep generative models for fiction writers and playwrights, and propose two future directions, (1) author-focused and (2) audience-focused, for furthering our understanding of the Authorial Leverage of LLMs, particularly in the domain of comedy writing.
Abstract
We study the setting of optimizing with bandit feedback with additional prior knowledge provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint yields provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.
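To make the setting concrete, here is a minimal simulation sketch. It is not the paper's algorithm: a standard LinUCB learner runs over a finite action set, and early rounds are simply biased toward a supplied hint action with a trust/sqrt(t) schedule, which is an illustrative assumption of ours.

import numpy as np

def hinted_linucb(theta_star, actions, hint, T=2000, lam=1.0, beta=2.0,
                  trust=0.9, noise=0.1, seed=0):
    # Toy stochastic linear bandit: reward = <theta_star, a> + Gaussian noise.
    # A plain LinUCB learner that plays the hint with probability trust/sqrt(t)
    # in round t -- a heuristic stand-in, not the paper's algorithm.
    rng = np.random.default_rng(seed)
    d = len(theta_star)
    V, b = lam * np.eye(d), np.zeros(d)            # ridge-regression statistics
    best = (actions @ theta_star).max()
    regret = 0.0
    for t in range(1, T + 1):
        theta_hat = np.linalg.solve(V, b)
        Vinv = np.linalg.inv(V)
        width = np.sqrt(np.einsum('ij,jk,ik->i', actions, Vinv, actions))
        if rng.random() < trust / np.sqrt(t):
            a = hint                                # exploit the prior knowledge
        else:
            a = actions[np.argmax(actions @ theta_hat + beta * width)]
        V += np.outer(a, a)
        b += (a @ theta_star + noise * rng.standard_normal()) * a
        regret += best - a @ theta_star
    return regret

With an accurate hint (say, the best row of actions), early rounds incur almost no regret; with a misleading hint, the decaying schedule caps the damage and the LinUCB fallback supplies the worst-case guarantee, which is the tradeoff the paper quantifies.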
One network fits all? Modular versus monolithic task formulations in neural networks
Abhimanyu Das
Atish Agarwala
Brendan Juba
Vatsal Sharan
ICLR 2021
Abstract
Can deep learning solve multiple, very different tasks simultaneously? We investigate how the representations of the underlying tasks affect the ability of a single neural network to learn them jointly. We present theoretical and empirical findings that a single neural network is capable of simultaneously learning multiple tasks from a combined data set, under a variety of task representations---for example, when the distinct tasks are represented by well-separated clusters or decision trees over some task-code attributes. More strongly, we present a novel analysis showing that families of simple programming-like constructs for the task codings are learnable by two-layer neural networks with standard training. We study more generally how the complexity of learning such combined tasks grows with the complexity of the task codes; we find that learning many tasks can be provably hard, even though the individual tasks are easy to learn. We provide empirical support for the usefulness of the learning bounds by training networks on clusters, decision trees, and SQL-style aggregation.
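As a toy rendition of the combined-data-set setup (an illustration of the idea, not the paper's actual experiments), one can merge two different labeling rules into one training set, expose the task code as an input attribute, and check that a single network learns both tasks jointly:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((4000, 5))
code = rng.integers(0, 2, size=4000)        # task-code attribute per example
# Task 0 labels by the sign of feature 0; task 1 by the sign of feature 1.
y = np.where(code == 0, X[:, 0] > 0, X[:, 1] > 0).astype(int)
Xc = np.hstack([X, code[:, None]])          # one combined data set, code appended
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(Xc[:3000], y[:3000])
print("joint test accuracy:", net.score(Xc[3000:], y[3000:]))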
Abstract
Matrix trace estimation is ubiquitous in machine learning applications and has traditionally relied on the simple Hutchinson estimator, which requires $O(\log(1/\delta)/\epsilon^2)$ matrix-vector product queries to achieve an $\epsilon$ additive error with failure probability $\delta$. Recently, the Hutch++ algorithm was proposed, which reduces the number of matrix-vector queries from $O(1/\epsilon^2)$ to the optimal $O(1/\epsilon)$ on positive-semidefinite input matrices $A$, achieving a $(1\pm \epsilon)$-multiplicative approximation to the trace of $A$ with constant probability; however, in the high-probability setting, the non-adaptive method suffers an extra $O(\sqrt{\log(1/\delta)})$ multiplicative factor in its query complexity. Non-adaptive methods are important, as they correspond to sketching algorithms, which are mergeable, highly parallelizable, and yield low-memory streaming algorithms as well as low-communication distributed protocols. In this work, we close the gap between non-adaptive and adaptive algorithms, showing that even non-adaptive algorithms can achieve $O(\sqrt{\log(1/\delta)}/\epsilon + \log(1/\delta))$ matrix-vector products. In addition, we prove matching lower bounds demonstrating that, up to a $\log \log(1/\delta)$ factor, no further improvement in the dependence on $\delta$ or $\epsilon$ is possible by any non-adaptive algorithm. Finally, our experiments demonstrate the superior performance of our sketch over adaptive methods, which are not parallelizable, as well as over the standard non-adaptive Hutchinson's method.
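For context, here is a compact NumPy rendition of the two estimators the abstract builds on: classic Hutchinson and the (adaptive) Hutch++; the paper's contribution, a fully non-adaptive sketch with the improved $\delta$ dependence, is not reproduced here. The function matvec is assumed to apply $A$ to a block of vectors.

import numpy as np

def hutchinson(matvec, n, m, rng):
    # Classic estimator: E[g^T A g] = tr(A) for Rademacher vectors g.
    G = rng.choice([-1.0, 1.0], size=(n, m))
    return np.einsum('ij,ij->', G, matvec(G)) / m

def hutch_pp(matvec, n, m, rng):
    # Hutch++: spend ~m/3 queries on a sketch Q of A's top range, take
    # tr(Q^T A Q) exactly, and run Hutchinson on the small-trace residual.
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))
    Q, _ = np.linalg.qr(matvec(S))
    G = rng.choice([-1.0, 1.0], size=(n, m - 2 * k))
    G -= Q @ (Q.T @ G)                      # project test vectors off range(Q)
    return np.trace(Q.T @ matvec(Q)) + np.einsum('ij,ij->', G, matvec(G)) / G.shape[1]

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200)); A = B @ B.T        # a PSD test matrix
mv = lambda X: A @ X
print(np.trace(A), hutchinson(mv, 200, 60, rng), hutch_pp(mv, 200, 60, rng))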
Abstract
Single-objective black-box optimization (also known as zeroth-order optimization) is the process of minimizing a scalar objective $f(x)$, given evaluations at adaptively chosen inputs $x$. In this paper, we consider multi-objective optimization, where $f(x)$ outputs a vector of possibly competing objectives and the goal is to converge to the Pareto frontier. Quantitatively, we wish to maximize the standard \emph{hypervolume indicator} metric, which measures the dominated hypervolume of the entire set of chosen inputs. We introduce a novel scalarization function, which we term the \emph{hypervolume scalarization}, and show that drawing random scalarizations from an appropriately chosen distribution can be used to efficiently approximate the hypervolume indicator metric. We utilize this connection to show that Bayesian optimization with our scalarization via common acquisition functions, such as Thompson Sampling or Upper Confidence Bound, provably converges to the whole Pareto frontier, by deriving tight \emph{hypervolume regret} bounds on the order of $\widetilde{O}(\sqrt{T})$. Furthermore, we highlight the general utility of our scalarization framework by showing that any provably convergent single-objective optimization process can be converted to a multi-objective optimization process with provable convergence guarantees.
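The scalarization-to-hypervolume connection can be sanity-checked numerically. The sketch below encodes my reading of the identity from the abstract (the exact scalarization form and normalizing constant are assumptions on my part): the dominated hypervolume equals $c_k$ times the expected best scalarized value over directions $\lambda$ drawn uniformly from the positive unit sphere.

import numpy as np
from math import gamma, pi

def hv_via_scalarization(Y, z, n_samples=200_000, seed=0):
    # Monte Carlo hypervolume (maximization) of points Y w.r.t. reference z via
    #   s_lambda(y) = min_i max(0, (y_i - z_i) / lambda_i)^k,
    #   HV_z(Y) ~ c_k * E_lambda[ max_{y in Y} s_lambda(y) ],
    # with c_k = pi^(k/2) / (2^k * Gamma(k/2 + 1)).  (Assumed form.)
    rng = np.random.default_rng(seed)
    k = Y.shape[1]
    lam = np.abs(rng.standard_normal((n_samples, k)))
    lam /= np.linalg.norm(lam, axis=1, keepdims=True)   # positive unit sphere
    ratios = np.maximum((Y[None, :, :] - z) / lam[:, None, :], 0.0)
    scal = (ratios.min(axis=2) ** k).max(axis=1)
    c_k = pi ** (k / 2) / (2 ** k * gamma(k / 2 + 1))
    return c_k * scal.mean()

Y = np.array([[1.0, 0.5], [0.5, 1.0]])
print(hv_via_scalarization(Y, z=np.zeros(2)))   # exact dominated hypervolume: 0.75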
Gradientless Descent: High-Dimensional Zeroth-Order Optimization
ICLR 2020
Abstract
Zeroth-order optimization is the process of minimizing an objective $f(x)$ given oracle access to evaluations at adaptively chosen inputs $x$. In this paper, we present two simple yet powerful GradientLess Descent (GLD) algorithms that do not rely on an underlying gradient estimate and are numerically stable. We analyze our algorithms from a novel geometric perspective and derive two invariance properties: monotone and affine invariance. Specifically, for {\it any monotone transform} of a smooth and strongly convex objective with latent dimension $k$, we present a novel analysis that shows convergence within an $\epsilon$-ball of the optimum in $O(kQ\log(n)\log(R/\epsilon))$ evaluations, where the input dimension is $n$, $R$ is the diameter of the input space, and $Q$ is the condition number. Our rates are the first of their kind to be both 1) poly-logarithmically dependent on dimensionality and 2) invariant under monotone transformations. From our geometric perspective, we further show that our analysis is optimal. We emphasize that monotone and affine invariance are key to the empirical success of gradientless algorithms, as demonstrated on BBOB and MuJoCo benchmarks.
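A bare-bones sketch of the idea (under my assumptions about the radius schedule; not the exact algorithm or constants from the paper): at each step, sample one candidate per radius on a geometric grid and keep whichever evaluation improves the most. Because it only compares function values, the iterates are unchanged under any monotone transform of the objective.

import numpy as np

def gradientless_descent(f, x0, R, r_min, T=500, seed=0):
    # Comparison-based descent: try a random step at every radius in a
    # geometric grid between r_min and R; move to the best improvement, if any.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = x.size
    K = int(np.ceil(np.log2(R / r_min)))
    for _ in range(T):
        best_x, best_f = x, f(x)
        for i in range(K + 1):
            v = rng.standard_normal(n)
            y = x + (R / 2 ** i) * v / np.linalg.norm(v)   # step of radius R/2^i
            fy = f(y)
            if fy < best_f:
                best_x, best_f = y, fy
        x = best_x
    return x

# A monotone transform of a strongly convex quadratic leaves GLD unaffected.
x_opt = gradientless_descent(lambda x: np.log1p((x ** 2).sum()),
                             x0=np.ones(10), R=4.0, r_min=1e-4)
print(np.linalg.norm(x_opt))   # should be near 0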
Abstract
The tremendous success of deep neural networks has motivated the need to better understand the fundamental properties of these networks, but many of the theoretical results proposed have only been for shallow networks. In this paper, we study an important primitive for understanding the meaningful input space of a deep network: span recovery. For $k < n$, let $A \in \mathbb{R}^{k \times n}$ be the innermost weight matrix of an arbitrary feed-forward neural network $M: \mathbb{R}^n \to \mathbb{R}$, so $M(x)$ can be written as $M(x) = \sigma(Ax)$ for some network $\sigma: \mathbb{R}^k \to \mathbb{R}$. The goal is then to recover the row span of $A$ given only oracle access to the value of $M(x)$. We show that if $M$ is a multi-layered network with ReLU activation functions, then partial recovery is possible: namely, we can provably recover $k/2$ linearly independent vectors in the row span of $A$ using $\mathrm{poly}(n)$ non-adaptive queries to $M(x)$. Furthermore, if $M$ has differentiable activation functions, we demonstrate that full span recovery is possible even when the output is first passed through a sign or $0/1$ thresholding function; in this case our algorithm is adaptive. Empirically, we confirm that full span recovery can fail, but only for unrealistically thin layers; for reasonably wide networks, we obtain full span recovery on both random networks and networks trained on MNIST data. Furthermore, we demonstrate the utility of span recovery as an attack by inducing neural networks to misclassify data obfuscated by controlled random noise as sensical inputs.
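For the differentiable case, the core observation is easy to reproduce: by the chain rule, $\nabla_x M(x) = A^\top \nabla\sigma(Ax)$, so every gradient of $M$ lies in the row span of $A$. The sketch below is illustrative only (it uses central finite differences at random inputs rather than the paper's query scheme, and the tiny tanh network is a stand-in):

import numpy as np

def recover_row_span(M, n, num_points=50, eps=1e-5, seed=0):
    # Estimate grad M(x) at random points by central finite differences; each
    # gradient lies in rowspan(A) whenever M(x) = sigma(A x) is differentiable.
    rng = np.random.default_rng(seed)
    grads = []
    for _ in range(num_points):
        x = rng.standard_normal(n)
        g = np.array([(M(x + eps * e) - M(x - eps * e)) / (2 * eps)
                      for e in np.eye(n)])
        grads.append(g)
    G = np.vstack(grads)
    return G, np.linalg.matrix_rank(G, tol=1e-4)

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 20))                 # innermost weights, k=5, n=20
M = lambda x: np.tanh(np.tanh(A @ x)).sum()      # a tiny smooth "network" sigma(Ax)
G, rank = recover_row_span(M, n=20)
print("recovered rank:", rank)                   # expected: 5 = k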