Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10054 publications
    Augmented Object Intelligence with XR-Objects
    Mustafa Doga Dogan
    Eric Gonzalez
    Karan Ahuja
    Andrea Colaco
    Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST), ACM (2024), pp. 1-15
    Preview abstract Seamless integration of physical objects as interactive digital entities remains a challenge for spatial computing. This paper explores Augmented Object Intelligence (AOI) in the context of XR, an interaction paradigm that aims to blur the lines between digital and physical by equipping real-world objects with the ability to interact as if they were digital, where every object has the potential to serve as a portal to digital functionalities. Our approach utilizes real-time object segmentation and classification, combined with the power of Multimodal Large Language Models (MLLMs), to facilitate these interactions without the need for object pre-registration. We implement the AOI concept in the form of XR-Objects, an open-source prototype system that provides a platform for users to engage with their physical environment in contextually relevant ways using object-based context menus. This system enables analog objects to not only convey information but also to initiate digital actions, such as querying for details or executing tasks. Our contributions are threefold: (1) we define the AOI concept and detail its advantages over traditional AI assistants, (2) detail the XR-Objects system’s open-source design and implementation, and (3) show its versatility through various use cases and a user study. View details
    Preview abstract Effective model calibration is a critical and indispensable component in developing Media Mix Models (MMMs). One advantage of Bayesian-based MMMs lies in their capacity to accommodate the information from experiment results and the modelers' domain knowledge about the ad effectiveness by setting priors for the model parameters. However, it remains ambiguous about how and which Bayesian priors should be tuned for calibration purpose. In this paper, we propose a new calibration method through model reparameterization. The reparameterized model includes Return on Ads Spend (ROAS) as a model parameter, enabling straightforward adjustment of its prior distribution to align with either experiment results or the modeler's prior knowledge. The proposed method also helps address several key challenges regarding combining MMMs and incrementality experiments. We use simulations to demonstrate that our approach can significantly reduce the bias and uncertainty in the resultant posterior ROAS estimates. View details
    VideoPoet: A Large Language Model for Zero-Shot Video Generation
    Dan Kondratyuk
    Xiuye Gu
    Grant Schindler
    Rachel Hornung
    Vighnesh Birodkar
    Jimmy Yan
    Ming-Chang Chiu
    Hassan Akbari
    Josh Dillon
    Agrim Gupta
    Meera Hahn
    Anja Hauth
    David Hendon
    Alonso Martinez
    Kihyuk Sohn
    Xuan Yang
    Huisheng Wang
    Lu Jiang
    ICML (2024)
    Preview abstract We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/ View details
    Preview abstract While large, generative, multilingual models are rapidly being developed and deployed, their safety and fairness evaluations primarily hinge on resources collected in the English language and some limited translations. This has been demonstrated to be insufficient, and severely lacking in nuances of unsafe language and stereotypes prevalent in different languages and the geographical pockets they are prevalent in. Gathering these resources, at scale, in varied languages and regions also poses a challenge as it requires expansive sociolinguistic knowledge and can also be prohibitively expensive. We utilize an established methodology of coupling LLM generations with distributed annotations to overcome these gaps and create the resource SeeGULL Multilingual, spanning 20 languages across 23 regions. View details
    Preview abstract Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However, identifying objects and their parts only constitutes an intermediate step towards a more fine-grained goal, which is effectively interacting with the functional interactive elements (e.g., handles, knobs, buttons) in the scene to accomplish diverse tasks. To this end, we introduce SceneFun3D, a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information, describing how to interact with these elements, and a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset, we introduce three novel tasks, namely functionality segmentation, task-driven affordance grounding and 3D motion estimation, and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding methods. View details
    Triply efficient shadow tomography
    Robbie King
    David Gosset
    arXiv:2404.19211 (2024)
    Preview abstract Given copies of a quantum state $\rho$, a shadow tomography protocol aims to learn all expectation values from a fixed set of observables, to within a given precision $\epsilon$. We say that a shadow tomography protocol is \textit{triply efficient} if it is sample- and time-efficient, and only employs measurements that entangle a constant number of copies of $\rho$ at a time. The classical shadows protocol based on random single-copy measurements is triply efficient for the set of local Pauli observables. This and other protocols based on random single-copy Clifford measurements can be understood as arising from fractional colorings of a graph $G$ that encodes the commutation structure of the set of observables. Here we describe a framework for two-copy shadow tomography that uses an initial round of Bell measurements to reduce to a fractional coloring problem in an induced subgraph of $G$ with bounded clique number. This coloring problem can be addressed using techniques from graph theory known as \textit{chi-boundedness}. Using this framework we give the first triply efficient shadow tomography scheme for the set of local fermionic observables, which arise in a broad class of interacting fermionic systems in physics and chemistry. We also give a triply efficient scheme for the set of all $n$-qubit Pauli observables. Our protocols for these tasks use two-copy measurements, which is necessary: sample-efficient schemes are provably impossible using only single-copy measurements. Finally, we give a shadow tomography protocol that compresses an $n$-qubit quantum state into a $\poly(n)$-sized classical representation, from which one can extract the expected value of any of the $4^n$ Pauli observables in $\poly(n)$ time, up to a small constant error. View details
    Optimization by Decoded Quantum Interferometry
    Stephen Jordan
    Noah Shutty
    Mary Wootters
    Alexander Schmidhuber
    Robbie King
    Sergei Isakov
    arXiv:2408.08292 (2024)
    Preview abstract We introduce Decoded Quantum Interferometry (DQI), a quantum algorithm for reducing classical optimization problems to classical decoding problems by exploiting structure in the Fourier spectrum of the objective function. DQI reduces sparse max-XORSAT to decoding LDPC codes, which can be decoded using powerful classical algorithms such as belief propagation. As an initial benchmark, we compare DQI using belief propagation decoding against classical optimization via simulated annealing. In this setting we identify a family of max-XORSAT instances where DQI achieves a better approximation ratio on average than simulated annealing, although not better than specialized classical algorithms tailored to those instances. We also analyze a combinatorial optimization problem corresponding to finding polynomials that intersect the maximum number of points. There, DQI efficiently achieves a better approximation ratio than any polynomial-time classical algorithm known to us, thus realizing an apparent exponential quantum speedup. Finally, we show that the problem defined by Yamakawa and Zhandry in order to prove an exponential separation between quantum and classical query complexity is a special case of the optimization problem efficiently solved by DQI. View details
    Preview abstract This is the seventh installment of the Developer Productivity for Humans column. This installment focuses on software quality: what it means, how developers see it, how we break it down into 4 types of quality, and the impact these have on each other. View details
    Quartic Quantum Speedups for Planted Inference Problems
    Alexander Schmidhuber
    Ryan O'Donnell
    arXiv:2406.19378 (2024)
    Preview abstract We describe a quantum algorithm for the Planted Noisy kXOR problem (also known as sparse Learning Parity with Noise) that achieves a nearly quartic (4th power) speedup over the best known classical algorithm while also only using logarithmically many qubits. Our work generalizes and simplifies prior work of Hastings, by building on his quantum algorithm for the Tensor Principal Component Analysis (PCA) problem. We achieve our quantum speedup using a general framework based on the Kikuchi Method (recovering the quartic speedup for Tensor PCA), and we anticipate it will yield similar speedups for further planted inference problems. These speedups rely on the fact that planted inference problems naturally instantiate the Guided Sparse Hamiltonian problem. Since the Planted Noisy kXOR problem has been used as a component of certain cryptographic constructions, our work suggests that some of these are susceptible to super-quadratic quantum attacks. View details
    Preview abstract Relational affect is the affective response (encompassing emotion, expression, feeling) that emerges from an interaction between two people. The case study presented here introduces the concept of relational affect through a human perceptual rating task. Forty-five raters watched short video clips of two people interacting and described their perceived emotion of the individuals and that of the overall interaction. Our qualitative analysis of the rater responses showed that raters used a variety of schemes to reason about emotion, including expressions, context, and perceived appraisal of the event. These reasoning schemes were notably different for perceived individual emotion and relational affect. Our findings show that the vocabulary use for relational affect is distinct from that of individual emotion and relational affect as a phenomenon deepens our understanding of social interactions and moves the field a step closer to realizing the goal of fluid interactions between people and technology. View details
    Preview abstract A product manager’s specific role varies from one company to the next. Still, all product managers balance many aspects of their job, including customers’ needs, a vision for new products, and the project team. So what tools and strategies are needed to create a successful career as a product manager? What are the “5 Things You Need To Create A Successful Career As A Product Manager”? Authority Magazine speaks with Aqsa Fulara, a product manager at Google to answer these questions with stories and insights from her experiences. View details
    SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL
    Shannon Bales
    Matthew Brown
    Jean-Daniel Browne
    Brandon Dolphin
    Romit Kudtarkar
    Andrey Litvinov
    Jingchi Ma
    John Morcos
    Michael Shen
    David Wilhite
    Xi Wu
    Lulan Yu
    Proc. VLDB Endow. (2024), pp. 4051-4063 (to appear)
    Preview abstract SQL has been extremely successful as the de facto standard language for working with data. Virtually all mainstream database-like systems use SQL as their primary query language. But SQL is an old language with significant design problems, making it difficult to learn, difficult to use, and difficult to extend. Many have observed these challenges with SQL, and proposed solutions involving new languages. New language adoption is a significant obstacle for users, and none of the potential replacements have been successful enough to displace SQL. In GoogleSQL, we’ve taken a different approach - solving SQL’s problems by extending SQL. Inspired by a pattern that works well in other modern data languages, we added piped data flow syntax to SQL. The results are transformative - SQL becomes a flexible language that’s easier to learn, use and extend, while still leveraging the existing SQL ecosystem and existing userbase. Improving SQL from within allows incrementally adopting new features, without migrations and without learning a new language, making this a more productive approach to improve on standard SQL. View details
    On the Benefits of Traffic “Reprofiling” The Single Hop Case
    Henry Sariowan
    Jiaming Qiu
    Jiayi Song
    Roch Guerin
    IEEE/ACM Transactions on Networking (2024)
    Preview abstract Datacenters have become a significant source of traffic, much of which is carried over private networks. The operators of those networks commonly have access to detailed traffic profiles and performance goals, which they seek to meet as efficiently as possible. Of interest are solutions that guarantee latency while minimizing network bandwidth. The paper explores a basic building block towards realizing such solutions, namely, a single hop configuration. The main results are in the form of optimal solutions for meeting local deadlines under schedulers of varying complexity and therefore cost. The results demonstrate how judiciously modifying flows’ traffic profiles, i.e., reprofiling them, can help simple schedulers reduce the bandwidth they require, often performing nearly as well as more complex ones. View details
    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
    Chris Donahue
    Dima Kuzmin
    Judith Li
    Kun Su
    Mauro Verzetti
    Qingqing Huang
    Yu Wang
    Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
    Preview abstract Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow. View details
    Preview abstract Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone. In-depth analysis of the rewritten prompts shows that they are not only human readable, but also able to guide manual revision of prompts when there is limited resource to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference. View details