Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10053 publications
    Privacy-Preserving Instructions for Aligning Large Language Models
    Da Yu
    Sewoong Oh
    International Conference on Machine Learning (ICML) (2024)
    Preview abstract Service providers of large language model (LLM) applications collect user instructions in the wild and use them in further aligning LLMs with users’ intentions. These instructions, which potentially contain sensitive information, are annotated by human workers in the process. This poses a new privacy risk not addressed by the typical private optimization. To this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. Formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. Crucial in achieving the desired utility is our novel filtering algorithm that matches the distribution of the synthetic instructions to that of the real ones. In both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing comparable results to real instructions. In supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as Vicuna View details
    "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
    Michael Xieyang Liu
    Frederick Liu
    Alex Fiannaca
    Terry Koo
    In Extended Abstract in ACM CHI Conference on Human Factors in Computing Systems (CHI EA '24), ACM (2024), pp. 9 (to appear)
    Preview abstract Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective. We identified 134 concrete use cases for constraints at two levels: low-level, which ensures the output adhere to a structured format and an appropriate length, and high-level, which requires the output to follow semantic and stylistic guidelines without hallucination. Critically, applying output constraints could not only streamline the currently repetitive process of developing, testing, and integrating LLM prompts for developers, but also enhance the user experience of LLM-powered features and applications. We conclude with a discussion on user preferences and needs towards articulating intended constraints for LLMs, alongside an initial design for a constraint prototyping tool. View details
    USM-SCD: USM-Based Multilingual Speaker Change Detection
    Yongqiang Wang
    Jason Pelecanos
    Yu Zhang
    Yiling Huang
    Han Lu
    ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11801-11805
    Preview abstract We introduce a multilingual speaker change detection model (USM- SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of supervised and unsupervised data, demonstrating the utility of fine-tuning from a large generic foundation model for a downstream task. We analyze the performance of this multilingual speaker change detection model through a series of ablation studies. We show that the USM-SCD model can achieve more than 75% average speaker change detection F1 score across a test set that consists of data from 96 languages. On American English, the USM-SCD model can achieve an 85.8% speaker change detection F1 score across various public and internal test sets, beating the previous monolingual baseline model by 21% relative. We also show that we only need to fine-tune one-quarter of the trainable model parameters to achieve the best model performance. The USM-SCD model exhibits state-of-the-art ASR quality compared with a strong public ASR baseline, making it suitable to handle both tasks with negligible additional computational cost. View details
    ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout Transition
    Brian Moreno Collins
    Karthik Ramani
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 16 (to appear)
    Preview abstract Remote video conferencing systems (RVCS) are widely adopted in personal and professional communication. However, they often lack the co-presence experience of in-person meetings. This is largely due to the absence of intuitive visual cues and clear spatial relationships among remote participants, which can lead to speech interruptions and loss of attention. This paper presents ChatDirector, a novel RVCS that overcomes these limitations by incorporating space-aware visual presence and speech-aware attention transition assistance. ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene. We also contribute a decision tree algorithm that directs the avatar layouts and behaviors based on participants' speech states. We report on results from a user study (N=16) where we evaluated ChatDirector. The satisfactory algorithm performance and complimentary subject user feedback imply that ChatDirector significantly enhances communication efficacy and user engagement. View details
    Optimizing quantum gates towards the scale of logical qubits
    Alexandre Bourassa
    Andrew Dunsworth
    Will Livingston
    Vlad Sivak
    Trond Andersen
    Yaxing Zhang
    Desmond Chik
    Jimmy Chen
    Charles Neill
    Alejo Grajales Dau
    Anthony Megrant
    Alexander Korotkov
    Vadim Smelyanskiy
    Yu Chen
    Nature Communications, 15 (2024), pp. 2442
    Preview abstract A foundational assumption of quantum error correction theory is that quantum gates can be scaled to large processors without exceeding the error-threshold for fault tolerance. Two major challenges that could become fundamental roadblocks are manufacturing high-performance quantum hardware and engineering a control system that can reach its performance limits. The control challenge of scaling quantum gates from small to large processors without degrading performance often maps to non-convex, high-constraint, and time-dynamic control optimization over an exponentially expanding configuration space. Here we report on a control optimization strategy that can scalably overcome the complexity of such problems. We demonstrate it by choreographing the frequency trajectories of 68 frequency-tunable superconducting qubits to execute single- and two-qubit gates while mitigating computational errors. When combined with a comprehensive model of physical errors across our processor, the strategy suppresses physical error rates by ~3.7× compared with the case of no optimization. Furthermore, it is projected to achieve a similar performance advantage on a distance-23 surface code logical qubit with 1057 physical qubits. Our control optimization strategy solves a generic scaling challenge in a way that can be adapted to a variety of quantum operations, algorithms, and computing architectures. View details
    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
    Nitesh Bharadwaj Gundavarapu
    Luca Versari
    Kihyuk Sohn
    Agrim Gupta
    Xiuye Gu
    Alex Hauptmann
    Boqing Gong
    Lu Jiang
    ICLR (2024)
    Preview abstract While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks. View details
    DySLIM: Dynamics Stable Learning by Invariant Measure for Chaotic Systems
    Yair Schiff
    Jeff Parker
    Volodymyr Kuleshov
    International Conference on Machine Learning (ICML) (2024)
    Preview abstract Learning dynamics from dissipative chaotic systems is notoriously difficult due to their inherent instability, as formalized by their positive Lyapunov exponents, which exponentially amplify errors in the learned dynamics. However, many of these systems exhibit ergodicity and an attractor: a compact and highly complex manifold, to which trajectories converge in finite-time, that supports an invariant measure, i.e., a probability distribution that is invariant under the action of the dynamics, which dictates the long-term statistical behavior of the system. In this work, we leverage this structure to propose a new framework that targets learning the invariant measure as well as the dynamics, in contrast with typical methods that only target the misfit between trajectories, which often leads to divergence as the trajectories’ length increases. We use our framework to propose a tractable and sample efficient objective that can be used with any existing learning objectives. Our Dynamics Stable Learning by Invariant Measure (DySLIM) objective enables model training that achieves better point-wise tracking and long-term statistical accuracy relative to other learning objectives. By targeting the distribution with a scalable regularization term, we hope that this approach can be extended to more complex systems exhibiting slowly-variant distributions, such as weather and climate models. Code to reproduce our experiments is available here: https://github.com/google-research/swirl-dynamics/tree/main/swirl_dynamics/projects/ergodic. View details
    Resolving Code Review Comments with Machine Learning
    Alexander Frömmgen
    Peter Choy
    Elena Khrapko
    Marcus Revaj
    2024 IEEE/ACM 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (to appear)
    Preview abstract Code reviews are a critical part of the software development process, taking a significant amount of the code authors’ and the code reviewers’ time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ∼60 minutes active shepherding time between sending changes for review and finally submitting the change. In our measurements, the required active work time that the code author must devote to address reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code-review process, e.g., by proposing code changes based on a comment’s text. We describe our application of recent advances in large sequence models in a real-world setting to automatically resolve code-review comments in the day-to-day development workflow at Google. We present the evolution of this feature from an asynchronous generation of suggested edits after the reviewer sends feedback, to an interactive experience that suggests code edits to the reviewer at review time. In deployment, code-change authors at Google address 7.5% of all reviewer comments by applying an ML-suggested edit. The impact of this will be to reduce the time spent on code reviews by hundreds of thousands of engineer hours annually at Google scale. Unsolicited, very positive feedback highlights that the impact of ML-suggested code edits increases Googlers’ productivity and allows them to focus on more creative and complex tasks. View details
    TextMesh: Generation of Realistic 3D Meshes From Text Prompts
    Christina Tsalicoglou
    Fabian Manhardt
    Michael Niemeyer
    3DV 2024 (2024)
    Preview abstract The ability to generate highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises if this can be also achieved in the generation of 3D content from such text prompts. To this end, a new line of methods recently emerged trying to harness diffusion models, trained on 2D images, for supervision of 3D model generation using view dependent prompts. While achieving impressive results, these methods, however, have two major drawbacks. First, rather than commonly used 3D meshes, they instead generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish looking effect. Therefore, in this work we propose a novel method for generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to finetune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh. View details
    Limoncello: Prefetchers for Scale
    Carlos Villavieja
    Baris Kasikci
    Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, United States (2024)
    Preview abstract This paper presents Limoncello, a novel software system that dynamically configures data prefetching for high utilization systems. We demonstrate that in resource-constrained environments, such as large data centers, traditional methods of hardware prefetching can increase memory latency and decrease available memory bandwidth. To address this, Limoncello dynamically configures data prefetching, disabling hardware prefetchers when memory bandwidth utilization is high and leveraging targeted software prefetching to reduce cache misses when hardware prefetchers are disabled. Limoncello is software-centric and does not require any modifications to hardware. Our evaluation of the deployment on a real-world hyperscale system reveals that Limoncello unlocks significant performance gains for high-utilization systems: it improves application throughput by 10%, due to a 15% reduction in memory latency, while maintaining minimal change in cache miss rate for targeted library functions. View details
    Preview abstract Welcome to the 16th edition of this column on recent books and journal articles in the field of public opinion, survey methods, survey statistics, Big Data, data science, and user experience research. Special issues of journals have a space in this article because, in our view, they are like edited books. We also added review papers from the journal series of Annual Reviews because these papers are seminal state of the art write ups, a mini book, if you wish on a specific subject. This article is an update of the books and journals published in the 2022 article. Like the previous year, the books are organized by topic; this should help the readers to focus on their interests. You will note that we use very broad definitions of public opinion, survey methods, survey statistics, Big Data, data science, and user experience research. This is because there are many books published in different outlets that can be very useful to the readers of Survey Practice, even if they do not come from traditional sources of survey content. It is unlikely we have exhaustively listed all new books in each subcategory; we did our best scouting different resources and websites, but we take full responsibility for any omissions. The list is also focused only on books published in the English language and available for purchase (as an ebook or in print) at the time of this review (April 2024) and with the printed copyright year of 2023. Books are listed based on the relevance to the topic, and no judgment is made in terms of quality of the content. We let the readers do so. If you want to send information for the next issue, please send it to surveypractice.new.books@gmail.com. View details
    Scalable Learning of Segment-Level Traffic Congestion Functions
    Shushman Choudhury
    Aboudy Kreidieh
    Alexandre Bayen
    IEEE Intelligent Transportation Systems Conference (2024)
    Preview abstract We propose and study a data-driven framework for identifying traffic congestion functions (numerical relationships between observations of traffic variables) at global scale and segment-level granularity. In contrast to methods that estimate a separate set of parameters for each roadway, ours learns a single black-box function over all roadways in a metropolitan area. First, we pool traffic data from all segments into one dataset, combining static attributes with dynamic time-dependent features. Second, we train a feed-forward neural network on this dataset, which we can then use on any segment in the area. We evaluate how well our framework identifies congestion functions on observed segments and how it generalizes to unobserved segments and predicts segment attributes on a large dataset covering multiple cities worldwide. For identification error on observed segments, our single data-driven congestion function compares favorably to segment-specific model-based functions on highway roads, but has room to improve on arterial roads. For generalization, our approach shows strong performance across cities and road types: both on unobserved segments in the same city and on zero-shot transfer learning between cities. Finally, for predicting segment attributes, we find that our approach can approximate critical densities for individual segments using their static properties. View details
    Preview abstract Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets. View details
    Preview abstract Specialized Large multi-modal models (LMMs) have exhibited remarkable performance across numerous tasks, however, generalist LMMs suffer from performance degradation when training with a large collection of tasks. Recent research suggests Mixture of Experts (MoE) Models help instruction tuning, however, for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA that softly mixes many multimodal low rank experts to large models without introducing significant new parameter count compared to conventional MoE models. The core idea is that the large model provides a foundational backbone and different lightweight experts learn specialized knowledge residually. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of visual question answering and captioning tasks, achieving a new state-of-the-art generalist performance that matches or outperforms single specialized LMM baselines. View details
    Preview abstract We present SPHEAR, an accurate, differentiable parametric statistical 3D human head model, enabled by a novel 3D registration method based on spherical embeddings. We shift the paradigm away from the classical Non-Rigid Registration methods, which operate under various surface priors, increasing reconstruction fidelity and minimizing required human intervention. Additionally, SPHEAR is a complete model that allows not only to sample diverse synthetic head shapes and facial expressions, but also gaze directions, high-resolution color textures, surface normal maps, and hair cuts represented in detail, as strands. SPHEAR can be used for automatic realistic visual data generation, semantic annotation, and general reconstruction tasks. Compared to state-of-the-art approaches, our components are fast and memory efficient, and experiments support the validity of our design choices and the accuracy of registration, reconstruction and generation techniques. View details