Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

    Secure by Design at Google
    Google Security Engineering (2024)
    Abstract: This whitepaper provides an overview of Google's approach to secure design.
    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
    Chris Donahue
    Dima Kuzmin
    Judith Li
    Kun Su
    Mauro Verzetti
    Qingqing Huang
    Yu Wang
    Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
    Abstract: Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained, general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
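
    As a rough illustration of the conditioning setup described in the abstract, the sketch below wires pre-extracted frame features into an autoregressive audio-token decoder via cross-attention. It is a minimal stand-in, not the V2Meow architecture: the model sizes, vocabulary, and feature dimensions are invented for the example, and the multi-stage structure and text-prompt conditioning are omitted.

    ```python
    import torch
    import torch.nn as nn

    class ToyVideoToMusic(nn.Module):
        """Autoregressive audio-token decoder cross-attending to frame features."""

        def __init__(self, n_audio_tokens=1024, d_model=256, n_heads=4,
                     n_layers=2, d_visual=512):
            super().__init__()
            self.token_emb = nn.Embedding(n_audio_tokens, d_model)
            self.visual_proj = nn.Linear(d_visual, d_model)  # frame features -> memory
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.head = nn.Linear(d_model, n_audio_tokens)

        def forward(self, audio_tokens, frame_features):
            # audio_tokens: (batch, T) int64; frame_features: (batch, F, d_visual)
            memory = self.visual_proj(frame_features)
            x = self.token_emb(audio_tokens)
            t = audio_tokens.size(1)
            causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            h = self.decoder(x, memory, tgt_mask=causal)
            return self.head(h)  # next-token logits

    model = ToyVideoToMusic()
    logits = model(torch.randint(0, 1024, (2, 16)), torch.randn(2, 8, 512))
    print(logits.shape)  # (2, 16, 1024)
    ```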
    Complex Dynamics in Autobidding Systems
    Georgios Piliouras
    Kelly Spendlove
    Proceedings of the 25th ACM Conference on Economics and Computation (2024)
    Abstract: It has become the default in markets such as ad auctions for participants to bid through automated bidding agents (autobidders) that adjust bids over time to satisfy return-over-spend constraints. Despite the prominence of such systems in the internet economy, their resulting dynamical behavior is still not well understood. Although one might hope that such relatively simple systems would typically converge to the equilibria of their underlying auctions, we provide a plethora of results showing the emergence of complex behavior, such as bi-stability, periodic orbits, and quasi-periodicity. We empirically observe how the market structure (expressed as motifs) qualitatively affects the behavior of the dynamics, and complement these observations with theoretical results showing that autobidding systems can simulate both linear dynamical systems and logical Boolean gates.
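
    The flavor of these dynamics is easy to reproduce in a toy simulation. The sketch below is a hypothetical two-agent setup, not the paper's model: each autobidder scales its value by a multiplier and nudges it toward a return-over-spend target in a repeated first-price auction, and the multipliers may cycle rather than settle.

    ```python
    import numpy as np

    # Toy repeated first-price auction with two autobidders (illustrative only).
    values = np.array([1.0, 0.9])   # per-win value for each bidder
    target = 1.0                    # return-over-spend target
    mu = np.array([0.5, 0.5])       # bid multipliers: bid_i = mu_i * value_i
    eta = 0.25                      # step size of the multiplier updates

    trajectory = []
    for step in range(200):
        bids = mu * values
        w = int(np.argmax(bids))               # winner of this round
        ros = values[w] / max(bids[w], 1e-9)   # realized return over spend
        mu[w] *= np.exp(eta * (ros - target))  # winner adapts toward its constraint
        mu[1 - w] *= np.exp(eta * 0.1)         # loser creeps upward to stay competitive
        mu = np.clip(mu, 1e-3, 10.0)
        trajectory.append(mu.copy())

    print(np.array(trajectory)[-10:])  # inspect the tail: often cyclic, not a fixed point
    ```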
    Analysis of objective and subjective sleep metrics and smartphone usage patterns
    Conor Heneghan
    Daniel McDuff
    Ari Winbush
    Nicholas Allen
    John Hernandez
    Allen Jiang
    Andrew Barakat
    Logan Schneider
    Benjamin Nelson
    Ben Yetton
    Abstract: Introduction: The Digital Wellbeing Study is an IRB-approved joint study between the University of Oregon and Google to investigate how smartphone usage interacts with objective and subjective parameters of well-being such as sleep, exercise, and stress. The study recruited a demographically diverse population who each wore a smartwatch and installed a smartphone app linked to the study. Participants completed demographic and health questionnaires including the PROMIS Sleep Disturbance (SD) Short Form. Aims of the study included determining (a) whether objective sleep duration was correlated with smartphone use, and (b) whether smartphone usage could predict the subjective, self-reported sleep instrument. Methods: There was sufficient data from 7,499 users to conduct a population modeling analysis. An Ordinary Least Squares linear model was used as a predictor of each subject's average total sleep time (TST) and their SD t-score. The inputs to the model included demographics and population z-scored activity measures (steps, sedentary time, time driving, time at work, home, and other locations, phone screen time, frequency of phone unlocks) over the seven days prior to the survey. Results: The activity measures and baseline demographics could only explain a small amount of the overall variance in TST and SD (R^2 = 0.04 for TST and R^2 = 0.05 for SD). Phone screen time was a statistically significant predictor of both TST (-8.19 mins, p < 0.001) and self-reported sleep disruption (0.611 t-score units, p < 0.001). The number of phone unlocks was a predictor of variability in TST (-3.33 mins, p < 0.001), suggesting that longer session times are correlated with greater TST variability. The effects are minimal (e.g., a subject with one standard deviation greater phone screen time than average would be predicted to see only a 2% reduction in TST and a 0.6% increase in perceived sleep disturbance). Time driving and step count were also minor predictors of SD and TST. Conclusion: At a population level, average activity measures from wearables and smartphones such as steps, smartphone usage time, and sedentary activity are limited predictors of objective sleep metrics such as total sleep time, and of subjective sleep metrics such as the PROMIS Sleep Disturbance t-score. Support: This research was funded by Google Inc.
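
    To make the modeling setup concrete, here is a minimal sketch of the kind of OLS fit described: z-scored activity features predicting total sleep time. The data below is synthetic, seeded with the screen-time effect size reported in the abstract, and statsmodels stands in for whatever tooling the study actually used.

    ```python
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 7499  # cohort size reported in the abstract

    # z-scored activity measures: steps, sedentary time, screen time, unlocks
    X = rng.standard_normal((n, 4))
    screen_time = X[:, 2]

    # Synthetic TST in minutes; -8.19 min per SD of screen time, per the abstract
    tst = 420 - 8.19 * screen_time + rng.normal(0, 60, n)

    fit = sm.OLS(tst, sm.add_constant(X)).fit()
    print(fit.params.round(2))     # intercept near 420, screen-time slope near -8.19
    print(round(fit.rsquared, 3))  # small R^2, echoing the limited variance explained
    ```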
    Abstract: The article delves into the promise of AI in business intelligence. It briefly reviews the evolution of BI and various Cloud tools, followed by the paradigm shift in how data is consumed. While AI brings huge potential, the article covers areas where enterprises must exercise caution when building intelligent agents to answer data questions.
    Abstract: AI-generated images are proliferating as a new visual medium. However, state-of-the-art image generation models do not output alternative (alt) text with their images, rendering them largely inaccessible to screen reader users (SRUs). Moreover, less is known about what information would be most desirable to SRUs in this new medium. To address this, we invited AI image creators and SRUs to evaluate alt text prepared from various sources and write their own alt text for AI images. Our mixed-methods analysis makes three contributions. First, we highlight creators' perspectives on alt text, as creators are well-positioned to write descriptions of their images. Second, we illustrate SRUs' alt text needs particular to the emerging medium of AI images. Finally, we discuss the promises and pitfalls of utilizing text prompts written as input for AI models in alt text generation, and areas where broader digital accessibility guidelines could expand to account for AI images.
    Generative Powers of Ten
    Xiaojuan Wang
    Steve Seitz
    Ben Mildenhall
    Pratul Srinivasan
    Dor Verbin
    Aleksander Hołyński
    Abstract: We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. This representation allows us to render continuously zooming videos, or explore different scales of the scene interactively. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.
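
    For intuition about the cross-scale consistency constraint, the toy projection below treats each zoom level as a same-resolution image in which level i+1 is a 2x zoom into the center of level i, and averages the overlapping content so adjacent levels agree. It is a geometric sketch only: the paper couples this kind of constraint into joint diffusion sampling, which is omitted here.

    ```python
    import numpy as np

    def avg_pool2(img):
        h, w, c = img.shape
        return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

    def upsample2(img):
        return img.repeat(2, axis=0).repeat(2, axis=1)

    def consistency_step(levels):
        """Average overlapping content so adjacent zoom levels agree.

        levels[i+1] is a 2x zoom into the center of levels[i], so pooling
        levels[i+1] should reproduce the central crop of levels[i].
        """
        for i in range(len(levels) - 1):
            h = levels[i].shape[0]
            a, b = h // 4, 3 * h // 4
            crop = levels[i][a:b, a:b]
            blend = 0.5 * (crop + avg_pool2(levels[i + 1]))
            levels[i][a:b, a:b] = blend
            levels[i + 1] = upsample2(blend)
        return levels

    rng = np.random.default_rng(0)
    levels = [rng.random((64, 64, 3)) for _ in range(3)]
    levels = consistency_step(levels)
    # The deepest pair agrees exactly after one sweep; repeated sweeps
    # tighten the earlier pairs as well.
    print(np.allclose(avg_pool2(levels[2]), levels[1][16:48, 16:48]))
    ```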
    Abstract: Reinforcement learning can be a useful tool for solving combinatorial problems, even in the presence of constraints. This presentation details two use cases: an industrial application in the field of logistics, and a more abstract problem in combinatorial optimization.
    Abstract: We're roughly 10 years into the OpenConfig journey. We have implementations in hand from various vendors, and we've gained significant operational experience in the domains of streaming telemetry and in developing configuration systems that leverage the developed models. What have we learned? Are the abstractions we've generated the right ones? If not, why? Were we too influenced by the tools and inertia of the time when we made some critical decisions? How do we need to evolve going forward? This discussion is part retrospective, part introspective: a candid look at where we've been and what we need to think about as we evolve the next generation of our management (and control) planes. What should we be thinking about as network engineers who write software?
    Conversational AI in health: Design considerations from a Wizard-of-Oz dermatology case study with users, clinicians and a medical LLM
    Brenna Li
    Amy Wang
    Patricia Strachan
    Julie Anne Seguin
    Sami Lachgar
    Karyn Schroeder
    Renee Wong
    Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, pp. 10
    Abstract: Although skin concerns are common, access to specialist care is limited. Artificial intelligence (AI)-assisted tools to support medical decisions may provide patients with feedback on their concerns while also helping ensure the most urgent cases are routed to dermatologists. Although AI-based conversational agents have been explored recently, how they are perceived by patients and clinicians is not well understood. We conducted a Wizard-of-Oz study involving 18 participants with real skin concerns. Participants were randomly assigned to interact with either a clinician agent (portrayed by a dermatologist) or an LLM agent (supervised by a dermatologist) via synchronous multimodal chat. In both conditions, participants found the conversation helpful in understanding their medical situation and in alleviating their concerns. Through qualitative coding of the conversation transcripts, we provide insight into the importance of empathy and effective information-seeking. We conclude with design considerations for future AI-based conversational agents in healthcare settings.
    The Case for Validating Inputs in Software-Defined WANs
    Rishabh Iyer
    Isaac Keslassy
    Sylvia Ratnasamy
    The 23rd ACM Workshop on Hot Topics in Networks (HOTNETS ’24), ACM, Irvine, CA (2024) (to appear)
    Abstract: We highlight a problem that the networking community has largely overlooked: ensuring that the inputs to network controllers in software-defined WANs are accurate. We show that “incorrect” inputs are a common cause of major outages in practice and propose new directions to address them.
    FrameQuant: Flexible Low-Bit Quantization for Transformers
    Harshavardhan Adepu
    Zhanpeng Zeng
    Vikas Singh
    International Conference on Machine Learning (2024)
    Abstract: Transformers are the backbone of powerful foundation models for many vision and natural language processing tasks. But their compute and memory/storage footprint is large, so serving such models is expensive, often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, denoising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
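
    A small numerical sketch of the core idea: quantize in an overcomplete frame representation rather than in weight space, so that part of the quantization noise is projected away at reconstruction. The frame here is a random Parseval frame and the quantizer a plain uniform 2-bit grid; both are illustrative choices, not the paper's construction.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 64, 128                  # weight dimension and (overcomplete) frame size
    w = rng.standard_normal(n)

    # Random Parseval frame: the m rows of Q span R^n and Q.T @ Q = I
    Q, _ = np.linalg.qr(rng.standard_normal((m, n)))

    def quantize_uniform(x, bits=2):
        levels = 2 ** bits
        lo, hi = x.min(), x.max()
        step = (hi - lo) / (levels - 1)
        return lo + np.round((x - lo) / step) * step

    # Quantize directly in weight space ...
    w_direct = quantize_uniform(w)
    # ... versus in the frame representation, then synthesize back
    w_frame = Q.T @ quantize_uniform(Q @ w)

    print(np.linalg.norm(w - w_direct))  # direct 2-bit error
    print(np.linalg.norm(w - w_frame))   # typically smaller: noise shrinks ~ n/m
    ```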
    LabelMaker: Automatic Semantic Label Generation from RGB-D Trajectories
    Silvan Weder
    Hermann Blum
    Francis Engelmann
    Marc Pollefeys
    3DV (2024)
    Abstract: Semantic annotations are indispensable to train or evaluate perception models, yet very costly to acquire. This work introduces a fully automated 2D/3D labeling framework that, without any human intervention, can generate labels for RGB-D scans at a level of accuracy equal to (or better than) that of comparable manually annotated datasets such as ScanNet. Our approach is based on an ensemble of state-of-the-art segmentation models and 3D lifting through neural rendering. We demonstrate the effectiveness of our LabelMaker pipeline by generating significantly better labels for the ScanNet datasets and automatically labelling the previously unlabeled ARKitScenes dataset. Code and models are available at https://labelmaker.org/
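
    One ingredient of such a pipeline that is easy to illustrate is aggregating an ensemble of per-pixel predictions. The sketch below is a simple majority vote over integer label maps; LabelMaker's actual pipeline also lifts labels into 3D via neural rendering, which is far beyond this snippet.

    ```python
    import numpy as np

    def ensemble_vote(label_maps, num_classes):
        """Per-pixel majority vote over stacked segmentation outputs.

        label_maps: int array of shape (k, H, W), one label map per model.
        Returns an (H, W) map with the most frequent class per pixel.
        """
        counts = np.stack([(label_maps == c).sum(axis=0)
                           for c in range(num_classes)])
        return counts.argmax(axis=0)

    rng = np.random.default_rng(0)
    preds = rng.integers(0, 5, size=(3, 4, 4))  # 3 models, 4x4 image, 5 classes
    print(ensemble_vote(preds, num_classes=5))
    ```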
    A versatile, semi-automated image analysis workflow for time-lapse camera trap image classification
    Hanna Böhner
    Olga Pokrovskaya
    Desheng Liu
    Natalia Sokolova
    Olivier Gilg
    Wenbo Zhou
    Ivan Fufachev
    Peter Ungar
    Rolf Anker Ims
    Alexsandr Sokolov
    Dorothee Ehrich
    Gerardo Celis
    Ecological Informatics (2024)
    Abstract: Camera traps are a powerful, practical, and non-invasive method used widely to monitor animal communities and evaluate management actions. However, camera trap arrays can generate thousands to millions of images that require significant time and effort to review. Computer vision has emerged as a tool to accelerate this image review process. We propose a multi-step, semi-automated workflow which takes advantage of site-specific and generalizable models to improve detections and consists of (1) automatically identifying and removing low-quality images in parallel with classification into animals, humans, vehicles, and empty, (2) automatically cropping objects from images and classifying them (rock, bait, empty, and species), and (3) manually inspecting a subset of images. We trained and evaluated this approach using 548,627 images from 46 cameras in two regions of the Arctic: “Finnmark” (Finnmark County, Norway) and “Yamal” (Yamalo-Nenets Autonomous District, Russia). The automated steps yield image classification accuracies of 92% and 90% for the Finnmark and Yamal sets, respectively, reducing the number of images that required manual inspection to 9.2% of the Finnmark set and 3.9% of the Yamal set. The amount of time invested in developing models would be offset by the time saved from automation in about three seasons/years. Researchers can modify this multi-step process to develop their own site-specific models and meet other needs for monitoring and surveying wildlife, balancing the acceptable levels of false negatives and positives.
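
    The multi-step triage structure is straightforward to express in code. The sketch below uses stand-in scoring functions and made-up thresholds purely to show the control flow: filter low-quality frames, classify coarsely, refine animal detections, and route low-confidence cases to manual review.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for trained models; real scores would come from the classifiers.
    def quality_score(img):
        return rng.uniform()

    def coarse_class(img):  # animal / human / vehicle / empty
        p = rng.dirichlet(np.ones(4))
        return ["animal", "human", "vehicle", "empty"][int(p.argmax())], p.max()

    def fine_class(img):    # rock / bait / empty / species (cropping omitted)
        p = rng.dirichlet(np.ones(5))
        return ["rock", "bait", "empty", "fox", "reindeer"][int(p.argmax())], p.max()

    def triage(images, q_min=0.3, conf_min=0.9):
        auto, manual = [], []
        for img in images:
            if quality_score(img) < q_min:
                continue  # step 1a: drop low-quality frames
            label, conf = coarse_class(img)    # step 1b: coarse classes
            if label == "animal":
                label, conf = fine_class(img)  # step 2: refine detections
            (auto if conf >= conf_min else manual).append((label, round(conf, 2)))
        return auto, manual

    auto, manual = triage(range(20))
    print(len(auto), "auto-accepted;", len(manual), "routed to manual review")
    ```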
    Abstract: Effective model calibration is a critical and indispensable component in developing Media Mix Models (MMMs). One advantage of Bayesian-based MMMs lies in their capacity to accommodate information from experiment results and the modelers' domain knowledge about ad effectiveness by setting priors for the model parameters. However, it remains ambiguous how, and which, Bayesian priors should be tuned for calibration purposes. In this paper, we propose a new calibration method through model reparameterization. The reparameterized model includes Return on Ads Spend (ROAS) as a model parameter, enabling straightforward adjustment of its prior distribution to align with either experiment results or the modeler's prior knowledge. The proposed method also helps address several key challenges in combining MMMs and incrementality experiments. We use simulations to demonstrate that our approach can significantly reduce the bias and uncertainty in the resultant posterior ROAS estimates.
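
    The reparameterization is simple to state in code: place the prior on ROAS itself and derive the media coefficient deterministically from it. The sketch below uses a toy adstock/Hill transform and a log-normal ROAS prior; the transforms, prior, and variable names are illustrative assumptions, not the paper's specification.

    ```python
    import numpy as np

    def adstock(x, decay=0.6):
        out, carry = np.zeros_like(x, dtype=float), 0.0
        for t, v in enumerate(x):
            carry = v + decay * carry
            out[t] = carry
        return out

    def hill(x, k=2.0, s=1.0):
        return x**s / (x**s + k**s)

    rng = np.random.default_rng(1)
    spend = rng.gamma(2.0, 1.0, size=104)  # weekly media spend
    media = hill(adstock(spend))           # transformed media signal

    # Prior directly on ROAS (e.g., centered on an experiment readout of 2.0)
    roas = rng.lognormal(mean=np.log(2.0), sigma=0.3, size=5000)

    # Implied coefficient: expected incremental revenue = beta * sum(media),
    # and ROAS = beta * sum(media) / sum(spend), so
    # beta = roas * sum(spend) / sum(media)
    beta = roas * spend.sum() / media.sum()
    print(round(float(np.median(beta)), 2))  # implied prior median for beta
    ```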