Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

    USER-LLM: Efficient LLM Contextualization with User Embedding
    Jiaxing Wu
    Neo Wu
    Devora Berlowitz
    Sushant Prakash
    Bradley Green
    Shawn O'Banion
    Jun Xie
    arXiv (2024) (to appear)
    Large language models (LLMs) have revolutionized natural language processing. However, effectively incorporating complex and potentially noisy user interaction data remains a challenge. To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs. These embeddings, distilled from diverse user interactions using self-supervised pretraining, capture latent user preferences and their evolution over time. We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context. Our comprehensive experiments on MovieLens, Amazon Review, and Google Local Review datasets demonstrate significant performance gains across various tasks. Notably, our approach outperforms text-prompt-based contextualization on long sequence tasks and tasks that require deep user understanding while being computationally efficient. We further incorporate Perceiver layers to streamline the integration between user encoders and LLMs, reducing computational demands.
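    The cross-attention integration described in this abstract can be sketched as follows. This is a minimal PyTorch sketch under assumed dimensions and module names (UserCrossAttentionBlock, d_model, and n_heads are illustrative, not the paper's actual architecture):

        import torch
        import torch.nn as nn

        class UserCrossAttentionBlock(nn.Module):
            """Minimal sketch: LLM token states attend over user embeddings."""
            def __init__(self, d_model: int = 512, n_heads: int = 8):
                super().__init__()
                self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.norm = nn.LayerNorm(d_model)

            def forward(self, llm_hidden: torch.Tensor, user_emb: torch.Tensor) -> torch.Tensor:
                # llm_hidden: (batch, seq_len, d_model) hidden states from the LLM
                # user_emb:   (batch, n_user_tokens, d_model) pretrained user embeddings
                attended, _ = self.cross_attn(query=llm_hidden, key=user_emb, value=user_emb)
                return self.norm(llm_hidden + attended)  # residual connection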
    Enhancing Trust and Safety in Digital Payments: An LLM-Powered Approach
    Anant Modwal
    Govind Kaushal
    Ramanan Balakrishnan
    Shanay Shah
    Monu Agrawal
    Justin Lin
    Prakash Hariramani
    Priya Mandawat
    Rutvik Karve
    Naveen Madiraju
    Digital payment systems have revolutionized financial transactions, offering unparalleled convenience and accessibility to users worldwide. However, the increasing popularity of these platforms has also attracted malicious actors seeking to exploit their vulnerabilities for financial gain. To address this challenge, robust and adaptable scam detection mechanisms are crucial for maintaining the trust and safety of digital payment ecosystems. This paper presents a comprehensive approach to scam detection, focusing on the Unified Payments Interface (UPI) in India, with Google Pay (GPay) as a specific use case. The approach leverages Large Language Models (LLMs) to enhance scam classification accuracy and designs a digital assistant to aid human reviewers in identifying and mitigating fraudulent activities. The results demonstrate the potential of LLMs in augmenting existing machine learning models and improving the efficiency, accuracy, quality, and consistency of scam reviews, ultimately contributing to a safer and more secure digital payment landscape. Our evaluation of the Gemini Ultra model on curated transaction data showed 93.33% accuracy in scam classification. Furthermore, the model demonstrated 89% accuracy in generating reasoning for these classifications. Promisingly, in 32% of cases the model identified accurate new reasons for suspected scams that human reviewers had not included in the review notes.
    SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
    Alexandros Delitzas
    Ayça Takmaz
    Marc Pollefeys
    Francis Engelmann
    CVPR (2024) (to appear)
    Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However, identifying objects and their parts only constitutes an intermediate step towards a more fine-grained goal, which is effectively interacting with the functional interactive elements (e.g., handles, knobs, buttons) in the scene to accomplish diverse tasks. To this end, we introduce SceneFun3D, a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information, describing how to interact with these elements, and a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset, we introduce three novel tasks, namely functionality segmentation, task-driven affordance grounding and 3D motion estimation, and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding methods.
    Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
    Phil A. Chou
    Hugues Hoppe
    Danhang Tang
    Jonathan Taylor
    Philip Davidson
    arXiv:2402.05887 (2024)
    We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec’s performance on its intended content, but also effectively adapts the codec to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit “neural code images” that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec’s design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, higher dynamic range, and perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains) across these adaptations. We derive VQ equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, and sandwich configurations that offer interesting potential in image/video compression and streaming.
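    A minimal sketch of the joint training idea, assuming a toy differentiable codec proxy (CodecProxy below is a crude additive-noise stand-in; designing proxies that approximate real standard codecs is part of the paper's contribution and is not reproduced here):

        import torch
        import torch.nn as nn

        class CodecProxy(nn.Module):
            """Hypothetical stand-in for a differentiable codec proxy."""
            def forward(self, x):
                rate = x.abs().mean()                # crude rate surrogate
                noise = torch.rand_like(x) - 0.5     # quantization-noise stand-in
                return x + noise * 0.01, rate

        pre = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # pre-processing network
        post = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # post-processing network
        proxy = CodecProxy()
        opt = torch.optim.Adam(list(pre.parameters()) + list(post.parameters()), lr=1e-4)

        lam = 0.01  # rate-distortion trade-off weight (illustrative)
        for _ in range(10):
            x = torch.rand(4, 3, 64, 64)              # stand-in training batch
            code_img, rate = proxy(pre(x))            # "neural code images"
            x_hat = post(code_img)
            loss = nn.functional.mse_loss(x_hat, x) + lam * rate
            opt.zero_grad(); loss.backward(); opt.step()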
    The state-of-the-art for training on-device language models for mobile keyboard applications combines federated learning (FL) with differential privacy (DP) via the DP-Follow-the-Regularized-Leader (DP-FTRL) algorithm. Two variants of DP-FTRL are used in practice: tree aggregation and matrix factorization. However, tree aggregation suffers from significantly suboptimal privacy/utility tradeoffs, while matrix mechanisms require expensive optimization parameterized by hard-to-estimate-in-advance constants, as well as high runtime memory costs. This paper extends the recently introduced Buffered Linear Toeplitz (BLT) mechanism to multi-participation scenarios. Our BLT-DP-FTRL maintains the ease-of-use advantages of tree aggregation, while essentially matching matrix factorization in terms of utility and privacy. We evaluate BLT-DP-FTRL on the StackOverflow dataset, serving as a reproducible simulation benchmark, and across four on-device language model tasks in a production FL system. Our empirical results highlight the advantages of the BLT mechanism and elevate the practicality and effectiveness of DP in real-world scenarios.
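    A heavily simplified sketch of one noisy aggregation round in this setting (clipping plus additive noise; the actual DP-FTRL algorithm privatizes prefix sums of updates, and the BLT mechanism's contribution is how correlated noise is generated across rounds, which is only stubbed here as i.i.d. Gaussian noise):

        import numpy as np

        def noisy_aggregation_round(model, client_updates, clip_norm, noise_fn, lr):
            """Clip each client update, sum, add noise, and take a step.

            `noise_fn` abstracts the noise mechanism (tree aggregation, matrix
            factorization, or BLT in the paper); it is stubbed below.
            """
            clipped = []
            for u in client_updates:
                norm = np.linalg.norm(u)
                clipped.append(u * min(1.0, clip_norm / max(norm, 1e-12)))
            noisy_sum = np.sum(clipped, axis=0) + noise_fn(model.shape)
            return model - lr * noisy_sum / len(client_updates)

        noise_fn = lambda shape: np.random.normal(0.0, 0.1, size=shape)
        model = np.zeros(10)
        updates = [np.random.randn(10) for _ in range(32)]
        model = noisy_aggregation_round(model, updates, clip_norm=1.0, noise_fn=noise_fn, lr=0.1)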
    Motivated by the increased adoption of autobidding algorithms in internet advertising markets, we study the design of optimal mechanisms for selling items to a value-maximizing buyer with a return-on-spend constraint. The buyer's values and target ratio in the return-on-spend constraint are private. We restrict attention to deterministic sequential screening mechanisms that can be implemented as a menu of prices paid for purchasing an item or not. The main result of this paper is to provide a characterization of an optimal mechanism. Surprisingly, we show that the optimal mechanism does not require target screening, i.e., offering a single pair of prices is optimal for the seller. The optimal mechanism is a subsidized posted price that provides a subsidy to the buyer to encourage participation and then charges a fixed unit price for each item sold. The seller's problem is a challenging non-linear mechanism design problem, and a key technical contribution of our work is to provide a novel approach to analyze non-linear pricing contracts.
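    A toy illustration of how a subsidized posted price interacts with a return-on-spend constraint (the greedy buyer model, function names, and numbers below are illustrative, not the paper's formal setup):

        def buyer_purchases(values, price, subsidy, target_ratio):
            """Greedy sketch of a value-maximizing buyer facing a subsidized
            posted price: the buyer receives `subsidy` for participating, pays
            `price` per item, and keeps total value >= target_ratio * net spend
            (the return-on-spend constraint).
            """
            bought, total_value, net_spend = [], 0.0, -subsidy  # subsidy offsets spend
            for v in sorted(values, reverse=True):
                if total_value + v >= target_ratio * (net_spend + price):
                    bought.append(v)
                    total_value += v
                    net_spend += price
            return bought

        # The subsidy lets the buyer purchase items it could not otherwise
        # afford under its return-on-spend constraint.
        print(buyer_purchases([3.0, 2.0, 1.0], price=1.5, subsidy=1.0, target_ratio=2.0))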
    Partially Interpretable Models with Guarantees on Coverage and Accuracy
    Nave Frost
    Zachary Lipton
    Michal Moshkovitz
    Algorithmic Learning Theory (ALT) (2024)
    Simple, sufficient explanations furnished by short decision lists can be useful for guiding stakeholder actions. Unfortunately, this transparency can come at the expense of the higher accuracy enjoyed by black box methods, like deep nets. To date, practitioners typically either (i) insist on the simpler model, forsaking accuracy; or (ii) insist on maximizing accuracy, settling for post-hoc explanations of dubious faithfulness. In this paper, we propose a hybrid partially interpretable model that represents a compromise between the two extremes. In our setup, each input is first processed by a decision list that can either execute a decision or abstain, handing off authority to the opaque model. The key to optimizing the decision list is to optimally trade off the accuracy of the composite system against coverage (the fraction of the population that receives explanations). We contribute a new principled algorithm for constructing partially interpretable decision lists, providing theoretical guarantees addressing both interpretability and accuracy. As an instance of our result, we prove that when the optimal decision list has length k, coverage c, and b mistakes, our algorithm will generate a decision list that has length no greater than 4k, coverage at least c/2, and makes at most 4b mistakes. Finally, we empirically validate the effectiveness of the new model.
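    A minimal sketch of the composite system: a decision list that either decides or abstains, with abstentions handed to the black box (names and the toy data are illustrative):

        def composite_predict(x, decision_list, black_box):
            """`decision_list` is a sequence of (condition, label) rules; if no
            rule fires, authority passes to the opaque `black_box` model."""
            for condition, label in decision_list:
                if condition(x):
                    return label, True    # covered: prediction comes with an explanation
            return black_box(x), False    # abstain: fall back to the black box

        def coverage_and_accuracy(data, decision_list, black_box):
            covered = correct = 0
            for x, y in data:
                pred, explained = composite_predict(x, decision_list, black_box)
                covered += explained
                correct += (pred == y)
            n = len(data)
            return covered / n, correct / n

        # Toy example: one interpretable rule, black box handles the rest.
        rules = [(lambda x: x < 0, "negative")]
        bb = lambda x: "positive"
        data = [(-1, "negative"), (2, "positive"), (3, "positive")]
        print(coverage_and_accuracy(data, rules, bb))  # (0.333..., 1.0)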
    SCOREQ: Speech Quality Assessment with Contrastive Regression
    Alessandro Ragano
    Andrew Hines
    NeurIPS (2024) (to appear)
    In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. In the paper we: (i) illustrate the problem of L2 loss training failing to capture the continuous nature of the mean opinion score (MOS) labels; (ii) demonstrate the lack of generalisation through a benchmarking evaluation across several speech domains; (iii) outline our approach and explore the impact of the architectural design decisions through incremental evaluation; (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ. We conclude that using a triplet loss function for contrastive regression improves the generalisation of no-reference speech quality metrics.
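    A sketch of a triplet loss for contrastive regression, assuming triplets mined so that the positive's MOS label is closer to the anchor's than the negative's (the mining strategy and margin below are illustrative, not SCOREQ's exact formulation):

        import torch
        import torch.nn.functional as F

        def triplet_regression_loss(emb_anchor, emb_pos, emb_neg, margin=0.1):
            """Triplet loss over embeddings: pull the positive (closer MOS
            label) toward the anchor and push the negative away."""
            d_pos = F.pairwise_distance(emb_anchor, emb_pos)
            d_neg = F.pairwise_distance(emb_anchor, emb_neg)
            return F.relu(d_pos - d_neg + margin).mean()

        # Usage with stand-in embeddings from some speech encoder:
        emb_a, emb_p, emb_n = (torch.randn(8, 128) for _ in range(3))
        loss = triplet_regression_loss(emb_a, emb_p, emb_n)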
    Online navigation platforms are well optimized to solve for the standard objective of minimizing travel time, and typically require precomputation-based architectures (such as Contraction Hierarchies and Customizable Route Planning) to do so quickly. The reason for this dependence is the large size of the graph that represents the road network. The need to go beyond minimizing travel time and introduce various types of customizations has led to approaches that rely on alternative route computation or, more generally, small subgraph extraction. On a small subgraph, one is able to run computationally expensive algorithms at query time and compute optimal solutions for multiple routing problems. In this framework, it is critical for the subgraph to a) be small and b) include (near-)optimal routes for a collection of customizations. This is precisely the setting that we study in this work. We design algorithms that extract a subgraph connecting designated terminals, with the objective of minimizing the subgraph's size and the constraint of including near-optimal routes for a set of predefined cost functions. We provide theoretical guarantees for our algorithms and evaluate them empirically using real-world road networks.
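    A baseline version of this setting can be sketched with networkx: take the union of shortest paths between terminal pairs under each cost function (the paper's algorithms target smaller subgraphs with near-optimality guarantees; this only illustrates the problem):

        import networkx as nx

        def extract_subgraph(G, terminals, cost_functions):
            """Union of shortest paths between all terminal pairs under each
            cost function (an edge-attribute name). Returns the induced subgraph.
            """
            edges = set()
            for weight in cost_functions:  # e.g. ["time", "distance", "tolls"]
                for s in terminals:
                    for t in terminals:
                        if s == t:
                            continue
                        path = nx.shortest_path(G, s, t, weight=weight)
                        edges.update(zip(path, path[1:]))
            return G.edge_subgraph(edges)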
    This paper discusses a method for injecting text when training an ASR system, without the need to upsample the text sequence to match the length of the speech sequence.
    V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
    Chris Donahue
    Dima Kuzmin
    Judith Li
    Kun Su
    Mauro Verzetti
    Qingqing Huang
    Yu Wang
    Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
    Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
    Capacitive touch sensors capture the two-dimensional spatial profile (referred to as a touch heatmap) of a finger's contact with a mobile touchscreen. However, research and design of touchscreen mobile keyboards - one of the most speed- and accuracy-demanding touch interfaces - has focused on the location of the touch centroid derived from the touch heatmap as the input, discarding the rest of the raw spatial signals. In this paper, we investigate whether touch heatmaps can be leveraged to further improve the tap decoding accuracy for mobile touchscreen keyboards. Specifically, we compared machine-learning models that decode user taps using the centroids and/or the heatmaps as their input and studied the contribution of the heatmap. The results show that adding the heatmap to the input feature set led to a 21.4% relative reduction in character error rate on average, compared to using the centroid alone. Furthermore, we conducted online deployment testing of the heatmap-based decoder in a user study with 16 participants and observed a lower error rate, faster typing speed, and higher self-reported satisfaction with the heatmap-based decoder than with the centroid-based decoder. These findings underline the promise of utilizing touch heatmaps for improving the typing experience on mobile keyboards.
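    The two decoder families compared in this study can be sketched as follows (layer sizes, the 16x16 heatmap crop, and the 26-key output are illustrative assumptions, not the study's actual models):

        import torch
        import torch.nn as nn

        centroid_decoder = nn.Sequential(   # input: (x, y) touch centroid only
            nn.Linear(2, 64), nn.ReLU(),
            nn.Linear(64, 26),              # logits over candidate keys
        )

        heatmap_decoder = nn.Sequential(    # input: raw capacitive touch heatmap
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(8 * 16 * 16, 26),     # assumes a 16x16 heatmap crop
        )

        centroid_logits = centroid_decoder(torch.rand(1, 2))
        heatmap_logits = heatmap_decoder(torch.rand(1, 1, 16, 16))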
    Drug Design on Quantum Computers
    Raffaele Santagati
    Alán Aspuru-Guzik
    Matthias Degroote
    Leticia Gonzalez
    Elica Kyoseva
    Nikolaj Moll
    Markus Oppel
    Robert Parrish
    Michael Streif
    Christofer Tautermann
    Horst Weiss
    Nathan Wiebe
    Clemens Utschig-Utschig
    Nature Physics (2024)
    The promised industrial applications of quantum computers often rest on their anticipated ability to perform accurate, efficient quantum chemical calculations. Computational drug discovery relies on accurate predictions of how candidate drugs interact with their targets in a cellular environment involving several thousands of atoms at finite temperatures. Although quantum computers are still far from being used as daily tools in the pharmaceutical industry, here we explore the challenges and opportunities of applying quantum computers to drug design. We discuss where these could transform industrial research and identify the substantial further developments needed to reach this goal.
    This is the seventh installment of the Developer Productivity for Humans column. This installment focuses on software quality: what it means, how developers see it, how we break it down into four types of quality, and the impact these types have on each other.
    Motivated by recent advances in large language models for NLP, we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of datasets matches the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series dataset, and can work well across different forecasting history lengths, prediction lengths, and temporal granularities.
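    The input patching used by patched-decoder-style forecasters can be sketched as follows (the patch length is an illustrative choice):

        import numpy as np

        def patch_series(series, patch_len):
            """Split a 1-D history into fixed-length patches; each patch is
            embedded as one decoder input token in a patched-decoder model.
            """
            n = (len(series) // patch_len) * patch_len  # drop the ragged tail
            return np.asarray(series[:n]).reshape(-1, patch_len)

        history = np.arange(96, dtype=float)  # stand-in univariate history
        patches = patch_series(history, patch_len=32)
        print(patches.shape)                  # (3, 32) -> 3 decoder tokens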