Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
1 - 15 of 10133 publications
USER-LLM: Efficient LLM Contextualization with User Embedding
Jiaxing Wu
Neo Wu
Devora Berlowitz
Sushant Prakash
Bradley Green
Shawn O'Banion
Jun Xie
ArXiv (2024) (to appear)
Large language models (LLMs) have revolutionized natural language processing. However, effectively incorporating complex and potentially noisy user interaction data remains a challenge. To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs. These embeddings, distilled from diverse user interactions using self-supervised pretraining, capture latent user preferences and their evolution over time. We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context. Our comprehensive experiments on MovieLens, Amazon Review, and Google Local Review datasets demonstrate significant performance gains across various tasks. Notably, our approach outperforms text-prompt-based contextualization on long sequence tasks and tasks that require deep user understanding while being computationally efficient. We further incorporate Perceiver layers to streamline the integration between user encoders and LLMs, reducing computational demands.
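As a rough illustration of the cross-attention idea mentioned in the abstract, the sketch below lets each token state attend over a small set of user embeddings. It is a minimal, dependency-free sketch and not the paper's implementation: the function names, the identity query/key/value projections, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(token_states, user_embeddings):
    """Each token state (query) attends over the user embeddings (keys/values).

    Assumption: identity projections, so queries, keys, and values are the
    raw vectors. A real model would use learned projection matrices.
    """
    d = len(user_embeddings[0])
    outputs, all_weights = [], []
    for q in token_states:
        # Scaled dot-product score between this token and every user embedding.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in user_embeddings]
        weights = softmax(scores)
        # Context vector: attention-weighted average of the user embeddings.
        context = [sum(w * v[j] for w, v in zip(weights, user_embeddings))
                   for j in range(d)]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy vectors, chosen only to make the behaviour easy to inspect.
token_states = [[1.0, 0.0], [0.0, 1.0]]
user_embeddings = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
outputs, weights = cross_attention(token_states, user_embeddings)
```

Each output row is a convex combination of the user embeddings, so the token representation is pulled toward whichever user embeddings it matches best.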
Enhancing Trust and Safety in Digital Payments: An LLM-Powered Approach
Anant Modwal
Govind Kaushal
Ramanan Balakrishnan
Shanay Shah
Monu Agrawal
Justin Lin
Prakash Hariramani
Priya Mandawat
Rutvik Karve
Naveen Madiraju
Digital payment systems have revolutionized financial transactions, offering unparalleled convenience and accessibility to users worldwide. However, the increasing popularity of these platforms has also attracted malicious actors seeking to exploit their vulnerabilities for financial gain. To address this challenge, robust and adaptable scam detection mechanisms are crucial for maintaining the trust and safety of digital payment ecosystems. This paper presents a comprehensive approach to scam detection, focusing on the Unified Payments Interface (UPI) in India, with Google Pay (GPay) as a specific use case. The approach leverages Large Language Models (LLMs) to enhance scam classification accuracy and designs a digital assistant to aid human reviewers in identifying and mitigating fraudulent activities. The results demonstrate the potential of LLMs in augmenting existing machine learning models and improving the efficiency, accuracy, quality, and consistency of scam reviews, ultimately contributing to a safer and more secure digital payment landscape. Our evaluation of the Gemini Ultra model on curated transaction data showed 93.33% accuracy in scam classification. Furthermore, the model demonstrated 89% accuracy in generating reasoning for these classifications. Promisingly, the model identified 32% new accurate reasons for suspected scams that human reviewers had not included in the review notes.
SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
Delitzas Alexandros
Ayça Takmaz
Marc Pollefeys
Francis Engelmann
CVPR (2024) (to appear)
Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However, identifying objects and their parts only constitutes an intermediate step towards a more fine-grained goal, which is effectively interacting with the functional interactive elements (e.g., handles, knobs, buttons) in the scene to accomplish diverse tasks. To this end, we introduce SceneFun3D, a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information, describing how to interact with these elements, and a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset, we introduce three novel tasks, namely functionality segmentation, task-driven affordance grounding and 3D motion estimation, and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding methods.
Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers
Phil A. Chou
Hugues Hoppe
Danhang Tang
Jonathan Taylor
Philip Davidson
arXiv:2402.05887 (2024)
We propose sandwiching standard image and video codecs between pre- and post-processing neural networks. The networks are jointly trained through a differentiable codec proxy to minimize a given rate-distortion loss. This sandwich architecture not only improves the standard codec’s performance on its intended content, it can effectively adapt the codec to other types of image/video content and to other distortion measures. Essentially, the sandwich learns to transmit “neural code images” that optimize overall rate-distortion performance even when the overall problem is well outside the scope of the codec’s design. Through a variety of examples, we apply the sandwich architecture to sources with different numbers of channels, higher resolution, higher dynamic range, and perceptual distortion measures. The results demonstrate substantial improvements (up to 9 dB gains or up to 3 adaptations. We derive VQ equivalents for the sandwich, establish optimality properties, and design differentiable codec proxies approximating current standard codecs. We further analyze model complexity, visual quality under perceptual metrics, as well as sandwich configurations that offer interesting potentials in image/video compression and streaming.
The state of the art for training on-device language models for mobile keyboard applications combines federated learning (FL) with differential privacy (DP) via the DP-Follow-the-Regularized-Leader (DP-FTRL) algorithm. Two variants of DP-FTRL are used in practice: tree aggregation and matrix factorization. However, tree aggregation suffers from significantly suboptimal privacy/utility tradeoffs, while matrix mechanisms require expensive optimization parameterized by hard-to-estimate-in-advance constants and incur high runtime memory costs. This paper extends the recently introduced Buffered Linear Toeplitz (BLT) mechanism to multi-participation scenarios. Our BLT-DP-FTRL maintains the ease-of-use advantages of tree aggregation, while essentially matching matrix factorization in terms of utility and privacy. We evaluate BLT-DP-FTRL on the StackOverflow dataset, serving as a reproducible simulation benchmark, and across four on-device language model tasks in a production FL system. Our empirical results highlight the advantages of the BLT mechanism and elevate the practicality and effectiveness of DP in real-world scenarios.
Optimal Mechanisms for a Value Maximizer: The Futility of Screening Targets
Proceedings of the 25th ACM Conference on Economics and Computation (EC) (2024)
Motivated by the increased adoption of autobidding algorithms in internet advertising markets, we study the design of optimal mechanisms for selling items to a value-maximizing buyer with a return-on-spend constraint. The buyer's values and target ratio in the return-on-spend constraint are private. We restrict attention to deterministic sequential screening mechanisms that can be implemented as a menu of prices paid for purchasing an item or not. The main result of this paper is to provide a characterization of an optimal mechanism. Surprisingly, we show that the optimal mechanism does not require target screening, i.e., offering a single pair of prices is optimal for the seller. The optimal mechanism is a subsidized posted price that provides a subsidy to the buyer to encourage participation and then charges a fixed unit price for each item sold. The seller's problem is a challenging non-linear mechanism design problem, and a key technical contribution of our work is to provide a novel approach to analyze non-linear pricing contracts.
Simple, sufficient explanations furnished by short decision lists can be useful for guiding stakeholder actions. Unfortunately, this transparency can come at the expense of the higher accuracy enjoyed by black box methods, like deep nets. To date, practitioners typically either (i) insist on the simpler model, forsaking accuracy; or (ii) insist on maximizing accuracy, settling for post-hoc explanations of dubious faithfulness. In this paper, we propose a hybrid partially interpretable model that represents a compromise between the two extremes. In our setup, each input is first processed by a decision list that can either execute a decision or abstain, handing off authority to the opaque model. The key to optimizing the decision list is to optimally trade off the accuracy of the composite system against coverage (the fraction of the population that receives explanations). We contribute a new principled algorithm for constructing partially interpretable decision lists, providing theoretical guarantees addressing both interpretability and accuracy. As an instance of our result, we prove that when the optimal decision list has length $k$, coverage $c$, and $b$ mistakes, our algorithm will generate a decision list that has length no greater than $4k$, coverage at least $c/2$, and makes at most $4b$ mistakes. Finally, we empirically validate the effectiveness of the new model.
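The hand-off behaviour described in this abstract can be sketched in a few lines. This is only an illustrative sketch of the prediction and evaluation side (a decision list that decides or abstains, plus coverage and accuracy of the composite system); it does not implement the paper's construction algorithm. The rule format, the `black_box` stand-in, and the toy data are all hypothetical.

```python
from typing import Callable, Sequence

# A rule pairs a condition with the label it assigns when the condition holds.
Rule = tuple[Callable[[dict], bool], int]


def hybrid_predict(x: dict, rules: Sequence[Rule],
                   black_box: Callable[[dict], int]) -> tuple[int, bool]:
    """Return (prediction, explained) for one input."""
    for condition, label in rules:
        if condition(x):
            return label, True       # the decision list executes a decision
    return black_box(x), False       # the list abstains and hands off


def evaluate(data, labels, rules, black_box):
    """Return (coverage, accuracy) of the composite system."""
    results = [hybrid_predict(x, rules, black_box) for x in data]
    # Coverage: fraction of inputs decided (and so explained) by the list.
    coverage = sum(explained for _, explained in results) / len(data)
    accuracy = sum(pred == y for (pred, _), y in zip(results, labels)) / len(data)
    return coverage, accuracy


# Hypothetical toy setup: features, thresholds, and labels are illustrative.
rules = [
    (lambda x: x["income"] < 20, 0),
    (lambda x: x["income"] > 80, 1),
]
black_box = lambda x: int(x["income"] + x["savings"] > 70)

data = [
    {"income": 10, "savings": 5},
    {"income": 90, "savings": 0},
    {"income": 50, "savings": 30},
    {"income": 40, "savings": 10},
]
labels = [0, 1, 1, 0]

coverage, accuracy = evaluate(data, labels, rules, black_box)
print(coverage, accuracy)
```

In this toy setup the list decides the first two inputs and abstains on the other two, so half of the population receives an explanation while the opaque model handles the rest.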
In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. In the paper we: (i) illustrate the problem of L2 loss training failing to capture the continuous nature of the mean opinion score (MOS) labels; (ii) demonstrate the lack of generalisation through a benchmarking evaluation across several speech domains; (iii) outline our approach and explore the impact of the architectural design decisions through incremental evaluation; (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ. We conclude that using a triplet loss function
This paper discusses a method to inject text when training an ASR system without the need for upsampling the text sequence to match the length of the speech sequence.
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Chris Donahue
Dima Kuzmin
Judith Li
Kun Su
Mauro Verzetti
Qingqing Huang
Yu Wang
Vol. 38 No. 5: AAAI-24 Technical Tracks 5, AAAI Press (2024), pp. 4952-4960
Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
Can Capacitive Touch Images Enhance Mobile Keyboard Decoding?
Billy Dou
Cedric Ho
Proceedings of UIST 2024 (2024)
Capacitive touch sensors capture the two-dimensional spatial profile (referred to as a touch heatmap) of a finger's contact with a mobile touchscreen. However, the research and design of touchscreen mobile keyboards - one of the most speed- and accuracy-demanding touch interfaces - has focused on the location of the touch centroid derived from the touch image heatmap as the input, discarding the rest of the raw spatial signals. In this paper, we investigate whether touch heatmaps can be leveraged to further improve the tap decoding accuracy for mobile touchscreen keyboards. Specifically, we compared machine-learning models that decode user taps by using the centroids and/or the heatmaps as their input and studied the contribution due to the heatmap. The results show that adding the heatmap to the input feature set led to a 21.4% relative reduction in character error rate on average, compared to using the centroid alone. Furthermore, we conducted online deployment testing of the heatmap-based decoder in a user study with 16 participants and observed a lower error rate, faster typing speed, and a higher self-reported satisfaction score with the heatmap-based decoder than with the centroid-based decoder. These findings underline the promise of utilizing touch heatmaps for improving the typing experience in mobile keyboards.
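To make concrete what reducing a heatmap to a centroid discards, the sketch below computes a centroid as the intensity-weighted mean position of a small heatmap. The weighted-mean definition, the function name `touch_centroid`, and the 3x3 grids are assumptions for illustration; the abstract does not specify how the centroid is derived.

```python
def touch_centroid(heatmap):
    """Intensity-weighted mean (row, col) of a 2D touch heatmap.

    Assumption: the centroid is a plain weighted average of cell positions.
    """
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        raise ValueError("heatmap has no signal")
    r = sum(i * v for i, row in enumerate(heatmap) for v in row) / total
    c = sum(j * v for row in heatmap for j, v in enumerate(row)) / total
    return r, c


# Two hypothetical contact shapes with the same total intensity.
round_tap = [
    [0, 1, 0],
    [1, 4, 1],
    [0, 1, 0],
]
wide_tap = [
    [0, 0, 0],
    [2, 4, 2],
    [0, 0, 0],
]

print(touch_centroid(round_tap), touch_centroid(wide_tap))
```

Both toy heatmaps yield the same centroid even though the contact shapes differ, which is the kind of spatial information a centroid-only decoder never sees.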
Drug Design on Quantum Computers
Raffaele Santagati
Alán Aspuru-Guzik
Matthias Degroote
Leticia Gonzalez
Elica Kyoseva
Nikolaj Moll
Markus Oppel
Robert Parrish
Michael Streif
Christofer Tautermann
Horst Weiss
Nathan Wiebe
Clemens Utschig-Utschig
Nature Physics (2024)
The promised industrial applications of quantum computers often rest on their anticipated ability to perform accurate, efficient quantum chemical calculations. Computational drug discovery relies on accurate predictions of how candidate drugs interact with their targets in a cellular environment involving several thousands of atoms at finite temperatures. Although quantum computers are still far from being used as daily tools in the pharmaceutical industry, here we explore the challenges and opportunities of applying quantum computers to drug design. We discuss where these could transform industrial research and identify the substantial further developments needed to reach this goal.
This is the seventh installment of the Developer Productivity for Humans column. This installment focuses on software quality: what it means, how developers see it, how we break it down into 4 types of quality, and the impact these have on each other.
Motivated by recent advances in large language models for NLP, we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of datasets matches the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series dataset, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
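One way to picture the "patched" input of such a model is to split the history into fixed-length patches that play the role of tokens. The sketch below is only an illustration under stated assumptions: non-overlapping patches, zero left-padding for histories that do not divide evenly, and the hypothetical helper name `make_patches`. None of these details come from the abstract.

```python
def make_patches(series, patch_len, pad_value=0.0):
    """Split a series into non-overlapping patches of length patch_len.

    Assumption: histories whose length is not a multiple of patch_len are
    left-padded with pad_value so that the most recent values stay aligned.
    """
    series = list(series)
    remainder = len(series) % patch_len
    if remainder:
        series = [pad_value] * (patch_len - remainder) + series
    return [series[i:i + patch_len] for i in range(0, len(series), patch_len)]


history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
patches = make_patches(history, patch_len=3)
print(patches)  # [[0.0, 0.0, 1.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
```

Because the padding absorbs any leftover length, histories of different lengths all map to a sequence of equally sized patches.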