Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM
Alon Levkovitch
Roy Hirsch
Chulayuth Asawaroengchai
Ehud Rivlin
ICLR (2024)
Abstract
We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text data, enabling a ‘cross-modal’ chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM, as demonstrated through spoken QA datasets. We release our audio samples and spoken QA dataset via our website.
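As a concrete illustration of such a joint objective, here is a minimal sketch that sums recognition, continuation, and synthesis losses computed from a single decoder pass. The model interface and tensor shapes are assumptions for illustration, not the paper's implementation:

    import torch.nn.functional as F

    def spectron_style_loss(model, spectrogram, transcript_ids,
                            continuation_ids, target_spectrogram):
        # Hypothetical model interface: a single autoregressive pass yields
        # logits for the recognized transcript (B, T1, V), logits for its
        # text continuation (B, T2, V), and predicted spectrogram frames.
        text_logits, cont_logits, pred_spec = model(
            spectrogram, transcript_ids, continuation_ids)
        l_asr = F.cross_entropy(text_logits.transpose(1, 2), transcript_ids)
        l_cont = F.cross_entropy(cont_logits.transpose(1, 2), continuation_ids)
        l_synth = F.l1_loss(pred_spec, target_spectrogram)
        # Summing the three terms supervises the 'cross-modal' chain:
        # transcribe, continue in text, then synthesize speech.
        return l_asr + l_cont + l_synth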
Abstract
Browser fingerprinting is often associated with cross-site user tracking, a practice that many browsers (e.g., Safari, Brave, Edge, Firefox and Chrome) want to block. However, less is publicly known about its uses to enhance online safety, where it can provide an additional security layer against service abuses (e.g., in combination with CAPTCHAs) or during user authentication. To the best of our knowledge, no fingerprinting defenses deployed thus far consider this important distinction when blocking fingerprinting attempts, so they might negatively affect website functionality and security.
To address this issue we make three main contributions. First, we propose and evaluate a novel machine learning-based method to automatically identify authentication pages (i.e., sign-in and sign-up pages). Our algorithm -- which relies on a hybrid unsupervised/supervised approach -- achieves 96-98% precision and recall on a large, manually labelled dataset of 10,000 popular sites. Second, we compare our algorithm with other methods from prior work on the same dataset, showing that it significantly outperforms all of them (+83% F1-score). Third, we quantify the prevalence of fingerprinting scripts across sign-in and sign-up pages (9.2%) versus those executed on other pages (8.9%); while the rates of fingerprinting are similar, home pages and authentication pages differ in the third-party scripts they include and how often these scripts are labeled as tracking. We also highlight the substantial differences in fingerprinting behavior on login and sign-up pages.
Our work sheds light on the complicated reality that fingerprinting is used to both protect user security and invade user privacy, and that this dual nature must be considered by fingerprinting mitigations.
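To illustrate what a hybrid unsupervised/supervised pipeline of this kind might look like, here is a minimal sketch using scikit-learn; the feature set and model choices are assumptions, not the paper's actual method:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def train_auth_page_classifier(X_unlabeled, X_labeled, y_labeled):
        # Unsupervised stage: cluster all pages (labeled + unlabeled) so
        # structure in the unlabeled data can inform the supervised stage.
        km = KMeans(n_clusters=8, n_init=10)
        km.fit(np.vstack([X_unlabeled, X_labeled]))
        # Append each page's cluster id as an extra feature.
        X_aug = np.hstack([X_labeled, km.predict(X_labeled)[:, None]])
        clf = make_pipeline(StandardScaler(), GradientBoostingClassifier())
        clf.fit(X_aug, y_labeled)
        return km, clf

At prediction time, new pages would be augmented with their cluster id in the same way before being scored by the classifier.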
AI-Enhanced API Design: A New Paradigm in Usability and Efficiency
Mak Ahmad
David R Karger
Kwan-Liu Ma
CHI EA '24: Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (2024)
Abstract
This study uses mixed methods to evaluate API design methods, focusing on the design and consumption phases. Our goal was to understand the impact of API governance approaches on productivity and usability. A controlled developer experiment (n=34) demonstrated a 10% increase in requirement fulfillment when using API Improvement Proposals (AIPs) and a linter versus no protocols. Meanwhile, 73% of 33 surveyed API consumers preferred AIP-aligned designs for their enhanced usability and comprehensibility. Complementing this, a custom large language model called the API Architect received average expert ratings of just 5/10 for specification quality, revealing gaps relative to manual design. The quantitative performance metrics, combined with qualitative user feedback, provide evidence from multiple angles that strategically integrating industry best practices with maturing AI capabilities can meaningfully improve API design outcomes. This research offers empirical insights from developer and consumer perspectives to advance scholarly discourse and industry practice regarding optimal API design workflows.
Context-aware Transliteration of Romanized South Asian Languages
Christo Kirov
Computational Linguistics, 50 (2) (2024), 475–534
Abstract
While most transliteration research is focused on single tokens such as named entities -- e.g., transliteration of "અમદાવાદ" from the Gujarati script to the Latin script "Ahmedabad" -- the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full-sentence (as opposed to single-word) transliterations necessitates incorporating contextual information into transliteration via non-parallel resources, such as mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models finetuned on simulated parallel data, yield substantial improvements over the best previously reported results for full-sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 4.8% absolute (27.1% relative) mean word-error rate reduction.
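As a quick sanity check on the reported numbers: a 4.8-point absolute drop that corresponds to a 27.1% relative reduction implies a baseline mean word-error rate of roughly 4.8 / 0.271 ≈ 17.7%, and a post-improvement rate of about 12.9%:

    absolute_drop = 4.8            # percentage points
    relative_drop = 0.271          # fraction of the baseline
    baseline_wer = absolute_drop / relative_drop   # ≈ 17.7%
    new_wer = baseline_wer - absolute_drop         # ≈ 12.9%
    print(f"baseline ≈ {baseline_wer:.1f}%, improved ≈ {new_wer:.1f}%")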
Abstract
We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of tasks, we propose to recursively re-use the generated weights as input to the HT for the next task. This way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past tasks. This approach differs from most continual learning algorithms, which typically rely on replay buffers, weight regularization, or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method, equipped with a prototypical loss, is capable of learning and retaining knowledge about past tasks in a variety of scenarios, including learning from mini-batches as well as task-incremental and class-incremental learning.
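The recursion described above is easy to state in code. The following sketch, with hypothetical module interfaces, shows generated weights being fed back into the HyperTransformer for each new task:

    def continual_ht_episode(hyper_transformer, cnn, task_stream, proto_loss):
        # `weights` doubles as the model for the current task and as a
        # compressed memory of all tasks seen so far.
        weights = None  # the first task starts from scratch
        for support_set, query_images, query_labels in task_stream:
            # Generate task-specific CNN weights, conditioned on the
            # previously generated weights (the recursive re-use).
            weights = hyper_transformer(support_set, prev_weights=weights)
            logits = cnn(query_images, weights)
            loss = proto_loss(logits, query_labels)
            loss.backward()  # optimizer step omitted in this sketch
        return weights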
Enhancing Trust and Safety in Digital Payments: An LLM-Powered Approach
Anant Modwal
Govind Kaushal
Ramanan Balakrishnan
Shanay Shah
Monu Agrawal
Justin Lin
Prakash Hariramani
Priya Mandawat
Rutvik Karve
Naveen Madiraju
Abstract
Digital payment systems have revolutionized financial transactions, offering unparalleled convenience and accessibility to users worldwide. However, the increasing popularity of these platforms has also attracted malicious actors seeking to exploit their vulnerabilities for financial gain. To address this challenge, robust and adaptable scam detection mechanisms are crucial for maintaining the trust and safety of digital payment ecosystems. This paper presents a comprehensive approach to scam detection, focusing on the Unified Payments Interface (UPI) in India, with Google Pay (GPay) as a specific use case. The approach leverages Large Language Models (LLMs) to enhance scam classification accuracy and designs a digital assistant to aid human reviewers in identifying and mitigating fraudulent activities. The results demonstrate the potential of LLMs in augmenting existing machine learning models and improving the efficiency, accuracy, quality, and consistency of scam reviews, ultimately contributing to a safer and more secure digital payment landscape. Our evaluation of the Gemini Ultra model on curated transaction data showed 93.33% accuracy in scam classification. Furthermore, the model demonstrated 89% accuracy in generating reasoning for these classifications. Promisingly, the model identified 32% new accurate reasons for suspected scams that human reviewers had not included in their review notes.
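A minimal sketch of the review-assistance step might look as follows; the prompt wording and the LLM interface are assumptions for illustration, not the paper's system:

    PROMPT = (
        "You are assisting a payments trust-and-safety reviewer.\n"
        "Transaction details:\n{details}\n"
        "Reply with SCAM or NOT_SCAM on the first line, then one short\n"
        "paragraph of reasoning for the human reviewer."
    )

    def review_transaction(llm, details: str):
        # `llm.generate` is a hypothetical text-completion interface.
        reply = llm.generate(PROMPT.format(details=details))
        label, _, reasoning = reply.partition("\n")
        return label.strip(), reasoning.strip()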
Abstract
Reinforcement learning can be a useful tool for solving combinatorial problems, even in the presence of constraints. This presentation details two use cases: an industrial application in the field of logistics, and a more abstract problem in combinatorial optimization.
StreamVC: Real-Time Low-Latency Voice Conversion
Jiuqiang Tang
Xing Li
ICASSP 2024 (2024)
Abstract
We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre of any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal, even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and to use cases such as voice anonymization in those settings. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight, high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
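To illustrate the whitening step, here is a minimal per-utterance F0 normalization sketch (our own illustration, assuming unvoiced frames are encoded as zeros): whitening removes the speaker's absolute pitch level and range, keeping only the contour:

    import numpy as np

    def whiten_f0(f0: np.ndarray) -> np.ndarray:
        # Normalize voiced frames to zero mean and unit variance so the
        # conditioning signal conveys the pitch contour without leaking
        # the source speaker's characteristic pitch (a timbre cue).
        voiced = f0 > 0
        mu = f0[voiced].mean()
        sigma = f0[voiced].std() + 1e-8
        out = np.zeros_like(f0)
        out[voiced] = (f0[voiced] - mu) / sigma
        return out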
Artificial intelligence as a second reader for screening mammography
Etsuji Nakai
Alessandro Scoccia Pappagallo
Hiroki Kayama
Lin Yang
Shawn Xu
Christopher Kelly
Timo Kohlberger
Daniel Golden
Akib Uddin
Radiology Advances, 1(2) (2024)
Abstract
Background
Artificial intelligence (AI) has shown promise in mammography interpretation, and its use as a second reader in breast cancer screening may reduce the burden on health care systems.
Purpose
To evaluate the performance differences between routine double read and an AI as a second reader workflow (AISR), where the second reader is replaced with AI.
Materials and Methods
A cohort of patients undergoing routine breast cancer screening at a single center with mammography was retrospectively collected between 2005 and 2021. A model developed on US and UK data was fine-tuned on Japanese data. We subsequently performed a reader study with 10 qualified readers with varied experience (5 reader pairs), comparing routine double read to an AISR workflow.
Results
A “test set” of 4,059 women (mean age, 56 ± 14 years; 157 positive, 3,902 negative) was collected, with 278 (mean age, 55 ± 13 years; 90 positive, 188 negative) evaluated in the reader study. We demonstrate an area under the curve of 0.84 (95% confidence interval [CI], 0.805-0.881) on the test set, with no significant difference from decisions made in clinical practice (P = .32). Compared with routine double reading, in the AISR arm, sensitivity improved by 7.6% (95% CI, 3.80-11.4; P = .00004) and specificity decreased by 3.4% (95% CI, 1.42-5.43; P = .0016), with 71% (212/298) of scans no longer requiring input from a second reader. Variation in recall decisions between reader pairs improved from a Cohen kappa of κ = .65 (95% CI, 0.61-0.68) to κ = .74 (95% CI, 0.71-0.77) in the AISR arm.
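The headline workload reduction follows from simple agreement logic. Here is a sketch of one plausible AISR decision rule; this is our illustration, and the study's exact arbitration protocol may differ:

    def aisr_decision(first_reader_recall: bool, ai_recall: bool, arbitrate):
        # If the first human reader and the AI agree, the shared call
        # stands and no second human reader is needed; disagreements are
        # escalated, mirroring arbitration in routine double reading.
        if first_reader_recall == ai_recall:
            return first_reader_recall
        return arbitrate(first_reader_recall, ai_recall)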
Security Signals: Making Web Security Posture Measurable At Scale
David Dworken
Artur Janc
Santiago (Sal) Díaz
(2024) (to appear)
Abstract
The area of security measurability is gaining increased attention, with a wide range of organizations calling for the development of scalable approaches for assessing the security of software systems and infrastructure. In this paper, we present our experience developing Security Signals, a comprehensive system providing security measurability for web services, deployed in a complex application ecosystem of thousands of web services handling traffic from billions of users. The system collects security-relevant information from production HTTP traffic at the reverse proxy layer, utilizing novel concepts such as synthetic signals augmented with additional risk information to provide a holistic view of the security posture of individual services and the broader application ecosystem. This approach to measurability has enabled large-scale security improvements to our services, including allowing prioritized rollouts of security enhancements and the implementation of automated regression monitoring; it has proven valuable for security research and prioritization of defensive work. Security Signals addresses shortcomings of prior web measurability proposals by tracking a comprehensive set of security properties relevant to web applications, and by extracting insights from collected data for use by both security experts and non-experts. We believe the lessons learned from the implementation and use of Security Signals offer valuable insights for practitioners responsible for web service security, potentially inspiring new approaches to web security measurability.
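As an illustration of signal collection at the reverse-proxy layer (not the paper's implementation; the header choices and the synthetic-signal rule below are assumptions):

    SIGNAL_HEADERS = {
        "content-security-policy": "has_csp",
        "strict-transport-security": "has_hsts",
        "x-content-type-options": "has_nosniff",
        "cross-origin-opener-policy": "has_coop",
    }

    def extract_signals(response_headers: dict) -> dict:
        headers = {k.lower(): v for k, v in response_headers.items()}
        signals = {name: key in headers
                   for key, name in SIGNAL_HEADERS.items()}
        # A synthetic signal combines raw observations with risk context,
        # e.g., flagging responses that set cookies but lack a CSP.
        signals["sets_cookie_without_csp"] = (
            "set-cookie" in headers and not signals["has_csp"]
        )
        return signals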
Expressing and Analyzing Quantum Algorithms with Qualtran
Charles Yuan
Anurudh Peduri
arXiv:2409.04643 (2024)
Abstract
Quantum computing's transition from theory to reality has spurred the need for novel software tools to manage the increasing complexity, sophistication, toil, and chance for error of quantum algorithm development. We present Qualtran, an open-source library for representing and analyzing quantum algorithms. Using carefully chosen abstractions and data structures, we can simulate and test algorithms, automatically generate information-rich diagrams, and tabulate resource requirements. Qualtran offers a standard library of algorithmic building blocks that are essential for modern cost-minimizing compilations. Its capabilities are showcased through the re-analysis of key algorithms in Hamiltonian simulation, chemistry, and cryptography. The resulting architecture-independent resource counts can be forwarded to our implementation of cost models to estimate physical costs like wall-clock time and number of physical qubits assuming a surface-code architecture. Qualtran provides a foundation for explicit constructions and reproducible analysis, fostering greater collaboration within the quantum algorithm development community. We believe tools like Qualtran will accelerate progress in the field.
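To give a flavor of what recursive resource tabulation involves, here is a generic sketch; note this is not Qualtran's actual API, just an illustration of counting leaf gates by decomposing composite operations:

    from collections import Counter

    def count_gates(op, decompose) -> Counter:
        # `decompose(op)` returns the sub-operations of a composite op,
        # or an empty list if `op` is a leaf gate.
        subs = decompose(op)
        if not subs:
            return Counter([op])
        total = Counter()
        for sub in subs:
            total += count_gates(sub, decompose)
        return total

Architecture-independent counts produced this way can then be fed into a cost model to estimate physical resources.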
Abstract
This is the seventh installment of the Developer Productivity for Humans column. It focuses on software quality: what it means, how developers see it, how we break it down into four types of quality, and the impact these types have on each other.
UniAR: A Unified model for predicting human Attention and Responses on visual content
Peizhao Li
Gang Li
Rachit Bhargava
Shaolei Shen
Youwei Liang
Hongxiang Gu
Venky Ramachandran
Golnaz Farhadi
Abstract
Progress in human behavior modeling involves understanding both implicit, early-stage perceptual behavior, such as human attention, and explicit, later-stage behavior, such as subjective preferences or likes. Yet most prior research has modeled implicit and explicit human behavior in isolation, and is often limited to a specific type of visual content. We propose UniAR – a unified model of human attention and preference behavior across diverse visual content. UniAR leverages a multimodal transformer to predict subjective feedback, such as satisfaction or aesthetic quality, along with the underlying human attention or interaction heatmaps and viewing order. We train UniAR on diverse public datasets spanning natural images, webpages, and graphic designs, and achieve state-of-the-art performance on multiple benchmarks across various image domains and behavior modeling tasks. Potential applications include providing instant feedback on the effectiveness of UIs/visual content, and enabling designers and content-creation models to optimize their creations for human-centric improvements.
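Here is a sketch of the multi-head output structure such a unified model might use; the module names, dimensions, and pooling choices are assumptions, not the actual UniAR architecture:

    import torch.nn as nn

    class UnifiedBehaviorHeads(nn.Module):
        # One shared multimodal encoder feeds three heads: a per-token
        # attention heatmap, a per-token viewing-order score, and a
        # pooled scalar preference rating.
        def __init__(self, encoder: nn.Module, d_model: int = 768):
            super().__init__()
            self.encoder = encoder
            self.heatmap_head = nn.Linear(d_model, 1)
            self.order_head = nn.Linear(d_model, 1)
            self.rating_head = nn.Linear(d_model, 1)

        def forward(self, tokens):
            h = self.encoder(tokens)                    # (B, T, d_model)
            heatmap = self.heatmap_head(h).squeeze(-1)  # (B, T)
            order = self.order_head(h).squeeze(-1)      # (B, T)
            rating = self.rating_head(h.mean(dim=1))    # (B, 1)
            return heatmap, order, rating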
FrameQuant: Flexible Low-Bit Quantization for Transformers
Harshavardhan Adepu
Zhanpeng Zeng
Vikas Singh
International Conference on Machine Learning (2024)
Abstract
Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so serving such models is expensive, often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, denoising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
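A toy illustration of the core idea: quantize in a redundant Parseval frame representation rather than raw weight space, so quantization noise is spread across coefficients and partially cancelled at reconstruction. This is our own simplification, not the paper's exact fusion-frame construction:

    import numpy as np

    def frame_quantize(W, redundancy=2, bits=2, seed=0):
        n = W.shape[0]
        m = redundancy * n
        rng = np.random.default_rng(seed)
        # Parseval frame F (m x n): first n columns of a random orthogonal
        # matrix, so F.T @ F = I and synthesis is simply F.T.
        Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
        F = Q[:, :n]
        Y = F @ W                       # analysis: redundant coefficients
        levels = 2 ** bits
        scale = np.abs(Y).max() / (levels / 2)
        Yq = np.clip(np.round(Y / scale),
                     -levels // 2, levels // 2 - 1) * scale
        return F.T @ Yq                 # synthesis: reconstruct weights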
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Xiuye Gu
Jonathan Huang
Grant Schindler
Rachel Hornung
Vighnesh Birodkar
Jimmy Yan
Ming-Chang Chiu
Hassan Akbari
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Kihyuk Sohn
Xuan Yang
Huisheng Wang
Lu Jiang
ICML (2024)
Abstract
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
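Here is a sketch of how heterogeneous inputs can be framed as one next-token-prediction problem; the token layout and separator ids below are assumptions for illustration, not VideoPoet's actual vocabulary:

    def build_multimodal_sequence(text_tokens, image_tokens,
                                  video_tokens, audio_tokens,
                                  BOS=0, SEP=1):
        # Concatenate modality segments, delimited by separator tokens,
        # into a single stream for a decoder-only transformer trained
        # with standard causal next-token prediction.
        seq = [BOS]
        for segment in (text_tokens, image_tokens, video_tokens, audio_tokens):
            seq.extend(segment)
            seq.append(SEP)
        return seq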