Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10499 publications
LeakyFeeder: In-Air Gesture Control Through Leaky Acoustic Waves
Yongjie Yang
Tao Chen
Zhenlin An
Shirui Cao
Shangguan Longfei
SenSys 2025 - The 23rd ACM Conference on Embedded Networked Sensor Systems (2025)
Abstract
We present LeakyFeeder, a mobile application that explores the acoustic signals leaked from headphones to reconstruct gesture motions around the ear for fine-grained gesture control. To achieve this goal, LeakyFeeder reuses the speaker and feed-forward microphones on active noise cancellation (ANC) headphones as a SONAR system, emitting an inaudible frequency-modulated continuous-wave (FMCW) signal to track gesture reflections over time. Since this single-receiver SONAR system is unable to differentiate reflection angles and further disentangle signal reflections from different gesture parts, we draw on principles of multi-modal learning to frame gesture motion reconstruction as a multi-modal translation task and propose a deep learning-based approach to fill the information gap between low-dimensional FMCW ranging readings and high-dimensional 3D hand movements. We implement LeakyFeeder on a pair of Google Pixel Buds and conduct experiments to examine the efficacy and robustness of LeakyFeeder in various conditions. Experiments based on six gesture types inspired by Apple Vision Pro demonstrate that LeakyFeeder achieves a PCK performance of 89% at 3cm across ten users, with an average MPJPE and MPJRPE error of 2.71cm and 1.88cm, respectively.
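The FMCW ranging step described above can be illustrated with a short sketch: mixing the emitted chirp with the microphone recording produces a beat tone whose frequency maps to reflection range. Everything below (function name, sample rate, chirp parameters) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def fmcw_range_profile(tx_chirp, rx_signal, fs, bandwidth, sweep_time, c=343.0):
    """Estimate a range profile for one FMCW sweep (illustrative sketch).

    tx_chirp / rx_signal: emitted chirp and microphone recording (same length),
    fs: sample rate in Hz, bandwidth: chirp sweep bandwidth in Hz,
    sweep_time: chirp duration in seconds, c: speed of sound in m/s.
    """
    # De-chirping: multiplying TX by RX produces a low-frequency beat tone whose
    # frequency is proportional to the round-trip delay of each reflection.
    beat = tx_chirp * rx_signal
    n = 8 * len(beat)  # zero-pad the FFT for finer frequency resolution
    spectrum = np.abs(np.fft.rfft(beat * np.hanning(len(beat)), n=n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Beat frequency -> one-way range: R = c * f_beat * T_sweep / (2 * B)
    ranges = c * freqs * sweep_time / (2.0 * bandwidth)
    return ranges, spectrum

# Synthetic example: an 18-22 kHz chirp and a single echo ~0.3 ms later (~5 cm away).
fs, T, B, f0 = 48_000, 0.01, 4_000, 18_000
t = np.arange(int(fs * T)) / fs
tx = np.cos(2 * np.pi * (f0 * t + 0.5 * (B / T) * t ** 2))
delay = int(0.0003 * fs)
rx = 0.2 * np.concatenate([np.zeros(delay), tx[:-delay]])
ranges, profile = fmcw_range_profile(tx, rx, fs, B, T)
print(f"strongest reflection at ~{ranges[np.argmax(profile)]:.3f} m")
```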
Abstract
Cardinality sketches are compact data structures that efficiently estimate the number of distinct elements across multiple queries while minimizing storage, communication, and computational costs. However, recent research has shown that these sketches can fail under adaptively chosen queries, breaking down after approximately $\tilde{O}(k^2)$ queries, where $k$ is the sketch size. In this work, we overcome this quadratic barrier by designing robust estimators with fine-grained guarantees. Specifically, our constructions can handle an exponential number of adaptive queries, provided that each element participates in at most $\tilde{O}(k^2)$ queries. This effectively shifts the quadratic barrier from the total number of queries to the number of queries sharing the same element, which can be significantly smaller. Beyond cardinality sketches, our approach expands the toolkit for robust algorithm design.
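For readers unfamiliar with cardinality sketches, a minimal k-minimum-values (KMV) sketch of size k illustrates the kind of object being attacked and defended here. This is the standard non-adaptive estimator, not the robust construction from the abstract.

```python
import hashlib

class KMVSketch:
    """k-minimum-values cardinality sketch (illustrative; not the paper's robust estimator)."""
    def __init__(self, k=64):
        self.k = k
        self.mins = []  # k smallest hash values seen so far, kept sorted

    def _hash(self, item) -> float:
        h = hashlib.sha1(str(item).encode()).hexdigest()
        return int(h[:15], 16) / 16**15  # pseudo-uniform value in [0, 1)

    def add(self, item):
        v = self._hash(item)
        if len(self.mins) < self.k:
            self.mins.append(v); self.mins.sort()
        elif v < self.mins[-1] and v not in self.mins:
            self.mins[-1] = v; self.mins.sort()

    def estimate(self) -> float:
        if len(self.mins) < self.k:
            return float(len(self.mins))
        return (self.k - 1) / self.mins[-1]  # standard KMV estimator

sketch = KMVSketch(k=64)
for i in range(10_000):
    sketch.add(f"user-{i}")
print(round(sketch.estimate()))  # roughly 10,000 on non-adaptive input
```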
The Case for Leveraging Transport Signals to Improve Internet Speed Test Efficiency
Cristina Leon
Computer Communication Review (2025) (to appear)
Abstract
Internet speed tests are an important tool to enable consumers and regulators to monitor the quality of Internet access. However, increased Internet speeds to the home and an increased demand for speed testing pose scaling challenges to providers of speed tests, who must maintain costly infrastructure to keep up with this demand. In recent years, this has led the popular NDT speed test to limit data transfer to a total of 250MB, which comes at the cost of accuracy for high bandwidth speed test clients.
In this paper, we observe that the NDT speed test server’s congestion control algorithm (BBRv1) is also trying to estimate the capacity of the connection. We leverage this observation and signals from BBR to improve the accuracy and efficiency of speed tests. We first show how leveraging signals from BBR can more than double the accuracy of a 10MB test (from 17% to 43%) for clients with speeds over 400Mbps.
We then show how using BBR signals to adaptively end the speed test reduces data transfer by 36% and increases accuracy by 13% for high-bandwidth clients, relative to a 100MB fixed-length test. Even accounting for clients that never observe enough samples to utilize the BBR signal, this adaptive approach still uses 25% less data than a fixed 100MB test with 37-44% higher accuracy.
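A rough sketch of the adaptive-termination idea: keep transferring only until the congestion controller's bandwidth estimate stops moving. The sampling callback, thresholds, and timings below are hypothetical, not NDT's or BBR's actual interfaces.

```python
import statistics
import time

def run_adaptive_speed_test(sample_bbr_bandwidth, max_seconds=10.0,
                            window=5, tolerance=0.05, interval=0.25):
    """End a throughput test early once the congestion-control bandwidth estimate
    stabilizes (illustrative; `sample_bbr_bandwidth` is a hypothetical callback
    returning the server's current BBR bandwidth sample in Mbps)."""
    samples, start = [], time.monotonic()
    while time.monotonic() - start < max_seconds:
        samples.append(sample_bbr_bandwidth())
        if len(samples) >= window:
            recent = samples[-window:]
            spread = (max(recent) - min(recent)) / max(statistics.mean(recent), 1e-9)
            if spread < tolerance:  # estimate has converged; stop transferring
                break
        time.sleep(interval)
    return statistics.median(samples[-window:]) if samples else 0.0
```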
Improving simulation-based origin-destination demand calibration using sample segment counts data
Arwa Alanqary
Yechen Li
The 12th Triennial Symposium on Transportation Analysis conference (TRISTAN XII), Okinawa, Japan (2025)
Abstract
This paper introduces a novel approach to demand estimation that utilizes partial observations of segment-level track counts. Building on established simulation-based demand estimation methods, we present a modified formulation that integrates sample track counts as a regularization term. This approach effectively addresses the underdetermination challenge in demand estimation, moving beyond the conventional reliance on a prior OD matrix. The proposed formulation aims to preserve the distribution of the observed track counts while optimizing the demand to align with observed path-level travel times. We tested this approach on Seattle's highway network with various congestion levels. Our findings reveal significant enhancements in the solution quality, particularly in accurately recovering ground truth demand patterns at both the OD and segment levels.
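In rough terms, the modified formulation described above can be written as a demand-estimation objective with a sample-count regularizer. The notation below is an illustrative reconstruction, not the paper's exact formulation.

```latex
% d: OD demand vector; t(d): simulated path travel times; \hat{t}: observed travel times;
% c(d): simulated counts on the observed segments; \hat{c}: sample segment counts;
% D(\cdot\,\|\,\cdot): a distributional distance; \lambda: regularization weight.
\min_{d \ge 0} \;\; \big\| t(d) - \hat{t} \big\|_2^2 \;+\; \lambda \, D\!\big( c(d) \,\big\|\, \hat{c} \big)
```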
Abstract
Augmenting LLMs with context leads to improved performance across many applications. Despite much research on Retrieval Augmented Generation (RAG) systems, an open question is whether errors arise because LLMs fail to utilize the context from retrieval or the context itself is insufficient to answer the query. To shed light on this, we develop a new notion of sufficient context, along with a way to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that proprietary LLMs (Gemini, GPT, Claude) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. On the other hand, open-source LLMs (Llama, Mistral, Gemma) hallucinate or abstain often, even with sufficient context. We further categorize cases where the context is useful and improves accuracy, even though it does not fully answer the query and the model errs without it. Building on our findings, we explore ways to reduce hallucinations in RAG systems, including a new selective generation method that leverages sufficient context information for guided abstention. Our method improves the fraction of correct answers among the cases where the model responds by 2-10% for Gemini, GPT, and Gemma.
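The guided-abstention idea can be sketched as combining a sufficiency signal with the model's own confidence before deciding to answer. This is an illustrative sketch, not the paper's exact method; `llm_answer` and `context_is_sufficient` are hypothetical callables.

```python
def selective_answer(query, retrieved_context, llm_answer, context_is_sufficient,
                     confidence_threshold=0.5):
    """Guided abstention (illustrative sketch).

    `llm_answer(query, context)` -> (answer_text, self_reported_confidence)
    `context_is_sufficient(query, context)` -> probability that the context
    contains enough information to answer; both are hypothetical callables.
    """
    answer, confidence = llm_answer(query, retrieved_context)
    sufficiency = context_is_sufficient(query, retrieved_context)
    # Abstain when the context looks insufficient and the model is not
    # confident, instead of risking a hallucinated answer.
    if sufficiency * confidence < confidence_threshold:
        return "I don't have enough information to answer that."
    return answer
```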
Scaling Laws for Downstream Task Performance in Machine Translation
Natalia Ponomareva
Hussein Hazimeh
Sanmi Koyejo
International Conference on Learning Representations (ICLR) (2025) (to appear)
Abstract
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by downstream cross-entropy and translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these cases, we provide new practical insights for choosing appropriate pretraining data.
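As a concrete illustration of what such a log-law can look like, one plausible form is shown below; the exact functional form and fitted constants are the paper's to define, so treat this as an assumption.

```latex
% Downstream translation quality f (e.g., BLEU or COMET) as a function of
% pretraining data size D_p, with fitted constants A, \alpha, \beta > 0.
f(D_p) \;\approx\; \Big( \log\!\big( A \cdot D_p^{\alpha} \big) \Big)^{\beta}
```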
Abstract
Browser fingerprinting is an online tracking technique that is being increasingly adopted for profiling and ad targeting purposes. While prior work has analyzed the prevalence and impact of browser fingerprinting on the Web, it has traditionally relied on large-scale automated crawls. Naturally, these cannot replicate real human interactions, e.g., solve CAPTCHAs, evade bot detectors, or operate behind login pages and paywalls. This prompts the question of whether the fingerprinting ecosystem is appreciably different in real-world browsing sessions. In this paper, we begin to address this question by designing and conducting a user study aimed at collecting actual telemetry data from real browsing sessions of 30 users.
We find that almost half of the fingerprinting websites identified from real user browsing sessions are missed by equivalent automated crawls. This is mainly because automated crawls fail to identify and visit authentication pages, are blocked by bot detectors, and/or fail to perform user interactions that specifically trigger browser fingerprinting scripts. We also find new fingerprinting vectors that are consistently present in fingerprinting scripts captured by real user browsing sessions yet missing from automated crawls. Finally, we assess the feasibility of collecting fingerprinting training data in a privacy-preserving way. We conclude that private models built on real user browsing sessions can detect browser fingerprinting more effectively than models trained on automated crawls alone, while simultaneously providing strong privacy guarantees to users.
Record Number of Members Visit U.S. Congress to Talk Tech Policy
IEEE Spectrum (2025)
Abstract
This IEEE Spectrum article reflects on advocacy for U.S. technological leadership during my Congressional visit through IEEE-USA. Leading an expert group of other distinguished IEEE members, we urged lawmakers to support critical initiatives. Key priorities included sustained funding for federal research institutions like NIST, NASA, and the NSF, reauthorizing the SBIR/STTR programs vital for small business innovation, and passing the CREATE AI Act to democratize AI resources by establishing the National AI Research Resource (NAIRR).
We also emphasized strengthening the STEM talent pipeline through the CHIPS and Science Act and expanding high-skilled immigrant visas. We highlighted rapid AI advancements, such as autonomous vehicles and the surge in FDA-approved AI-based medical devices, as underscoring the need for these strategic investments and policy actions. The article conveys a sense of urgency, calling for concrete congressional action to ensure the U.S. maintains its technological edge, while also sharing my personal experiences.
Abstract
Measuring software development can help drive impactful change. However, it’s a complex task, and getting started can be daunting because it involves understanding what you should measure and determining what you can measure. This article provides a guide to selecting a framework that aligns with organizational measurement strategy.
Deep Researcher with Test-time Diffusion
Guan Sun
Zoey CuiZhu
Yuanjun (Sophia) Bi
Weiming Wen
Hui Wan
Chunfeng Wen
Solène Maître
George Lee
Vishy Tirumalashetty
Emily Xue
Burak Gokturk
2025
Abstract
Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design guides the report writing process to be more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
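The draft-centric loop described above can be sketched as repeated cycles of planning, retrieval, and revision over an evolving draft. This is an illustrative sketch only: all four callables are hypothetical LLM/retrieval components, and the self-evolutionary step applied to each component is omitted.

```python
def ttd_dr_report(question, draft_fn, plan_fn, retrieve_fn, revise_fn, num_steps=8):
    """Draft-centric retrieval-and-revision loop in the spirit of TTD-DR (sketch)."""
    draft = draft_fn(question)                        # preliminary, updatable skeleton
    for _ in range(num_steps):
        queries = plan_fn(question, draft)            # what does the draft still lack?
        evidence = [retrieve_fn(q) for q in queries]  # fetch external information
        draft = revise_fn(question, draft, evidence)  # one "denoising" revision
    return draft                                      # final report
```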
A Scalable Framework for Evaluating Health Language Models
Neil Mallinar
Tony Faranesh
Brent Winslow
Nova Hammerquist
Ben Graef
Cathy Speed
Mark Malhotra
Shwetak Patel
Xavi Prieto
Daniel McDuff
Ahmed Metwally
(2025)
Abstract
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors and is often cost-prohibitive, labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubrics questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
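A minimal sketch of the precise-Boolean idea: score a response against a set of granular yes/no rubric items and surface the gaps. The adaptive selection of rubric items described in the abstract is omitted, and the rubric items and `judge` callable below are hypothetical.

```python
def boolean_rubric_score(response, rubric, judge):
    """Score a response against granular yes/no rubric items (illustrative sketch).

    `judge(item, response) -> bool` is a hypothetical evaluator, either a
    human rater or an automated judge model."""
    results = {item: judge(item, response) for item in rubric}
    score = sum(results.values()) / len(results)          # fraction of criteria met
    gaps = [item for item, passed in results.items() if not passed]
    return score, gaps

# Hypothetical rubric items for a metabolic-health response.
rubric = [
    "Does the response address the user's stated health goal?",
    "Are numeric claims consistent with the provided biomarkers?",
    "Is the advice free of unsafe or contraindicated recommendations?",
]
```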
Abstract
We propose moving from Software Defined Networks (SDN) to Software Managed Networks (SMN), where all information for managing the life cycle of a network (from deployment to operations to upgrades), across all layers (from Layer 1 through 7), is stored in a central repository. Crucially, a SMN also has a generalized control plane that, unlike SDN, controls all aspects of the cloud including traffic management (e.g., capacity planning) and reliability (e.g., incident routing) at both short (minutes) and long (years) time scales. Just as SDN allows better routing, a SMN improves visibility and enables cross-layer optimizations for faster response to failures and better network planning and operations. Implemented naively, SMN for planetary-scale networks requires orders of magnitude larger and more heterogeneous data (e.g., alerts, logs) than SDN. We address this using coarsening: mapping complex data to a more compact abstract representation that has approximately the same effect, and is more scalable, maintainable, and learnable. We show examples including Coarse Bandwidth Logs for capacity planning and Coarse Dependency Graphs for incident routing. Coarse Dependency Graphs improve an incident routing metric from 45% to 78%, while the same metric for a distributed approach like Scouts was 22%. We end by discussing how to realize SMN, and suggest cross-layer optimizations and coarsenings for other operational and planning problems in networks.
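One way to picture coarsening a dependency graph is to collapse fine-grained nodes into groups and keep only cross-group dependencies. The grouping function and the use of networkx below are illustrative assumptions, not the paper's construction of Coarse Dependency Graphs.

```python
import networkx as nx

def coarsen_dependency_graph(g: nx.DiGraph, group_of) -> nx.DiGraph:
    """Collapse a fine-grained dependency graph into a coarse one (sketch).

    `group_of(node)` maps each node to a coarse group (e.g., device role or
    layer); an edge is kept whenever any member of one group depends on any
    member of another."""
    coarse = nx.DiGraph()
    for node in g.nodes:
        coarse.add_node(group_of(node))
    for u, v in g.edges:
        gu, gv = group_of(u), group_of(v)
        if gu != gv:
            coarse.add_edge(gu, gv)
    return coarse
```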
Generative Quantum Advantage for Classical and Quantum Problems
Robert Huang
Michael Broughton
Norhan Eassa
arXiv:2509.09033 (2025)
Abstract
Recent breakthroughs in generative machine learning, powered by massive computational resources, have demonstrated unprecedented human-like capabilities. While beyond-classical quantum experiments can generate samples from classically intractable distributions, their complexity has thwarted all efforts toward efficient learning. This challenge has hindered demonstrations of generative quantum advantage: the ability of quantum computers to learn and generate desired outputs substantially better than classical computers. We resolve this challenge by introducing families of generative quantum models that are hard to simulate classically, are efficiently trainable, exhibit no barren plateaus or proliferating local minima, and can learn to generate distributions beyond the reach of classical computers. Using a 68-qubit superconducting quantum processor, we demonstrate these capabilities in two scenarios: learning classically intractable probability distributions and learning quantum circuits for accelerated physical simulation. Our results establish that both learning and sampling can be performed efficiently in the beyond-classical regime, opening new possibilities for quantum-enhanced generative models with provable advantage.
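To make the notion of a generative quantum model concrete, here is a toy example: sampling bitstrings from a shallow parameterized circuit using Cirq. This small, classically simulable example says nothing about the classically hard, 68-qubit models reported in the paper; it only illustrates the kind of sampling-based generative model being discussed.

```python
import cirq
import numpy as np

# Toy generative circuit: two layers of random single-qubit rotations and
# nearest-neighbor entangling gates, followed by measurement.
qubits = cirq.LineQubit.range(4)
rng = np.random.default_rng(0)
circuit = cirq.Circuit()
for _ in range(2):
    circuit.append(cirq.ry(rng.uniform(0, np.pi)).on(q) for q in qubits)
    circuit.append(cirq.CZ(qubits[i], qubits[i + 1]) for i in range(len(qubits) - 1))
circuit.append(cirq.measure(*qubits, key="m"))

samples = cirq.Simulator().run(circuit, repetitions=1000)
print(samples.histogram(key="m"))  # empirical distribution over the 16 bitstrings
```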
Our Approach to Protecting AI Training Data
Cindy Muya
Jason Novak
Cindee Madison
Reiner Critides
Ben Kamber
Niha Vempati
Jeremy Wiesner
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043 (2025)
Abstract
Google has over 25 years of experience protecting data from inappropriate access and unauthorized use. In the era of AI, Google has extended these best practices in data protection to ensure that the right data is used the right way to train models. This paper presents a number of these best practices, describes how Google applies them in its systems, and explains how Google Cloud customers can use Google Cloud capabilities to implement these practices themselves.
Protecting data requires both technical controls to enable safe data use at scale and governance processes to ensure that companies have visibility and control over how their data is used. This fundamentally requires: understanding data and ensuring it has sufficient metadata in the form of attributes, controlling the data and implementing policies to allow (or disallow) certain usage based on those attributes, transforming data to enable its usage in policy-compliant ways, and human oversight and governance.
Protecting data in AI inherits these requirements and introduces new requirements to account for unique AI-specific risks, including memorization/recitation and the costs of training foundational models. Addressing these new risks requires new capabilities, including an enhanced understanding of data and model lineage as well as an increased ability to control data usage through checks on data for policy compliance at the time a training job is configured, before it is run.
This white paper offers an in-depth look at data protection best practices and Google’s data protection capabilities, and is one of a series of publications about Google's Secure AI Framework (SAIF). Building upon its secure development practices, Google has developed and deployed a number of capabilities to understand, control, and transform data in its infrastructure so that data is both protected and used appropriately. This involves robust annotation systems to represent metadata and enable granular understanding of data at both an item and dataset level, policy engines that evaluate machine-readable policies on that data using the metadata attributes, and sensors to understand how data is flowing across Google’s systems and raise alerts when policy violations occur. Moreover, Google has developed de-identification and anonymization systems to transform data to make it policy-compliant and safer to use for AI training.
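A minimal sketch of the policy-engine step described above: a machine-readable rule over dataset attributes is evaluated at job-configuration time, before training runs. The attribute names and rule format are hypothetical, not Google's actual annotations or policy language.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    attributes: set  # metadata attributes, e.g. {"user_data", "consented_for_training"}

# Illustrative machine-readable policy: attributes that must (or must not) be
# present before the data may be used to train a model.
POLICIES = [
    {"require": {"consented_for_training"}, "forbid": {"payment_data"}},
]

def allowed_for_training(ds: Dataset) -> bool:
    """Evaluate the policies against a dataset's attributes at the time a
    training job is configured (sketch of the control step described above)."""
    return all(rule["require"] <= ds.attributes
               and rule["forbid"].isdisjoint(ds.attributes)
               for rule in POLICIES)

print(allowed_for_training(Dataset("chat_logs", {"user_data", "consented_for_training"})))  # True
```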
Circadian rhythm of heart rate and activity: a cross-sectional study
Maryam Khalid
Logan Schneider
Aravind Natarajan
Conor Heneghan
Karla Gleichauf
Chronobiology International (2025)
Abstract
Background: Circadian rhythms are commonly observed in a number of physiological processes. Consumer wearable devices have made it possible to obtain continuous time series data from a large number of individuals. We study circadian rhythms from measurements of heart rate, movement, and sleep collected from a cohort of nearly 20,000 participants over the course of 30 days.
Methods: Participation was restricted to Fitbit users of age 21 years or older residing in the United States or Canada. Participants were enrolled through a recruitment banner shown on the Fitbit App. The advertisement was shown to 531,359 Fitbit users, and 23,239 enrolled in the program. Of these, we obtained heart rate data from 19,350 participants. We obtain the underlying circadian rhythm from the heart rate time series by modeling it as a sum of the first two Fourier harmonics. The first Fourier harmonic accounts for the 24-hour rhythmicity, while the second harmonic accounts for non-sinusoidal perturbations.
Findings: We observe a circadian rhythm in both heart rate and acceleration. From the diurnal modulation, we obtain the following circadian parameters: (i) amplitude of modulation, (ii) bathyphase, (iii) acrophase, (iv) non-sinusoidal fraction, and (v) fraction of the day when the heart rate is greater than the mean. The amplitude, bathyphase, and acrophase depend on sex and decrease with age. The wake time, on average, follows the bathyphase by 2.4 hours. In most individuals, the circadian rhythm of heart rate lags the circadian rhythm of activity.
Interpretation: Circadian metrics for heart rate and activity can be reliably obtained from commercially available wearable devices. Distributions of circadian metrics can be valuable tools for individual-level interpretation.
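A minimal sketch of the two-harmonic fit described in the Methods paragraph above, assuming heart rate samples indexed by clock time in hours; the study's actual preprocessing and parameter definitions are not reproduced here.

```python
import numpy as np

def fit_circadian_harmonics(t_hours, heart_rate):
    """Least-squares fit of the first two Fourier harmonics of a 24-hour rhythm (sketch)."""
    w = 2 * np.pi / 24.0                       # fundamental frequency (one cycle per 24 h)
    X = np.column_stack([
        np.ones_like(t_hours),
        np.cos(w * t_hours), np.sin(w * t_hours),         # first harmonic (24 h)
        np.cos(2 * w * t_hours), np.sin(2 * w * t_hours)  # second harmonic (12 h)
    ])
    coeffs, *_ = np.linalg.lstsq(X, heart_rate, rcond=None)
    mean, a1, b1, a2, b2 = coeffs
    amplitude = np.hypot(a1, b1)               # amplitude of the 24-h modulation
    acrophase = (np.arctan2(b1, a1) / w) % 24  # clock time of the first-harmonic peak
    return mean, amplitude, acrophase
```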