Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 10822 publications
mmMUSE: An mmWave-based Motion-resilient Universal Speech Enhancement System
Chenming He
Yanyong Zhang
Kai Wang
Dequan Wang
Lingyu Wang
the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), ACM (2026) (to appear)
Preview abstract
Voice-based smart systems can greatly enhance user experiences by allowing higher-quality interactions through better voice perception. Speech enhancement can benefit such systems by isolating noise from speech. Recently, integrating millimeter-wave (mmWave) with audio for speech perception has gained increasing attention due to microphones' limitations in noisy environments. However, mmWave-based vocal extraction is severely affected by motion, which disperses vocal signals across ranges and introduces distortions. In this paper, we propose an mmWave-based motion-resilient universal speech enhancement system called mmMUSE, which fuses mmWave and audio signals. To mitigate motion interference, we develop a Doppler-based method for motion-robust vocal signal extraction. Moreover, by introducing the Vocal-Noise-Ratio metric to assess the prominence of vocal signals from mmWave, we achieve real-time voice activity detection that gains 3.81 dB of SISDR in noisy speeches. Additionally, we design a two-stage complex-valued network that includes an attention-based fusion network for cross-modal complementing and a time-frequency masking network for correcting amplitude and phase of speech to isolate noises.
Using mmWave and audio datasets from 46 participants, mmMUSE outperforms the state-of-the-art speech enhancement models, achieving an average SISDR improvement of 3.12 dB. Additionally, mmMUSE achieves SISDR improvements of 16.51 dB, 17.93 dB, 14.93 dB, and 18.95 dB in controlled environments involving intense noise, extensive motion, multiple speakers, and various obstructive materials, respectively. Finally, we evaluate mmMUSE in real-world scenarios including running, public spaces, and driving, maintaining a word error rate (WER) below 10%.
View details
Preview abstract
AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
View details
Preview abstract
For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using ``quantum read-only memory.'' We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order or magnitude or more for a variety reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
View details
Preview abstract
Buffered Linear Toeplitz (BLT) matrices are a family of parameterized lower-triangular matrices that play an important role in streaming differential privacy with correlated noise. Our main result is a BLT inversion theorem: the inverse of a BLT matrix is itself a BLT matrix with different parameters. We also present an efficient and differentiable O(d^3) algorithm to compute the parameters of the inverse BLT matrix, where d is the degree of the original BLT (typically d < 10). Our characterization enables direct optimization of BLT parameters for privacy mechanisms through automatic differentiation.
View details
CountQA: How Well Do MLLMs Count in the Wild?
Jayant Tamarapalli
Rynaa Grover
Nilay Pande
Sahiti Yerramilli
arXiv preprint arXiv:2508.06585 (2025)
Preview abstract
While Multimodal Large Language Models (MLLMs) display a remarkable fluency in describing visual scenes, their ability to perform the fundamental task of object counting remains poorly understood. This paper confronts this issue by introducing CountQA, a challenging new benchmark composed of over 1,500 question-answer pairs centered on images of everyday, real-world objects, often in cluttered and occluded arrangements. Our evaluation of 15 prominent MLLMs on CountQA systematically investigates this weakness, revealing a critical failure of numerical grounding: the models consistently struggle to translate raw visual information into an accurate quantity. By providing a dedicated tool to probe this foundational weakness, CountQA paves the way for the development of more robust and truly capable MLLMs that are spatially aware and numerically grounded.
View details
Preview abstract
Computer use agents (CUAs) need to plan long-horizon task workflows grounded in diverse, ever-changing applications and environments, but learning is hindered by the scarcity of large-scale, high-quality training data. Existing datasets are small, domain-specific, and costly to annotate, while current synthetic data generation methods often yield brittle, simplistic, or misaligned task demonstrations.
We introduce Watch & Learn (W&L), a framework that transforms human demonstration videos available in the Internet into executable UI trajectories at scale. Inspired by robotics, we train an inverse dynamics model that accurately predicts user actions from consecutive screens, bypassing the need for complex heuristics. To scale to the web, we curate a large state-transition corpus and design a retrieval framework that identifies relevant video tutorials, enabling automatic conversion of raw videos into structured UI trajectories without requiring manual annotations. Beyond training data, we show that the generated UI trajectories can also serve as in-context exemplars, providing CUAs with long-horizon priors and domain-specific knowledge at inference time.
On the challenging OSWorld and Mind2Web benchmarks, UI trajectories extracted with W&L consistently improve both general-purpose and state-of-the-art frameworks when used in-context, and delivers stronger gains for open-source models when used in training. These results highlight web-scale human demonstration videos as a practical and scalable foundation for advancing CUAs towards real-world deployment.
View details
Preview abstract
We investigate Learning from Label Proportions (LLP), a partial information setting where examples in a training set are grouped into bags, and only aggregate label values in each bag are available. Despite the partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal. From an algorithmic viewpoint, we rely on carefully designed variants of Empirical Risk Minimization, and Stochastic Gradient Descent algorithms, combined with ad hoc variance reduction techniques. On one hand, our theoretical results improve in important ways on the existing literature on LLP, specifically in the way the sample complexity depends on the bag size. On the other hand, we validate our algorithmic solutions on several datasets, demonstrating improved empirical performance (better accuracy for less samples) against recent baselines.
View details
Position Paper: Graph Learning Loses Relevance Due To Poor Benchmarks
Maya Bechler-Speicher
Ben Finkelshtein
Fabrizio Frasca
Luis Müller
Jan Tönshoff
Antoine Siraudin
Viktor Zaverkin
Michael Bronstein
Mathias Niepert
Michael Galkin
Christopher Morris
2025
Preview abstract
While machine learning on graphs has demonstrated promise in drug design and molecular property prediction, significant benchmarking challenges hinder its further progress and relevance. Current benchmarking practices often lack focus on transformative, real-world applications, favoring narrow domains like two-dimensional molecular graphs over broader, impactful areas such as combinatorial optimization, relational databases, or chip design. Additionally, many benchmark datasets poorly represent the underlying data, leading to inadequate abstractions and misaligned use cases. Fragmented evaluations and an excessive focus on accuracy further exacerbate these issues, incentivizing overfitting rather than fostering generalizable insights. These limitations have prevented the development of truly useful graph foundation models. This position paper calls for a paradigm shift toward more meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to drive impactful and reliable advances in graph learning research, unlocking the potential of graph learning.
View details
Preview abstract
As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is imperative for maximizing user satisfaction and retention. However, lay users are notoriously bad at prompt specification and often struggle with conveying their latent preferences to AI assistants. To resolve this, we demonstrate that activation steering, an inference-time method, can effectively control the response of the LLMs towards expressing different preferences. In contrast to memory-based personalization methods that require long user history, steering is extremely lightweight and easily-controllable via an interpretable linear strength factor. We further conduct a within-subjects user study (n=14) to investigate how end users personalize their conversations through three different steerable chatbot interfaces. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with user preferences, and we discuss qualitative findings on how diverse values around control, transparency, and usability of personalization lead users to prefer different interfaces.
View details
Preview abstract
The quest to identify quantum advantages, where quantum physics truly outperforms classical physics, lies at the heart of quantum technology. While quantum devices promise extraordinary capabilities, from exponential computational speedups to unprecedented measurement precision, distinguishing genuine advantages from mere illusions remains a formidable challenge. In this endeavor, quantum theorists are like prophets trying to foretell a future where quantum technologies reign supreme. Yet, the boundary between visionary insight and unfounded fantasy is perilously thin. In this perspective, we explore the properties defining an ideal quantum advantage and examine our mathematical tools for navigating the vast world of quantum advantages across computation, learning, sensing, communication, and beyond. We show that some quantum advantages are inherently unpredictable using classical resources alone, suggesting a landscape far richer than what we can currently foresee. While mathematical rigor remains our indispensable guide in this exploration, the ultimate power of quantum technologies may emerge from the quantum advantages we cannot yet conceive.
View details
EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration
Preview
Ibrahim Ahmed
Clemens Schaefer
Denis Vnukov
Zenong Zhang
Toli Yevtushenko
2025
Preview abstract
We study a variant of the job-shop scheduling problem, where unit-sized jobs must be assigned to a single machine, but the capacity of the machine changes over time.
View details
Preview abstract
Fine-tuning language models (LMs) with the standard Adam optimizer often demands excessive memory, limiting accessibility. The "in-place" version of Stochastic Gradient Descent (IP-SGD) and Memory-Efficient Zeroth-order Optimizer (MeZO) have been proposed as solutions to improve memory efficiency. However, IP-SGD still requires a decent amount of memory, and MeZO suffers from slow convergence and degraded final performance due to its zeroth-order nature. This paper introduces Addax, a novel method that improves both memory efficiency and algorithm performance of IP-SGD by integrating it with MeZO. Specifically, Addax computes the zeroth-order or first-order gradient of the data points in the mini-batch based on their memory consumption and combines zeroth- and first-order gradient estimates to obtain the updated direction in each step. By computing the zeroth-order order gradient of data points that require more memory and the first-order gradient of the ones that require less memory, Addax overcomes the slow convergence of MeZO and excessive memory requirement of IP-SGD. Additionally, the zeroth-order gradient acts as a regularizer for the first-order gradient, further enhancing the model's final performance. Theoretically, we establish the convergence of Addax under mild assumptions, demonstrating faster convergence and less restrictive hyperparameter choices than MeZO. Our extensive experiments with diverse LMs and tasks show that Addax consistently outperforms MeZO in terms of accuracy and convergence speed, while having a comparable memory footprint. In particular, our experiments using one A100 GPU on OPT-13B model reveal that, on average, Addax outperforms MeZO in terms of accuracy/F1 score by 14%, and runs 15x faster, while having a comparable memory footprint to MeZO. In our experiments on the larger OPT-30B model, on average, Addax outperforms MeZO in terms of accuracy/F1 score by >16% and runs 30x faster on a single H100 GPU. Moreover, Addax surpasses the performance of standard fine-tuning approaches, such as IP-SGD and Adam, in most tasks in terms of Accuracy/F1 score with significantly less memory requirement.
View details
Binamix -- A Python Library for Generating Binaural Audio Datasets
Dan Barry
Davoud Shariat Panah
Alessandro Ragano
Andrew Hines
AES 158th Audio Engineering Society Convention (2025)
Preview abstract
The increasing demand for spatial audio in applications such as virtual reality, immersive media, and spatial audio research necessitates robust solutions to generate binaural audio data sets for use in testing and validation. Binamix is an open-source Python library designed to facilitate programmatic binaural mixing using the extensive SADIE II Database, which provides Head Related Impulse Response (HRIR) and Binaural Room Impulse Response (BRIR) data for 20 subjects. The Binamix library provides a flexible and repeatable framework for creating large-scale spatial audio datasets, making it an invaluable resource for codec evaluation, audio quality metric development, and machine learning model training. A range of pre-built example scripts, utility functions, and visualization plots further streamline the process of custom pipeline creation. This paper presents an overview of the library’s capabilities, including binaural rendering, impulse response interpolation, and multi-track mixing for various speaker layouts. The tools utilize a modified Delaunay triangulation technique to achieve accurate HRIR/BRIR interpolation where desired angles are not present in the data. By supporting a wide range of parameters such as azimuth, elevation, subject Impulse Responses (IRs), speaker layouts, mixing controls, and more, the library enables researchers to create large binaural datasets for any downstream purpose. Binamix empowers researchers and developers to advance spatial audio applications with reproducible methodologies by offering an open-source solution for
binaural rendering and dataset generation. We release the library under the Apache 2.0 License at https://github.com/QxLabIreland/Binamix/
View details
Society-Centric Product Innovation in the Era of Customer Obsession
International Journal of Science and Research Archive (IJSRA), Volume 14 - Issue 1 (2025)
Preview abstract
This article provides a comprehensive analysis of the evolving landscape of innovation in the technology sector, with a focus on the intersection of technological progress and social responsibility. The article explores key challenges facing the industry, including public trust erosion, digital privacy concerns, and the impact of automation on workforce dynamics. It investigates responsible innovation frameworks' emergence and implementation across various organizations, highlighting the transformation from traditional development approaches to more society-centric models. The article demonstrates how companies balance innovation speed with social responsibility, incorporate ethical considerations into their development processes, and address digital disparities across different demographics. By examining how companies balance the pace of innovation with ethical responsibilities, integrate social considerations into their processes, and address digital inequities across diverse demographics, the article underscores the transformative potential of these frameworks. Through insights into cross-functional teams, impact assessment tools, and stakeholder engagement strategies, it demonstrates how responsible innovation drives both sustainable business value and societal progress.
View details