Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

    WindowMirror is a framework for using XR headsets in productivity scenarios. The toolkit provides users with simulated, extended screen real estate. It allows users to interact with multiple desktop applications in real time within an XR environment. Our architecture has two main modules, a Unity package and a Python backend, which makes the toolkit easy to use and extend. WindowMirror supports traditional desktop interaction methods such as mouse, keyboard, and hand tracking. Furthermore, it features a Cylindrical Window Layout, an emerging design pattern that is particularly effective for single-user, egocentric perspectives. The introduction of WindowMirror aims to set a foundation for future research in XR screen-focused productivity scenarios.
    Motivated by recent advances in large language models for NLP, we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of datasets matches the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder-style attention model on a large time-series dataset, and it works well across different forecasting history lengths, prediction lengths, and temporal granularities.
    In this work we investigate the impact of a large-scale self-supervised pretraining strategy for active speaker detection (ASD) on an unlabeled dataset consisting of over 125k hours of YouTube videos. Compared to a baseline trained from scratch on much smaller in-domain labeled datasets, we show that pretraining not only yields more stable supervised training, thanks to better audio-visual features used for initialization, but also improves ASD mean average precision by 23% on a challenging dataset collected with Google Nest Hub Max devices capturing real user interactions.
    Stable quantum-correlated many-body states through engineered dissipation
    Xiao Mi
    Alexios Michailidis
    Sara Shabani
    Jerome Lloyd
    Rajeev Acharya
    Igor Aleiner
    Trond Andersen
    Markus Ansmann
    Frank Arute
    Kunal Arya
    Juan Atalaya
    Gina Bortoli
    Alexandre Bourassa
    Leon Brill
    Michael Broughton
    Bob Buckley
    Tim Burger
    Nicholas Bushnell
    Jimmy Chen
    Benjamin Chiaro
    Desmond Chik
    Charina Chou
    Josh Cogan
    Roberto Collins
    Paul Conner
    William Courtney
    Alex Crook
    Ben Curtin
    Alejo Grajales Dau
    Dripto Debroy
    Agustin Di Paolo
    Ilya Drozdov
    Andrew Dunsworth
    Lara Faoro
    Edward Farhi
    Reza Fatemi
    Vinicius Ferreira
    Ebrahim Forati
    Brooks Foxen
    Élie Genois
    William Giang
    Dar Gilboa
    Raja Gosula
    Steve Habegger
    Michael Hamilton
    Monica Hansen
    Sean Harrington
    Paula Heu
    Markus Hoffmann
    Trent Huang
    Ashley Huff
    Bill Huggins
    Sergei Isakov
    Justin Iveland
    Cody Jones
    Pavol Juhas
    Kostyantyn Kechedzhi
    Marika Kieferova
    Alexei Kitaev
    Andrey Klots
    Alexander Korotkov
    Fedor Kostritsa
    John Mark Kreikebaum
    Dave Landhuis
    Pavel Laptev
    Kim Ming Lau
    Lily Laws
    Joonho Lee
    Kenny Lee
    Yuri Lensky
    Alexander Lill
    Wayne Liu
    Orion Martin
    Amanda Mieszala
    Shirin Montazeri
    Alexis Morvan
    Ramis Movassagh
    Wojtek Mruczkiewicz
    Charles Neill
    Ani Nersisyan
    Michael Newman
    JiunHow Ng
    Murray Ich Nguyen
    Tom O'Brien
    Alex Opremcak
    Andre Petukhov
    Rebecca Potter
    Leonid Pryadko
    Charles Rocque
    Negar Saei
    Kannan Sankaragomathi
    Henry Schurkus
    Christopher Schuster
    Mike Shearn
    Aaron Shorter
    Noah Shutty
    Vladimir Shvarts
    Jindra Skruzny
    Clarke Smith
    Rolando Somma
    George Sterling
    Doug Strain
    Marco Szalay
    Alfredo Torres
    Guifre Vidal
    Cheng Xing
    Jamie Yao
    Ping Yeh
    Juhwan Yoo
    Grayson Young
    Yaxing Zhang
    Ningfeng Zhu
    Jeremy Hilton
    Anthony Megrant
    Yu Chen
    Vadim Smelyanskiy
    Dmitry Abanin
    Science, 383 (2024), pp. 1332-1337
    Engineered dissipative reservoirs have the potential to steer many-body quantum systems toward correlated steady states useful for quantum simulation of high-temperature superconductivity or quantum magnetism. Using up to 49 superconducting qubits, we prepared low-energy states of the transverse-field Ising model through coupling to dissipative auxiliary qubits. In one dimension, we observed long-range quantum correlations and a ground-state fidelity of 0.86 for 18 qubits at the critical point. In two dimensions, we found mutual information that extends beyond nearest neighbors. Lastly, by coupling the system to auxiliaries emulating reservoirs with different chemical potentials, we explored transport in the quantum Heisenberg model. Our results establish engineered dissipation as a scalable alternative to unitary evolution for preparing entangled many-body states on noisy quantum processors.
    AI-Enhanced API Design: A New Paradigm in Usability and Efficiency
    Mak Ahmad
    David R Karger
    Kwan-Liu Ma
    CHI EA '24: Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (2024)
    This study uses mixed methods to evaluate API design methods, focusing on the design and consumption phases. Our goal was to understand the impact of API governance approaches on productivity and usability. A controlled developer experiment (n=34) demonstrated a 10% increase in requirement fulfillment when using API Improvement Proposals (AIPs) and a linter versus no protocols. Meanwhile, 73% of 33 surveyed API consumers preferred AIP-aligned designs for their usability and comprehensibility. Complementing this, a custom large language model called the API Architect received average expert ratings of just 5/10 for specification quality, revealing gaps versus manual design. The quantitative performance metrics, combined with qualitative user feedback, provide evidence from multiple angles that strategically integrating industry best practices with maturing AI capabilities can meaningfully improve API design outcomes. This research offers empirical insights from developer and consumer perspectives to advance scholarly discourse and industry practice regarding optimal API design workflows.
    Specialized large multimodal models (LMMs) have exhibited remarkable performance across numerous tasks; however, generalist LMMs suffer from performance degradation when trained on a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) models help with instruction tuning, but for LMMs on the order of 50-100B parameters, the prohibitive cost of replicating and storing the expert models severely limits the number of experts that can be used. We propose Omni-SMoLA, which softly mixes many multimodal low-rank experts into large models without introducing a significant number of new parameters compared to conventional MoE models. The core idea is that the large model provides a foundational backbone while different lightweight experts learn specialized knowledge residually. Extensive experiments demonstrate that the SMoLA approach improves generalist performance across a broad range of visual question answering and captioning tasks, achieving new state-of-the-art generalist performance that matches or outperforms single specialized LMM baselines.
    Machine learning has a pseudoscience problem. An abundance of ethical issues arising from the use of machine learning (ML)-based technologies—by now, well documented—is inextricably entwined with the systematic epistemic misuse of these tools. We take a recent resurgence of deep learning-assisted physiognomic research as a case study in the relationship between ML-based pseudoscience and attendant social harms—the standard purview of “AI ethics.” In practice, the epistemic and ethical dimensions of ML misuse often arise from shared underlying reasons and are resolvable by the same pathways. Recent use of ML toward the ends of predicting protected attributes from photographs highlights the need for philosophical, historical, and domain-specific perspectives of particular sciences in the prevention and remediation of misused ML.
    Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a lightweight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses this proposed MRF model to both speed up Muse by 1.5x and produce higher-quality images by reducing undesirable image artifacts.
    This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks such as quality assessment, as well as in generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics the human capability to assess audio quality with non-matching references, learning perceptual embeddings without the need for human-generated labels.
    Visual Program Tuning: Training Large Multimodal Models to Reason like Programs
    Yushi Hu
    Krishna Viswanathan
    Kenji Hata
    Enming Luo
    Ranjay Krishna
    Ariel Fuxman
    Conference on Computer Vision and Pattern Recognition (2024)
    Solving complex visual tasks (e.g., “Who invented the musical instrument on the right?”) involves back-and-forth between visual processing and reasoning. Visual programming is a recent multimodal framework that has shown promise in conducting visual reasoning in an interpretable and compositional manner. However, this framework is error-prone: it can lead to a wrong answer whenever the program itself is wrong or any step of the program is solved incorrectly, leading to worse overall performance than end-to-end systems trained with labeled data. Moreover, it is inefficient to involve multiple steps (i.e., generating and then running programs) during inference. Ideally, a single large multimodal model (LMM) should directly conduct similar reasoning and yield the correct answer. In this work, we propose Visual Program Tuning (VPT), which leverages visual programs for teaching LMMs to reason via instruction tuning. VPT rewrites the execution traces of visual programs as chain-of-thought reasoning steps, and tunes an LMM to output not only the label but also its reasoning. Extensive experiments on complex vision tasks show that models trained with VPT achieve state-of-the-art accuracy while being able to produce interpretable and faithful reasoning steps. PaLI-X + VPT outperforms all existing LMMs on a wide range of visual tasks, improving performance on counting, spatial relations, and compositional reasoning tasks. VPT is also helpful for quick adaptation to new tasks. Our experiments on content moderation show that fine-tuning LMMs with program-augmented examples is more sample-efficient than traditional supervised training.
    This paper presents a novel approach to training a direct speech-to-speech translation model from monolingual datasets only, in a fully unsupervised manner. The proposed approach combines back-translation, denoising autoencoder, and unsupervised embedding mapping techniques to achieve this goal. We demonstrate the effectiveness of the proposed approach by comparing it against a cascaded baseline using two Spanish and English datasets. The proposed approach achieved a significant improvement over the cascaded baseline on synthesized unpaired conversational and synthesized Common Voice 11 datasets.
    Privacy-Preserving Instructions for Aligning Large Language Models
    Da Yu
    Sewoong Oh
    International Conference on Machine Learning (ICML) (2024)
    Service providers of large language model (LLM) applications collect user instructions in the wild and use them to further align LLMs with users' intentions. These instructions, which potentially contain sensitive information, are annotated by human workers in the process. This poses a new privacy risk not addressed by typical private optimization. To this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. Formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. Crucial to achieving the desired utility is our novel filtering algorithm that matches the distribution of the synthetic instructions to that of the real ones. In both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing results comparable to real instructions. In supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as Vicuna.
    Dynamic Inference of Likely Symbolic Tensor Shapes in Python Machine Learning Programs
    Koushik Sen
    International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (2024) (to appear)
    In machine learning programs, it is often tedious to annotate the dimensions of the shapes of the various tensors that are created during execution. We present a dynamic analysis that infers likely tensor shapes and annotates the dimensions of tensor expressions with symbolic dimension values. Such annotations can be used for understanding machine learning code written in popular frameworks such as TensorFlow, PyTorch, and JAX, and for finding bugs related to tensor shape mismatch.
    AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation
    Yuanwen Yue
    Sabarinath Mahadevan
    Jonas Schult
    Francis Engelmann
    Bastian Leibe
    Konrad Schindler
    Theodora Kontogianni
    ICLR (2024)
    During interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. The current best practice formulates the problem as binary classification and segments objects one at a time. The model expects the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks on regions wrongly assigned to the object. Sequentially visiting objects is wasteful since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, a direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. Our core idea is to encode user clicks as spatial-temporal queries and enable explicit interactions between click queries as well as between them and the 3D scene through a click attention module. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different 3D point cloud datasets, AGILE3D sets a new state-of-the-art. We also verify its practicality in real-world setups with real user studies. Project page: https://ywyue.github.io/AGILE3D.
    Beyond the Code: AI Regulations as the Secret Compass of Engineering Managers
    Proceedings of the American Society for Engineering Management 2024 International Annual Conference (2024)
    Technology is a product of society. As technology evolves, the norms governing it must mature to enable its proper use within society. Interest in Artificial Intelligence (AI) has surged following the introduction of ChatGPT, and firms both large and small are competing to develop new products and solutions involving AI. Amidst these developments, leading corporations such as Google and Microsoft have proactively committed to responsible innovation in AI development. Governments worldwide are responding with guidelines and regulations; notably, in March 2024, the United Nations General Assembly (UNGA) adopted a landmark resolution on AI. At the heart of these developments are engineering managers who leverage technical advances to build products and services that create value. To effectively harness AI for human benefit, engineering managers must be aware of these evolving regulations. Some, such as the Digital Markets Act (DMA) and the General Data Protection Regulation (GDPR), have far-reaching consequences for organizations globally. A working knowledge of these statutory requirements will enable engineering managers to identify the opportunities and constraints in leveraging AI while building products and services, and to make informed decisions about data collection methods, model training processes, the deployment of AI systems, and metrics for their evaluation. At scale, it can become a competitive advantage for the firms they work in, as explored through real-world examples in this paper.