Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 10129 publications
Preview abstract
Text-to-image diffusion models have demonstrated remarkable capabilities in transforming textual prompts into coherent images, yet the computational cost of their inference remains a persistent challenge. To address this issue, we present UFOGen, a novel generative model designed for ultra-fast, one-step text-to-image synthesis. In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models, UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models, UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step. Beyond traditional text-to-image generation, UFOGen showcases versatility in applications. Notably, UFOGen stands among the pioneering models enabling one-step text-to-image generation and diverse downstream tasks, presenting a significant advancement in the landscape of efficient generative models.
View details
ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout Transition
Brian Moreno Collins
Karthik Ramani
Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 16 (to appear)
Preview abstract
Remote video conferencing systems (RVCS) are widely adopted in personal and professional communication. However, they often lack the co-presence experience of in-person meetings. This is largely due to the absence of intuitive visual cues and clear spatial relationships among remote participants, which can lead to speech interruptions and loss of attention. This paper presents ChatDirector, a novel RVCS that overcomes these limitations by incorporating space-aware visual presence and speech-aware attention transition assistance. ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene. We also contribute a decision tree algorithm that directs the avatar layouts and behaviors based on participants' speech states. We report on results from a user study (N=16) where we evaluated ChatDirector. The satisfactory algorithm performance and complimentary subject user feedback imply that ChatDirector significantly enhances communication efficacy and user engagement.
View details
Creative ML Assemblages: The Interactive Politics of People, Processes, and Products
Ramya Malur Srinivasan
Katharina Burgdorf
Jennifer Lena
ACM Conference on Computer Supported Cooperative Work and Social Computing (2024) (to appear)
Preview abstract
Creative ML tools are collaborative systems that afford artistic creativity through their myriad interactive relationships. We propose using ``assemblage thinking" to support analyses of creative ML by approaching it as a system in which the elements of people, organizations, culture, practices, and technology constantly influence each other. We model these interactions as ``coordinating elements" that give rise to the social and political characteristics of a particular creative ML context, and call attention to three dynamic elements of creative ML whose interactions provide unique context for the social impact a particular system as: people, creative processes, and products. As creative assemblages are highly contextual, we present these as analytical concepts that computing researchers can adapt to better understand the functioning of a particular system or phenomena and identify intervention points to foster desired change. This paper contributes to theorizing interactions with AI in the context of art, and how these interactions shape the production of algorithmic art.
View details
Preview abstract
In this paper we study users' opinions about the privacy of their mobile health apps. We look at what they write in app reviews in the 'Health & Fitness' category on the Google Play store. We identified 2832 apps in this category (based on 1K minimum installs). Using NLP/LLM analyses, we find that 76% of these apps have at least some privacy reviews. In total this yields over 164,000 reviews about privacy, from over 150 countries and in 25 languages. Our analyses identifies top themes and offers an approximation of how widespread these issues are around the world. We show that the top 4 themes - Data Sharing and Exposure, Permission Requests, Location Tracking and Data Collection - are issues of concern in over 70 countries. Our automatically generated thematic summaries reveal interesting aspects that deserve further research around user suspicions (unneeded data collection), user requests (more fine-grained control over data collection and data access), as well as user behavior (uninstalling apps).
View details
Learning from Label Proportions: Bootstrapping Supervised Learners via Belief Propagation
Shreyas Havaldar
The Twelfth International Conference on Learning Representations (ICLR) (2024)
Preview abstract
Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines for the LLP Binary Classification problem on various dataset types - Small Tabular, Large Tabular and Images. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples.
View details
Media Mix Model Calibration With Bayesian Priors
Mike Wurm
Brenda Price
Ying Liu
research.google (2024)
Preview abstract
Effective model calibration is a critical and indispensable component in developing Media Mix Models (MMMs). One advantage of Bayesian-based MMMs lies in their capacity to accommodate the information from experiment results and the modelers' domain knowledge about the ad effectiveness by setting priors for the model parameters. However, it remains ambiguous about how and which Bayesian priors should be tuned for calibration purpose. In this paper, we propose a new calibration method through model reparameterization. The reparameterized model includes Return on Ads Spend (ROAS) as a model parameter, enabling straightforward adjustment of its prior distribution to align with either experiment results or the modeler's prior knowledge. The proposed method also helps address several key challenges regarding combining MMMs and incrementality experiments. We use simulations to demonstrate that our approach can significantly reduce the bias and uncertainty in the resultant posterior ROAS estimates.
View details
Context-aware Transliteration of Romanized South Asian Languages
Christo Kirov
Computational Linguistics, 50 (2) (2024), 475–534
Preview abstract
While most transliteration research is focused on single tokens such as named entities -- e.g., transliteration of "અમદાવાદ" from the Gujarati script to the Latin script "Ahmedabad" -- the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as via mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models finetuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 4.8% absolute (27.1% relative) mean word-error rate reduction.
View details
Visual Grounding for User Interfaces
Yijun Qian
Yujie Lu
Alexander G. Hauptmann
2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) - Industry Track
Preview abstract
Enabling autonomous language agents to drive application user interfaces (UIs) as humans do can significantly expand the capability of today's API-based agents. Essential to this vision is the ability of agents to ground natural language commands to on-screen UI elements. Prior UI grounding approaches work by relaying on developer-provided UI metadata (UI trees, such as web DOM, and accessibility labels) to detect on-screen elements. However, such metadata is often unavailable or incomplete. Object detection techniques applied to UI screens remove this dependency, by inferring location and types of UI elements directly from the UI's visual appearance. The extracted semantics, however, are too limited to directly enable grounding. We overcome the limitations of both approaches by introducing the task of visual UI grounding, which unifies detection and grounding. A model takes as input a UI screenshot and a free-form language expression, and must identify the referenced UI element. We propose a solution to this problem, LVG, which learns UI element detection and grounding using a new technique called layout-guided contrastive learning, where the semantics of individual UI objects are learned also from their visual organization. Due to the scarcity of UI datasets, LVG integrates synthetic data in its training using multi-context learning. LVG outperforms baselines pre-trained on much larger datasets by over 4.9 points in top-1 accuracy, thus demonstrating its effectiveness.
View details
Generalizing Tree-Level Sap Flow Across the European Continent
Ralf Loritz
Chen Huan Wu
Daniel Klotz
Martin Gauch
Frederik Kratzert
Maoya Bassiouni
Geophysical Research Letters (2024)
Preview abstract
Sap flow offers key insights about transpiration dynamics and forest-climate interactions. Accurately simulating sap flow remains challenging due to measurement uncertainties and interactions between global and local environmental controls. Addressing these complexities, this study leveraged Long Short-Term Memory networks (LSTMs) with SAPFLUXNET to predict hourly tree-level sap flow across Europe. We built models with diverse training sets to assess performance under previously unseen conditions. The average Kling-Gupta Efficiency was 0.77 for models trained on 50% of time series across all forest stands, and 0.52 for models trained on 50% of the forest stands. Continental models not only matched but surpassed the performance of specialized and baselines for all genera and forest types, showcasing the capacity of LSTMs to effectively generalize across tree genera, climates, and forest ecosystems given minimal inputs. This study underscores the potential of LSTMs in generalizing state-dependent ecohydrological processes and bridging tree level measurements to continental scales.
View details
Dynamics of magnetization at infinite temperature in a Heisenberg spin chain
Trond Andersen
Rhine Samajdar
Andre Petukhov
Jesse Hoke
Dmitry Abanin
ILYA Drozdov
Xiao Mi
Alexis Morvan
Charles Neill
Rajeev Acharya
Richard Ross Allen
Kyle Anderson
Markus Ansmann
Frank Arute
Kunal Arya
Juan Atalaya
Gina Bortoli
Alexandre Bourassa
Leon Brill
Michael Broughton
Bob Buckley
Tim Burger
Nicholas Bushnell
Juan Campero
Hung-Shen Chang
Jimmy Chen
Benjamin Chiaro
Desmond Chik
Josh Cogan
Roberto Collins
Paul Conner
William Courtney
Alex Crook
Ben Curtin
Agustin Di Paolo
Andrew Dunsworth
Clint Earle
Lara Faoro
Edward Farhi
Reza Fatemi
Vinicius Ferreira
Ebrahim Forati
Brooks Foxen
Gonzalo Garcia
Élie Genois
William Giang
Dar Gilboa
Raja Gosula
Alejo Grajales Dau
Steve Habegger
Michael Hamilton
Monica Hansen
Sean Harrington
Paula Heu
Gordon Hill
Markus Hoffmann
Trent Huang
Ashley Huff
Bill Huggins
Sergei Isakov
Justin Iveland
Cody Jones
Pavol Juhas
Marika Kieferova
Alexei Kitaev
Andrey Klots
Alexander Korotkov
Fedor Kostritsa
John Mark Kreikebaum
Dave Landhuis
Pavel Laptev
Kim Ming Lau
Lily Laws
Joonho Lee
Kenny Lee
Yuri Lensky
Alexander Lill
Wayne Liu
Salvatore Mandra
Orion Martin
Steven Martin
Seneca Meeks
Amanda Mieszala
Shirin Montazeri
Ramis Movassagh
Wojtek Mruczkiewicz
Ani Nersisyan
Michael Newman
JiunHow Ng
Murray Ich Nguyen
Tom O'Brien
Seun Omonije
Alex Opremcak
Rebecca Potter
Leonid Pryadko
David Rhodes
Charles Rocque
Negar Saei
Kannan Sankaragomathi
Henry Schurkus
Christopher Schuster
Mike Shearn
Aaron Shorter
Noah Shutty
Vladimir Shvarts
Vlad Sivak
Jindra Skruzny
Clarke Smith
Rolando Somma
George Sterling
Doug Strain
Marco Szalay
Doug Thor
Alfredo Torres
Guifre Vidal
Cheng Xing
Jamie Yao
Ping Yeh
Juhwan Yoo
Grayson Young
Yaxing Zhang
Ningfeng Zhu
Jeremy Hilton
Anthony Megrant
Yu Chen
Vadim Smelyanskiy
Vedika Khemani
Sarang Gopalakrishnan
Tomaž Prosen
Science, 384 (2024), pp. 48-53
Preview abstract
Understanding universal aspects of quantum dynamics is an unresolved problem in statistical mechanics. In particular, the spin dynamics of the one-dimensional Heisenberg model were conjectured as to belong to the Kardar-Parisi-Zhang (KPZ) universality class based on the scaling of the infinite-temperature spin-spin correlation function. In a chain of 46 superconducting qubits, we studied the probability distribution of the magnetization transferred across the chain’s center, P(M). The first two moments of P(M) show superdiffusive behavior, a hallmark of KPZ universality. However, the third and fourth moments ruled out the KPZ conjecture and allow for evaluating other theories. Our results highlight the importance of studying higher moments in determining dynamic universality classes and provide insights into universal behavior in quantum systems.
View details
The Task-oriented Queries Benchmark (ToQB)
arXiv:2406.02943 (2024)
Preview abstract
Task-oriented queries (e.g., one-shot queries to play videos, order food, or call a taxi) are crucial for assessing the quality of virtual assistants, chatbots, and other large language model (LLM)-based services. However, a standard benchmark for task-oriented queries is not yet available, as existing benchmarks in the relevant NLP (Natural Language Processing) fields have primarily focused on task-oriented dialogues. Thus, we present a new methodology for efficiently generating the Task-oriented Queries Benchmark (ToQB) using existing task-oriented dialogue datasets and an LLM service. Our methodology involves formulating the underlying NLP task to summarize the original intent of a speaker in each dialogue, detailing the key steps to perform the devised NLP task using an LLM service, and outlining a framework for automating a major part of the benchmark generation process. Through a case study encompassing three domains (i.e., two single-task domains and one multi-task domain), we demonstrate how to customize the LLM prompts (e.g., omitting system utterances or speaker labels) for those three domains and characterize the generated task-oriented queries. The generated ToQB dataset is made available to the public.We further discuss new domains that can be added to ToQB by community contributors and its practical applications.
View details
CodecLM: Aligning Language Models with Tailored Synthetic Data
Chun-Liang Li
Jin Miao
NAACL 2024
Preview abstract
Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor and time cost to collect or annotate data by humans, researchers start to explore the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLM to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities in different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the Encode-Decode principles, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution, and then decode metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. Extensive experiments on four open-domain instruction following benchmarks validate the effectiveness of CodecLM over the current state-of-the-arts.
View details
Taming Self-Training for Open-Vocabulary Object Detection
Shiyu Zhao
Samuel Schulter
Zhixing Zhang
Vijay Kumar B G
Yumin Suh
Manmohan Chandraker
Dimitris N. Metaxas
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
Preview abstract
Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively.
View details
Preview abstract
A lexicographic maximum of a set $X \subseteq R^n$ is a vector in $X$ whose smallest component is as large as possible, and subject to that requirement, whose second smallest component is as large as possible, and so on for the third smallest component, etc. Lexicographic maximization has numerous practical and theoretical applications, including fair resource allocation, analyzing the implicit regularization of learning algorithms, and characterizing refinements of game-theoretic equilibria. We prove that a minimizer in $X$ of the exponential loss function $L_c(x) = \sum_i \exp(-c x_i)$ converges to a lexicographic maximum of $X$ as $c \rightarrow \infty$, provided that $X$ is stable in the sense that a well-known iterative method for finding a lexicographic maximum of $X$ cannot be made to fail simply by reducing the required quality of each iterate by an arbitrarily tiny degree. Our result holds for both near and exact minimizers of the exponential loss, while earlier convergence results made much stronger assumptions about the set $X$ and only held for the exact minimizer. We are aware of no previous results showing a connection between the iterative method for computing a lexicographic maximum and exponential loss minimization. We show that every convex polytope is stable, but that there exist compact, convex sets that are not stable. We also provide the first analysis of the convergence rate of an exponential loss minimizer (near or exact) and discover a curious dichotomy: While the two smallest components of the vector converge to the lexicographically maximum values very quickly (at roughly the rate $(\log n)/c$), all other components can converge arbitrarily slowly.
View details
Preview abstract
Stereotypes are oversimplified beliefs and ideas about particular groups of people. These cognitive biases are omnipresent in our language, reflected in human-generated dataset and potentially learned and perpetuated by language technologies. Although mitigating stereotypes in language technologies is necessary for preventing harms, stereotypes can impose varying levels of risks for targeted individuals and social groups by appearing in various contexts. Technical challenges in detecting stereotypes are rooted in the societal nuances of stereotyping, making it impossible to capture all intertwined interactions of social groups in diverse cultural context in one generic benchmark. This paper delves into the nuances of detecting stereotypes in an annotation task with humans from various regions of the world. We iteratively disambiguate our definition of the task, refining it as detecting ``generalizing language'' and contribute a multilingual, annotated dataset consisting of sentences mentioning a wide range of social identities in 9 languages and labeled on whether they make broad statements and assumptions about those groups. We experiment with training generalizing language detection models, which provide insight about the linguistic context in which stereotypes can appear, facilitating future research in addressing the dynamic, social aspects of stereotypes.
View details