Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10129 publications
    Preview abstract Text-to-image diffusion models have demonstrated remarkable capabilities in transforming textual prompts into coherent images, yet the computational cost of their inference remains a persistent challenge. To address this issue, we present UFOGen, a novel generative model designed for ultra-fast, one-step text-to-image synthesis. In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models, UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models, UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step. Beyond traditional text-to-image generation, UFOGen showcases versatility in applications. Notably, UFOGen stands among the pioneering models enabling one-step text-to-image generation and diverse downstream tasks, presenting a significant advancement in the landscape of efficient generative models. View details
    ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout Transition
    Brian Moreno Collins
    Karthik Ramani
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 16 (to appear)
    Preview abstract Remote video conferencing systems (RVCS) are widely adopted in personal and professional communication. However, they often lack the co-presence experience of in-person meetings. This is largely due to the absence of intuitive visual cues and clear spatial relationships among remote participants, which can lead to speech interruptions and loss of attention. This paper presents ChatDirector, a novel RVCS that overcomes these limitations by incorporating space-aware visual presence and speech-aware attention transition assistance. ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene. We also contribute a decision tree algorithm that directs the avatar layouts and behaviors based on participants' speech states. We report on results from a user study (N=16) where we evaluated ChatDirector. The satisfactory algorithm performance and complimentary subject user feedback imply that ChatDirector significantly enhances communication efficacy and user engagement. View details
    Creative ML Assemblages: The Interactive Politics of People, Processes, and Products
    Ramya Malur Srinivasan
    Katharina Burgdorf
    Jennifer Lena
    ACM Conference on Computer Supported Cooperative Work and Social Computing (2024) (to appear)
    Preview abstract Creative ML tools are collaborative systems that afford artistic creativity through their myriad interactive relationships. We propose using ``assemblage thinking" to support analyses of creative ML by approaching it as a system in which the elements of people, organizations, culture, practices, and technology constantly influence each other. We model these interactions as ``coordinating elements" that give rise to the social and political characteristics of a particular creative ML context, and call attention to three dynamic elements of creative ML whose interactions provide unique context for the social impact a particular system as: people, creative processes, and products. As creative assemblages are highly contextual, we present these as analytical concepts that computing researchers can adapt to better understand the functioning of a particular system or phenomena and identify intervention points to foster desired change. This paper contributes to theorizing interactions with AI in the context of art, and how these interactions shape the production of algorithmic art. View details
    Preview abstract In this paper we study users' opinions about the privacy of their mobile health apps. We look at what they write in app reviews in the 'Health & Fitness' category on the Google Play store. We identified 2832 apps in this category (based on 1K minimum installs). Using NLP/LLM analyses, we find that 76% of these apps have at least some privacy reviews. In total this yields over 164,000 reviews about privacy, from over 150 countries and in 25 languages. Our analyses identifies top themes and offers an approximation of how widespread these issues are around the world. We show that the top 4 themes - Data Sharing and Exposure, Permission Requests, Location Tracking and Data Collection - are issues of concern in over 70 countries. Our automatically generated thematic summaries reveal interesting aspects that deserve further research around user suspicions (unneeded data collection), user requests (more fine-grained control over data collection and data access), as well as user behavior (uninstalling apps). View details
    Preview abstract Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines for the LLP Binary Classification problem on various dataset types - Small Tabular, Large Tabular and Images. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples. View details
    Preview abstract Effective model calibration is a critical and indispensable component in developing Media Mix Models (MMMs). One advantage of Bayesian-based MMMs lies in their capacity to accommodate the information from experiment results and the modelers' domain knowledge about the ad effectiveness by setting priors for the model parameters. However, it remains ambiguous about how and which Bayesian priors should be tuned for calibration purpose. In this paper, we propose a new calibration method through model reparameterization. The reparameterized model includes Return on Ads Spend (ROAS) as a model parameter, enabling straightforward adjustment of its prior distribution to align with either experiment results or the modeler's prior knowledge. The proposed method also helps address several key challenges regarding combining MMMs and incrementality experiments. We use simulations to demonstrate that our approach can significantly reduce the bias and uncertainty in the resultant posterior ROAS estimates. View details
    Preview abstract While most transliteration research is focused on single tokens such as named entities -- e.g., transliteration of "અમદાવાદ" from the Gujarati script to the Latin script "Ahmedabad" -- the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as via mono-script text collections. In this paper, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models finetuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 4.8% absolute (27.1% relative) mean word-error rate reduction. View details
    Visual Grounding for User Interfaces
    Yijun Qian
    Yujie Lu
    Alexander G. Hauptmann
    2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) - Industry Track
    Preview abstract Enabling autonomous language agents to drive application user interfaces (UIs) as humans do can significantly expand the capability of today's API-based agents. Essential to this vision is the ability of agents to ground natural language commands to on-screen UI elements. Prior UI grounding approaches work by relaying on developer-provided UI metadata (UI trees, such as web DOM, and accessibility labels) to detect on-screen elements. However, such metadata is often unavailable or incomplete. Object detection techniques applied to UI screens remove this dependency, by inferring location and types of UI elements directly from the UI's visual appearance. The extracted semantics, however, are too limited to directly enable grounding. We overcome the limitations of both approaches by introducing the task of visual UI grounding, which unifies detection and grounding. A model takes as input a UI screenshot and a free-form language expression, and must identify the referenced UI element. We propose a solution to this problem, LVG, which learns UI element detection and grounding using a new technique called layout-guided contrastive learning, where the semantics of individual UI objects are learned also from their visual organization. Due to the scarcity of UI datasets, LVG integrates synthetic data in its training using multi-context learning. LVG outperforms baselines pre-trained on much larger datasets by over 4.9 points in top-1 accuracy, thus demonstrating its effectiveness. View details
    Generalizing Tree-Level Sap Flow Across the European Continent
    Ralf Loritz
    Chen Huan Wu
    Daniel Klotz
    Martin Gauch
    Frederik Kratzert
    Maoya Bassiouni
    Geophysical Research Letters (2024)
    Preview abstract Sap flow offers key insights about transpiration dynamics and forest-climate interactions. Accurately simulating sap flow remains challenging due to measurement uncertainties and interactions between global and local environmental controls. Addressing these complexities, this study leveraged Long Short-Term Memory networks (LSTMs) with SAPFLUXNET to predict hourly tree-level sap flow across Europe. We built models with diverse training sets to assess performance under previously unseen conditions. The average Kling-Gupta Efficiency was 0.77 for models trained on 50% of time series across all forest stands, and 0.52 for models trained on 50% of the forest stands. Continental models not only matched but surpassed the performance of specialized and baselines for all genera and forest types, showcasing the capacity of LSTMs to effectively generalize across tree genera, climates, and forest ecosystems given minimal inputs. This study underscores the potential of LSTMs in generalizing state-dependent ecohydrological processes and bridging tree level measurements to continental scales. View details
    Dynamics of magnetization at infinite temperature in a Heisenberg spin chain
    Trond Andersen
    Rhine Samajdar
    Andre Petukhov
    Jesse Hoke
    Dmitry Abanin
    ILYA Drozdov
    Xiao Mi
    Alexis Morvan
    Charles Neill
    Rajeev Acharya
    Richard Ross Allen
    Kyle Anderson
    Markus Ansmann
    Frank Arute
    Kunal Arya
    Juan Atalaya
    Gina Bortoli
    Alexandre Bourassa
    Leon Brill
    Michael Broughton
    Bob Buckley
    Tim Burger
    Nicholas Bushnell
    Juan Campero
    Hung-Shen Chang
    Jimmy Chen
    Benjamin Chiaro
    Desmond Chik
    Josh Cogan
    Roberto Collins
    Paul Conner
    William Courtney
    Alex Crook
    Ben Curtin
    Agustin Di Paolo
    Andrew Dunsworth
    Clint Earle
    Lara Faoro
    Edward Farhi
    Reza Fatemi
    Vinicius Ferreira
    Ebrahim Forati
    Brooks Foxen
    Gonzalo Garcia
    Élie Genois
    William Giang
    Dar Gilboa
    Raja Gosula
    Alejo Grajales Dau
    Steve Habegger
    Michael Hamilton
    Monica Hansen
    Sean Harrington
    Paula Heu
    Gordon Hill
    Markus Hoffmann
    Trent Huang
    Ashley Huff
    Bill Huggins
    Sergei Isakov
    Justin Iveland
    Cody Jones
    Pavol Juhas
    Marika Kieferova
    Alexei Kitaev
    Andrey Klots
    Alexander Korotkov
    Fedor Kostritsa
    John Mark Kreikebaum
    Dave Landhuis
    Pavel Laptev
    Kim Ming Lau
    Lily Laws
    Joonho Lee
    Kenny Lee
    Yuri Lensky
    Alexander Lill
    Wayne Liu
    Salvatore Mandra
    Orion Martin
    Steven Martin
    Seneca Meeks
    Amanda Mieszala
    Shirin Montazeri
    Ramis Movassagh
    Wojtek Mruczkiewicz
    Ani Nersisyan
    Michael Newman
    JiunHow Ng
    Murray Ich Nguyen
    Tom O'Brien
    Seun Omonije
    Alex Opremcak
    Rebecca Potter
    Leonid Pryadko
    David Rhodes
    Charles Rocque
    Negar Saei
    Kannan Sankaragomathi
    Henry Schurkus
    Christopher Schuster
    Mike Shearn
    Aaron Shorter
    Noah Shutty
    Vladimir Shvarts
    Vlad Sivak
    Jindra Skruzny
    Clarke Smith
    Rolando Somma
    George Sterling
    Doug Strain
    Marco Szalay
    Doug Thor
    Alfredo Torres
    Guifre Vidal
    Cheng Xing
    Jamie Yao
    Ping Yeh
    Juhwan Yoo
    Grayson Young
    Yaxing Zhang
    Ningfeng Zhu
    Jeremy Hilton
    Anthony Megrant
    Yu Chen
    Vadim Smelyanskiy
    Vedika Khemani
    Sarang Gopalakrishnan
    Tomaž Prosen
    Science, 384 (2024), pp. 48-53
    Preview abstract Understanding universal aspects of quantum dynamics is an unresolved problem in statistical mechanics. In particular, the spin dynamics of the one-dimensional Heisenberg model were conjectured as to belong to the Kardar-Parisi-Zhang (KPZ) universality class based on the scaling of the infinite-temperature spin-spin correlation function. In a chain of 46 superconducting qubits, we studied the probability distribution of the magnetization transferred across the chain’s center, P(M). The first two moments of P(M) show superdiffusive behavior, a hallmark of KPZ universality. However, the third and fourth moments ruled out the KPZ conjecture and allow for evaluating other theories. Our results highlight the importance of studying higher moments in determining dynamic universality classes and provide insights into universal behavior in quantum systems. View details
    Preview abstract Task-oriented queries (e.g., one-shot queries to play videos, order food, or call a taxi) are crucial for assessing the quality of virtual assistants, chatbots, and other large language model (LLM)-based services. However, a standard benchmark for task-oriented queries is not yet available, as existing benchmarks in the relevant NLP (Natural Language Processing) fields have primarily focused on task-oriented dialogues. Thus, we present a new methodology for efficiently generating the Task-oriented Queries Benchmark (ToQB) using existing task-oriented dialogue datasets and an LLM service. Our methodology involves formulating the underlying NLP task to summarize the original intent of a speaker in each dialogue, detailing the key steps to perform the devised NLP task using an LLM service, and outlining a framework for automating a major part of the benchmark generation process. Through a case study encompassing three domains (i.e., two single-task domains and one multi-task domain), we demonstrate how to customize the LLM prompts (e.g., omitting system utterances or speaker labels) for those three domains and characterize the generated task-oriented queries. The generated ToQB dataset is made available to the public.We further discuss new domains that can be added to ToQB by community contributors and its practical applications. View details
    Preview abstract Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor and time cost to collect or annotate data by humans, researchers start to explore the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLM to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities in different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the Encode-Decode principles, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution, and then decode metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. Extensive experiments on four open-domain instruction following benchmarks validate the effectiveness of CodecLM over the current state-of-the-arts. View details
    Taming Self-Training for Open-Vocabulary Object Detection
    Shiyu Zhao
    Samuel Schulter
    Zhixing Zhang
    Vijay Kumar B G
    Yumin Suh
    Manmohan Chandraker
    Dimitris N. Metaxas
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
    Preview abstract Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively. View details
    Preview abstract A lexicographic maximum of a set $X \subseteq R^n$ is a vector in $X$ whose smallest component is as large as possible, and subject to that requirement, whose second smallest component is as large as possible, and so on for the third smallest component, etc. Lexicographic maximization has numerous practical and theoretical applications, including fair resource allocation, analyzing the implicit regularization of learning algorithms, and characterizing refinements of game-theoretic equilibria. We prove that a minimizer in $X$ of the exponential loss function $L_c(x) = \sum_i \exp(-c x_i)$ converges to a lexicographic maximum of $X$ as $c \rightarrow \infty$, provided that $X$ is stable in the sense that a well-known iterative method for finding a lexicographic maximum of $X$ cannot be made to fail simply by reducing the required quality of each iterate by an arbitrarily tiny degree. Our result holds for both near and exact minimizers of the exponential loss, while earlier convergence results made much stronger assumptions about the set $X$ and only held for the exact minimizer. We are aware of no previous results showing a connection between the iterative method for computing a lexicographic maximum and exponential loss minimization. We show that every convex polytope is stable, but that there exist compact, convex sets that are not stable. We also provide the first analysis of the convergence rate of an exponential loss minimizer (near or exact) and discover a curious dichotomy: While the two smallest components of the vector converge to the lexicographically maximum values very quickly (at roughly the rate $(\log n)/c$), all other components can converge arbitrarily slowly. View details
    Preview abstract Stereotypes are oversimplified beliefs and ideas about particular groups of people. These cognitive biases are omnipresent in our language, reflected in human-generated dataset and potentially learned and perpetuated by language technologies. Although mitigating stereotypes in language technologies is necessary for preventing harms, stereotypes can impose varying levels of risks for targeted individuals and social groups by appearing in various contexts. Technical challenges in detecting stereotypes are rooted in the societal nuances of stereotyping, making it impossible to capture all intertwined interactions of social groups in diverse cultural context in one generic benchmark. This paper delves into the nuances of detecting stereotypes in an annotation task with humans from various regions of the world. We iteratively disambiguate our definition of the task, refining it as detecting ``generalizing language'' and contribute a multilingual, annotated dataset consisting of sentences mentioning a wide range of social identities in 9 languages and labeled on whether they make broad statements and assumptions about those groups. We experiment with training generalizing language detection models, which provide insight about the linguistic context in which stereotypes can appear, facilitating future research in addressing the dynamic, social aspects of stereotypes. View details