Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10132 publications
    Preview abstract Misgendering refers to the act of incorrectly identifying or addressing someone's gender. While misgendering is both a factual inaccuracy and a toxic act of identity erasure, research on fact-checking and toxicity detection does not address it. We are the first to bridge this gap by introducing a dataset, \dataset, to assist in developing interventions for misgendering. The misgendering interventions task can be divided into two sub-tasks: (i) detecting misgendering, followed by (ii) editing misgendering where misgendering is present, in domains where editing is appropriate. We introduce a dataset containing a total of 3806 instances of tweets, YouTube comments, and LLM-generated text about 30 non-cisgender individuals annotated for whether they contain misgendering or not. LLM-generated text is also annotated for edits required to fix misgendering. Using this dataset, we set initial benchmarks by evaluating existing NLP systems and highlight challenges for future models to address. Additionally, we conducted a survey of non-cisgender individuals in the US to understand opinions about automated interventions for text-based misgendering. We find interest for interventions along with concerns for potential harm. View details
    Preview abstract We present a new task and dataset, ScreenQA, for screen content understanding via question answering. The existing screen datasets are focused either on structure and component-level understanding, or on a much higher-level composite task such as navigation and task completion. We attempt to bridge the gap between these two by annotating 86K question-answer pairs over the RICO dataset in hope to benchmark the screen reading comprehension capacity. View details
    Embedding-Aligned Language Models
    Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS-24), Vancouver (2024)
    Preview abstract We propose a novel approach for training large language models (LLMs) to adhere to objectives imposed by a latent embedding space. Our method leverages reinforcement learning (RL), treating a pre-trained LLM as an environment. An Embedding-Aligned Guided LanguagE (EAGLE) agent it trained using a significantly smaller language model to iteratively stir the LLM's generation towards optimal regions of a latent embedding space, given some predefined criteria. We demonstrate the effectiveness of the EAGLE agent using the MovieLens 25M dataset, on extrapolation tasks for content gap to satisfy latent user demand, and multi-attribute satisfaction for generating creative variations of entities. Our work paves the way for controlled and grounded text generation using LLMs, ensuring consistency with domain-specific knowledge and data representations. View details
    ChatDirector: Enhancing Video Conferencing with Space-Aware Scene Rendering and Speech-Driven Layout Transition
    Brian Moreno Collins
    Karthik Ramani
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 16 (to appear)
    Preview abstract Remote video conferencing systems (RVCS) are widely adopted in personal and professional communication. However, they often lack the co-presence experience of in-person meetings. This is largely due to the absence of intuitive visual cues and clear spatial relationships among remote participants, which can lead to speech interruptions and loss of attention. This paper presents ChatDirector, a novel RVCS that overcomes these limitations by incorporating space-aware visual presence and speech-aware attention transition assistance. ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene. We also contribute a decision tree algorithm that directs the avatar layouts and behaviors based on participants' speech states. We report on results from a user study (N=16) where we evaluated ChatDirector. The satisfactory algorithm performance and complimentary subject user feedback imply that ChatDirector significantly enhances communication efficacy and user engagement. View details
    Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the United States and Japan
    Atilla Kiraly
    Corbin Cunningham
    Ryan Najafi
    Jie Yang
    Chuck Lau
    Diego Ardila
    Scott Mayer McKinney
    Rory Pilgrim
    Mozziyar Etemadi
    Sunny Jansen
    Lily Peng
    Shravya Shetty
    Neeral Beladia
    Krish Eswaran
    Radiology: Artificial Intelligence (2024)
    Preview abstract Lung cancer is the leading cause of cancer death world-wide with 1.8 million deaths in 20201. Studies have concluded that low-dose computed tomography lung cancer screening can reduce mortality by up to 61%2 and updated 2021 US guidelines expanded eligibility. As screening efforts rise, AI can play an important role, but must be unobtrusively integrated into existing clinical workflows. In this work, we introduce a state-of-the-art, cloud-based AI system providing lung cancer risk assessments without requiring any user input. We demonstrate its efficacy in assisting lung cancer screening under both US and Japanese screening settings using different patient populations and screening protocols. Technical improvements over a previously described system include a focus on earlier cancer detection for improved accuracy, introduction of an effective assistive user interface, and a system designed to integrate into typical clinical workflows. The stand-alone AI system was evaluated on 3085 individuals achieving area under the curve (AUC) scores of 91.7% (95%CI [89.6, 95.2]), 93.3% (95%CI [90.2, 95.7]), and 89.1% (95%CI [77.7, 97.3]) on three datasets (two from US and one from Japan), respectively. To evaluate the system’s assistive ability, we conducted two retrospective multi-reader multi-case studies on 627 cases read by experienced board certified radiologists (average 20 years of experience [7,40]) using local PACS systems in the respective US and Japanese screening settings. The studies measured the reader’s level of suspicion (LoS) and categorical responses for scores and management recommendations under country-specific screening protocols. The radiologists’ AUC for LoS increased with AI assistance by 2.3% (95%CI [0.1-4.5], p=0.022) for the US study and by 2.3% (95%CI [-3.5-8.1], p=0.179) for the Japan study. Specificity for recalls increased by 5.5% (95%CI [2.7-8.5], p<0.0001) for the US and 6.7% (95%CI [4.7-8.7], p<0.0001) for the Japan study. No significant reduction in other metrics occured. This work advances the state-of-the-art in lung cancer detection, introduces generalizable interface concepts that can be applicable to similar AI applications, and demonstrates its potential impact on diagnostic AI in global lung cancer screening with results suggesting a substantial drop in unnecessary follow-up procedures without impacting sensitivity. View details
    Preview abstract Use of Text-to-Image models is expanding beyond generating generic objects, as they are increasingly being adopted by diverse global communities to create visual representations of their unique culture. Current T2I benchmarks primarily evaluate image-text alignment, aesthetics and fidelity of generations for complex prompts with generic objects, overlooking the critical dimension of cultural understanding. In this work, we address this gap by defining a framework to evaluate cultural competence of T2I models, and present a scalable approach to collect cultural artifacts unique to a particular culture from Knowledge Graphs and Large Language Models in tandem. We assess the ability of state-of-the-art T2I models to generate culturally faithful and realistic images across 8 countries and 3 cultural domains. Furthermore, we emphasize the importance of T2I models reflecting a culture's diversity and introduce cultural diversity as a novel metric for T2I evaluation, drawing inspiration from the Vendi Score. We introduce T2I-GCube, a first-of-its-kind benchmark for T2I evaluation. T2I-GCube includes cultural prompts, metrics, and cultural concept spaces, enabling comprehensive assessment of T2I models' cultural knowledge and diversity. Our evaluations reveal significant gaps in the cultural knowledge of existing models and provide valuable insights into the diversity of image outputs for under-specified prompts. By introducing a novel approach to evaluating cultural diversity and knowledge in T2I models, T2I-GCube will be instrumental in fostering the development of models with enhanced cultural competence. View details
    Shorts vs. Regular Videos on YouTube: A Comparative Analysis of User Engagement and Content Creation Trends
    Caroline Violot
    Tugrulcan Elmais
    Mathias Humbert
    ACM Web Science Conference 2024 (WEBSCI24) (2024)
    Preview abstract YouTube introduced the Shorts video format in 2021, allowing users to upload short videos that are prominently displayed on its website and app. Despite having such a large visual footprint, there are no studies to date that have looked at the impact Shorts introduction had on the production and consumption of content on YouTube. This paper presents the first comparative analysis of YouTube Shorts versus regular videos with respect to user engagement (i.e., views, likes, and comments), content creation frequency and video categories. We collected a dataset containing information about 70k channels that posted at least one Short, and we analyzed the metadata of all the videos (9.9M Shorts and 6.9M regular videos) they uploaded between January 2021 and December 2022, spanning a two-year period including the introduction of Shorts. Our longitudinal analysis shows that content creators consistently increased the frequency of Shorts production over this period, especially for newly-created channels, which surpassed that of regular videos. We also observe that Shorts target mostly entertainment categories, while regular videos cover a wide variety of categories. In general, Shorts attract more views and likes per view than regular videos, but attract less comments per view. However, Shorts do not outperform regular videos in the education and political categories as much as they do in other categories. Our study contributes to understanding social media dynamics, to quantifying the spread of short-form content, and to motivating future research on its impact on society. View details
    Prompt-Based Label-Aware Framework for Few-Shot Multi-Label Text Classification
    Thanakorn Thaminkaew
    Peerapon Vateekul
    IEEE Access, 12 (2024), pp. 28310-28322
    Preview abstract Prompt-based learning has demonstrated remarkable success in few-shot text classification, outperforming the traditional fine-tuning approach. This method transforms a text input into a masked language modeling prompt using a template, queries a fine-tuned language model to fill in the mask, and then uses a verbalizer to map the model’s output to a predicted class. Previous prompt-based text classification approaches were primarily designed for multi-class classification, taking advantage of the fact that the classes are mutually exclusive and one example belongs to only one class. However, these assumptions do not hold in the context of multi-label text classification, where labels often exhibit correlations with each other. Therefore, we propose a Prompt-based Label-Aware framework for Multi-Label text classification (PLAML) that addresses the challenges. Specifically, PLAML enhances prompt-based learning with three proposed techniques to improve the overall performance for multi-label classification. The techniques include (i) a token weighting algorithm that considers the correlations between labels, (ii) a template for augmenting training samples, making the training process label-aware, and (iii) a dynamic threshold mechanism, refining the prediction condition of each label. Extensive experiments on few-shot text classification across multiple datasets with various languages show that our PLAML outperforms other baseline methods. We also analyzed the effect of each proposed technique to better understand how it is suitable for the multi-label setting. View details
    Preview abstract Detecting offensive content in text is an increasingly central challenge for both social-media platforms and AI-driven technologies. However offensiveness remains a subjective phenomenon as perspectives differ across sociodemographic characteristics, as well as cultural norms and moral values. This intricacy is largely ignored in the current AI-focused approaches for detecting offensiveness or related concepts such as hate speech and toxicity detection. We frame the task of determining offensiveness as essentially a matter of moral judgment --- deciding the boundaries of ethically wrong vs. right language to be used or generated within an implied set of sociocultural norms. In this paper, we investigate how judgment of offensiveness varies across diverse global cultural regions, and the crucial role of moral values in shaping these variations. Our findings highlight substantial cross-cultural differences in perceiving offensiveness, with moral concerns about Caring and Purity as the mediating factor driving these differences. These insights are of importance as AI safety protocols, shaped by human annotators' inputs and perspectives, embed their moral values which do not align with the notions of right and wrong in all contexts, and for all individuals. View details
    Human I/O: Towards Comprehensive Detection of Situational Impairments in Everyday Activities
    Xingyu Bruce Liu
    Jiahao Nick Li
    Xiang 'Anthony' Chen
    Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, ACM, pp. 18
    Preview abstract Situationally Induced Impairments and Disabilities (SIIDs) can significantly hinder user experience in everyday activities. Despite their prevalence, existing adaptive systems predominantly cater to specific tasks or environments and fail to accommodate the diverse and dynamic nature of SIIDs. We introduce Human I/O, a real-time system that detects SIIDs by gauging the availability of human input/output channels. Leveraging egocentric vision, multimodal sensing and reasoning with large language models, Human I/O achieves good performance in availability prediction across 60 in-the-wild egocentric videos in 32 different scenarios. Further, while the core focus of our work is on the detection of SIIDs rather than the creation of adaptive user interfaces, we showcase the utility of our prototype via a user study with 10 participants. Findings suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs, paving the way for more adaptive and accessible interactive systems in the future. View details
    Practical Performance Guarantees for Pipelined DNN Inference
    Kuikui Liu
    Proceedings of the 41st International Conference on Machine Learning (2024), pp. 1655-1671
    Preview abstract This work optimizes pipeline parallelism of machine learning model inference by partitioning computation graphs into $k$ stages and minimizing the running time of the bottleneck stage. We design practical algorithms for this NP-complete problem and prove they are nearly optimal in practice by comparing against lower bounds obtained from solving novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound techniques to production models to achieve substantial improvements in the approximation guarantees, compared to simple combinatorial lower bounds. For example, our new MIP formulations improve the lower bounds enough to drop the geometric mean approximation ratio from $2.175$ to $1.082$ across production data with $k=16$ pipeline stages. This work shows that while bottleneck partitioning is theoretically hard, in practice we have a handle on the algorithmic side of the problem and much of the remaining challenge is in developing more accurate cost models to give to the partitioning algorithms. View details
    Preview abstract Recent developments in large language models (LLMs) have shown promise in their ability to generate synthetic query-document pairs by prompting LLMs with as few as 8 demonstrations \cite{dai2022promptagator}. This has enabled building better IR models especially for tasks which have no training data readily available. Typically, such synthetic query generation (QGen) approaches condition on an input context (e.g. document) and generate a query that is relevant to that context or condition the QGen model additionally on the relevance label (e.g. relevant vs irrelevant) to generate queries across relevance buckets. However, we find that such QGen approaches are sub-optimal as it requires the model to reason about the desired label and the input from only a handful of examples, which is not trivial, especially when the relevance buckets are nuanced. In this work, we propose to reduce this burden of LLMs by generating queries simultaneously for different labels (e.g. relevance buckets). We hypothesize that instead of asking the model to generate, say, an irrelevant query given an input context, asking the model to generate an irrelevant query with respect to a relevant query is a much simpler task setup for the model to reason about. Extensive experimentation across seven IR datasets shows that synthetic queries generated in such a fashion translates to a better downstream performance, suggesting that the generated queries are indeed of higher quality. View details
    Connecting Language Technologies with Rich, Diverse Data Sources Covering Thousands of Languages
    Sebastian Ruder
    Julia Kreutzer
    Clara Rivera
    Ishank Saxena
    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
    Preview abstract Contrary to common belief, there are rich and diverse data sources available for many thousands of languages, which can be used to develop technologies for these languages. In this paper, we provide an overview of some of the major online data sources, the types of data that they provide access to, potential applications of this data, and the number of languages that they cover. Even this covers only a small fraction of the data that exists; for example, printed books are published in many languages but few online aggregators exist. View details
    Optimizing quantum gates towards the scale of logical qubits
    Alexandre Bourassa
    Andrew Dunsworth
    Will Livingston
    Vlad Sivak
    Trond Andersen
    Yaxing Zhang
    Desmond Chik
    Jimmy Chen
    Charles Neill
    Alejo Grajales Dau
    Anthony Megrant
    Alexander Korotkov
    Vadim Smelyanskiy
    Yu Chen
    Nature Communications, 15 (2024), pp. 2442
    Preview abstract A foundational assumption of quantum error correction theory is that quantum gates can be scaled to large processors without exceeding the error-threshold for fault tolerance. Two major challenges that could become fundamental roadblocks are manufacturing high-performance quantum hardware and engineering a control system that can reach its performance limits. The control challenge of scaling quantum gates from small to large processors without degrading performance often maps to non-convex, high-constraint, and time-dynamic control optimization over an exponentially expanding configuration space. Here we report on a control optimization strategy that can scalably overcome the complexity of such problems. We demonstrate it by choreographing the frequency trajectories of 68 frequency-tunable superconducting qubits to execute single- and two-qubit gates while mitigating computational errors. When combined with a comprehensive model of physical errors across our processor, the strategy suppresses physical error rates by ~3.7× compared with the case of no optimization. Furthermore, it is projected to achieve a similar performance advantage on a distance-23 surface code logical qubit with 1057 physical qubits. Our control optimization strategy solves a generic scaling challenge in a way that can be adapted to a variety of quantum operations, algorithms, and computing architectures. View details
    General Geospatial Inference with a Population Dynamics Foundation Model
    Chaitanya Kamath
    Prithul Sarker
    Joydeep Paul
    Yael Mayer
    Sheila de Guia
    Jamie McPike
    Adam Boulanger
    David Schottlander
    Yao Xiao
    Manjit Chakravarthy Manukonda
    Monica Bharel
    Von Nguyen
    Luke Barrington
    Niv Efron
    Krish Eswaran
    Shravya Shetty
    (2024) (to appear)
    Preview abstract Supporting the health and well-being of dynamic populations around the world requires governmental agencies, organizations, and researchers to understand and reason over complex relationships between human behavior and local contexts. This support includes identifying populations at elevated risk and gauging where to target limited aid resources. Traditional approaches to these classes of problems often entail developing manually curated, task-specific features and models to represent human behavior and the natural and built environment, which can be challenging to adapt to new, or even related tasks. To address this, we introduce the Population Dynamics Foundation Model (PDFM), which aims to capture the relationships between diverse data modalities and is applicable to a broad range of geospatial tasks. We first construct a geo-indexed dataset for postal codes and counties across the United States, capturing rich aggregated information on human behavior from maps, busyness, and aggregated search trends, and environmental factors such as weather and air quality. We then model this data and the complex relationships between locations using a graph neural network, producing embeddings that can be adapted to a wide range of downstream tasks using relatively simple models. We evaluate the effectiveness of our approach by benchmarking it on 27 downstream tasks spanning three distinct domains: health indicators, socioeconomic factors, and environmental measurements. The approach achieves state-of-the-art performance on geospatial interpolation across all tasks, surpassing existing satellite and geotagged image based location encoders. In addition, it achieves state-of-the-art performance in extrapolation and super-resolution for 25 of the 27 tasks. We also show that the PDFM can be combined with a state-of-the-art forecasting foundation model, TimesFM, to predict unemployment and poverty, achieving performance that surpasses fully supervised forecasting. The full set of embeddings and sample code are publicly available for researchers. In conclusion, we have demonstrated a general purpose approach to geospatial modeling tasks critical to understanding population dynamics by leveraging a rich set of complementary globally available datasets that can be readily adapted to previously unseen machine learning tasks. View details