Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

people standing in front of a screen with images and a chipboard

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
1 - 15 of 10128 publications
    Preview abstract We propose OmniNOCS, a large-scale monocular dataset with 3D Normalized Object Coordinate Space (NOCS) maps, object masks, and 3D bounding box annotations for indoor and outdoor scenes. OmniNOCS has 20 times more object classes and 200 times more instances than existing NOCS datasets (NOCS-Real275, Wild6D). We use OmniNOCS to train a novel, transformer-based monocular NOCS prediction model (NOCSformer) that can predict accurate NOCS, instance masks and poses from 2D object detections across diverse classes. It is the first NOCS model that can generalize to a broad range of classes when prompted with 2D boxes. We evaluate our model on the task of 3D oriented bounding box prediction, where it achieves comparable results to state-of-the-art 3D detection methods such as Cube R-CNN. Unlike other 3D detection methods, our model also provides detailed and accurate 3D object shape and segmentation. We propose a novel benchmark for the task of NOCS prediction based on OmniNOCS, which we hope will serve as a useful baseline for future work in this area. Our dataset and code is available at the project website: https://omninocs.github.io View details
    Generative Powers of Ten
    Xiaojuan Wang
    Steve Seitz
    Ben Mildenhall
    Pratul Srinivasan
    Dor Verbin
    Aleksander Hołyński
    Preview abstract We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. This representation allows us to render continuously zooming videos, or explore different scales of the scene interactively. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content. View details
    Individual Welfare Guarantees in the Autobidding World with Machine-learned Advice
    Negin Golrezaei
    Patrick Jaillet
    Jason Cheuk Nam Liang
    Proceedings of the ACM on Web Conference 2024, 267–275
    Preview abstract Online advertising channels commonly focus on maximizing total advertiser welfare to enhance channel health, and previous literature has studied augmenting ad auctions with machine learning predictions on advertiser values (also known asmachine-learned advice ) to improve total welfare. Yet, such improvements could come at the cost of individual bidders' welfare and do not shed light on how particular advertiser bidding strategies impact welfare. Motivated by this, we present an analysis on an individual bidder's welfare loss in the autobidding world for auctions with and without machine-learned advice, and also uncover how advertiser strategies relate to such losses. In particular, we demonstrate how ad platforms can utilize ML advice to improve welfare guarantee on the aggregate and individual bidder level by setting ML advice as personalized reserve prices when the platform consists ofautobidders who maximize value while respecting a return on ad spend (ROAS) constraint. Under parallel VCG auctions with such ML advice-based reserves, we present a worst-case welfare lower-bound guarantee for an individual autobidder, and show that the lower-bound guarantee is positively correlated with ML advice quality as well as the scale of bids induced by the autobidder's bidding strategies. Further, we show that no truthful, and possibly randomized mechanism with anonymous allocations can achieve universally better individual welfare guarantees than VCG, in the presence of personalized reserves based on ML-advice of equal quality. Moreover, we extend our individual welfare guarantee results to generalized first price (GFP) and generalized second price (GSP) auctions. Finally, we present numerical studies using semi-synthetic data derived from ad auction logs of a search ad platform to showcase improvements in individual welfare when setting personalized reserve prices with ML-advice. View details
    Multimodal Web Navigation with Instruction-Finetuned Foundation Models
    Hiroki Furuta
    Ofir Nachum
    Yutaka Matsuo
    Shane Gu
    Izzeddin Gur
    International Conference on Learning Representations (ICLR) (2024)
    Preview abstract The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages and outputs web navigation actions, such as click and type. WebGUM is trained by jointly finetuning an instruction-finetuned language model and a vision encoder with temporal and local perception on a large corpus of demonstrations. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning, outperforming prior works by a significant margin. On the MiniWoB, we improve over the previous best offline methods by more than 45.8%, even outperforming online-finetuned SoTA, humans, and GPT-4-based agent. On the WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing SoTA, PaLM-540B. Furthermore, WebGUM exhibits strong positive transfer to the real-world planning tasks on the Mind2Web. We also collect 347K high-quality demonstrations using our trained models, 38 times larger than prior work, and make them available to promote future research in this direction. View details
    On the Robustness of Image-based Malware Detection against Adversarial Attacks
    Yassine Mekdad
    Harun Oz
    Ahmet Aris
    Leonardo Babun
    Faraz Naseem
    Selcuk Uluagac
    Nasir Ghani
    Abbas Acar
    Network Security Empowered by Artificial Intelligence, Springer (2024)
    Preview abstract Machine and deep learning models are now one of the most valuable tools in the arsenal of computer security practitioners. Their success has been demonstrated in various network-security-oriented applications such as intrusion detection, cyber threat intelligence, vulnerability discovery, and malware detection. Nevertheless, recent research studies have shown that crafted adversarial samples can be used to evade malware detection models. Even though several defense mechanisms such as adversarial training have been proposed in the malware detection domain to address this issue, they unfortunately suffer from model poisoning and low detection accuracy. In this chapter, we assess the robustness of image-based malware classifier against four different adversarial attacks: (a) random and benign brute-force byte append attacks for black-box settings and (b) random and benign Fast Gradient Sign Method (FGSM) attacks for white-box settings. To this end, we implement a Convolutional Neural Network (CNN) to classify the image representations of Windows Portable Executable (PE) malware with a detection accuracy of 95.05%. Then, we evaluate its robustness along with MalConv, a state-of-the-art malware classifier, by applying a set of functionality-preserving adversarial attacks. Our experimental results demonstrate that image-based classifier exhibits a lower evasion rate of 5% compared to MalConv that achieves an evasion rate ranging between 44 and 54% in black-box settings. However, in white-box settings, both models fail against random byte and benign byte FGSM attacks, with an evasion rate of more than 46%. View details
    Preview abstract We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained endto-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text pairs, enabling a ‘cross-modal’ chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM as demonstrated through spoken QA datasets. We release our audio samples and spoken QA dataset via our website. View details
    Solidarity not Charity! Empowering Local Communities for Disaster Relief during COVID-19 through Grassroots Support
    Jeongwon Jo
    Oluwafunke Alliyu
    John M. Carroll
    Computer Supported Cooperative Work (2024) (2024)
    Preview abstract The COVID-19 pandemic brought wide-ranging, unanticipated societal changes as communities rushed to slow the spread of the novel coronavirus. In response, mutual aid groups bloomed online across the United States to fill in the gaps in social services and help local communities cope with infrastructural breakdowns. Unlike many previous disasters, the long-haul nature of COVID-19 necessitates sustained disaster relief efforts. In this paper, we conducted an interview study with online mutual aid group administrators to understand how groups facilitated disaster relief, and how disaster relief initiatives developed and maintained over the course of the first year of COVID-19. Our findings suggest that the groups were crucial sources of community-based support for immediate needs, innovated long-term solutions for chronic community issues and grew into a vehicle for justice-centered work. Our insights shed light on the strength of mutual aid as a community capacity that can support communities to collectively be more prepared for future long-haul disasters than they were with COVID-19. View details
    Preview abstract Quantum computing's transition from theory to reality has spurred the need for novel software tools to manage the increasing complexity, sophistication, toil, and chance for error of quantum algorithm development. We present Qualtran, an open-source library for representing and analyzing quantum algorithms. Using carefully chosen abstractions and data structures, we can simulate and test algorithms, automatically generate information-rich diagrams, and tabulate resource requirements. Qualtran offers a \emph{standard library} of algorithmic building blocks that are essential for modern cost-minimizing compilations. Its capabilities are showcased through the re-analysis of key algorithms in Hamiltonian simulation, chemistry, and cryptography. The resulting architecture-independent resource counts can be forwarded to our implementation of cost models to estimate physical costs like wall-clock time and number of physical qubits assuming a surface-code architecture. Qualtran provides a foundation for explicit constructions and reproducible analysis, fostering greater collaboration within the quantum algorithm development community. We believe tools like Qualtran will accelerate progress in the field. View details
    SoothSayer: Bypassing DSAC Mitigation by Predicting Counter Replacement
    Salman Qazi
    Fourth Workshop on DRAM Security (DRAMSec) (2024)
    Preview abstract In-DRAM Stochastic and Approximate Counting (DSAC) is a recently published algorithm that aims to mitigate Rowhammer at low cost. Existing in-DRAM counter-based schemes keep track of row activations and issue Targeted Row Refresh (TRR) upon detecting a concerning pattern. However, due to insufficiency of the tracking ability they are vulnerable to attacks utilizing decoy rows. DSAC claims to improve upon existing TRR mitigation by filtering out decoy-row accesses, so they cannot saturate the limited number of counters available for detecting Rowhammer, promising a reliable mitigation without the area cost of deterministic and provable schemes such as per-row activation counting (PRAC). In this paper, we analyze DSAC and discover some gaps that make it vulnerable to Rowhammer and Rowpress attacks. The main focus of this work is a novel attack named SoothSayer that targets the counter replacement policy in DSAC by cloning the random number generator. We describe and simulate this attack, and establish its efficacy. Finally, we discuss other weaknesses in DSAC. View details
    Preview abstract As AI systems quickly improve in both breadth and depth of performance, they lend themselves to creating increasingly powerful and realistic agents, including the possibility of agents modeled on specific people. We anticipate that within our lifetimes it may become common practice for people to create a custom AI agent to interact with loved ones and/or the broader world after death. We call these generative ghosts, since such agents will be capable of generating novel content rather than merely parroting content produced by their creator while living. In this paper, we first discuss the design space of potential implementations of generative ghosts. We then discuss the practical and ethical implications of generative ghosts, including potential positive and negative impacts on individuals and society. Based on these considerations, we lay out a research agenda for the AI and HCI research communities to empower people to create and interact with AI afterlives in a safe and beneficial manner. View details
    Fashion-VDM: Video Diffusion Model for Virtual Try-On
    Johanna Karras
    Yingwei Li
    Luyang Zhu
    Innfarn Yoo
    Andreas Lugmayr
    Chris Lee
    Fashion-VDM: Video Diffusion Model for Virtual Try-On (2024) (to appear)
    Preview abstract We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: https://johannakarras.github.io/Fashion-VDM/ View details
    Neural general circulation models for weather and climate
    Dmitrii Kochkov
    Janni Yuval
    Jamie Smith
    Griffin Mooers
    Milan Kloewer
    James Lottes
    Peter Dueben
    Samuel Hatfield
    Peter Battaglia
    Alvaro Sanchez
    Matthew Willson
    Nature, 632 (2024), pp. 1060-1066
    Preview abstract General circulation models (GCMs) are the foundation of weather and climate prediction. GCMs are physics-based simulators that combine a numerical solver for large-scale dynamics with tuned representations for small-scale processes such as cloud formation. Recently, machine-learning models trained on reanalysis data have achieved comparable or better skill than GCMs for deterministic weather forecasting. However, these models have not demonstrated improved ensemble forecasts, or shown sufficient stability for long-term weather and climate simulations. Here we present a GCM that combines a differentiable solver for atmospheric dynamics with machine-learning components and show that it can generate forecasts of deterministic weather, ensemble weather and climate on par with the best machine-learning and physics-based methods. NeuralGCM is competitive with machine-learning models for one- to ten-day forecasts, and with the European Centre for Medium-Range Weather Forecasts ensemble prediction for one- to fifteen-day forecasts. With prescribed sea surface temperature, NeuralGCM can accurately track climate metrics for multiple decades, and climate forecasts with 140-kilometre resolution show emergent phenomena such as realistic frequency and trajectories of tropical cyclones. For both weather and climate, our approach offers orders of magnitude computational savings over conventional GCMs, although our model does not extrapolate to substantially different future climates. Our results show that end-to-end deep learning is compatible with tasks performed by conventional GCMs and can enhance the large-scale physical simulations that are essential for understanding and predicting the Earth system. View details
    Preview abstract In this article, we study the evolution of Android permissions. We describe the rationale behind key changes in Android’s permission model and disclose two permission-related security vulnerabilities we discovered. Finally, we provide developers actionable insights to proactively address permission-related security and privacy risks during development. View details
    UINav: A Practical Approach to Train On-Device Automation Agents
    Wei Li
    Fu-Lin Hsu
    Will Bishop
    Folawiyo Campbell-Ajala
    2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024) - Industry Track
    Preview abstract Automation systems that can autonomously drive application user interfaces to complete user tasks are of great benefit, especially when users are situationally or permanently impaired. Prior automation systems do not produce generalizable models while AI-based automation agents work reliably only in simple, hand-crafted applications or incur high computation costs. We propose UINav, a demonstration-based approach to train automation agents that fit mobile devices, yet achieving high success rates with modest numbers of demonstrations. To reduce the demonstration overhead, UINav, uses a referee model that provides users with immediate feedback on tasks where the agent fails, and automatically augments human demonstrations to increase diversity in training data. Our evaluation shows that with only 10 demonstrations UINav, can achieve 70% accuracy, and that with enough demonstrations it can surpass 90% accuracy. View details
    Preview abstract Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines for the LLP Binary Classification problem on various dataset types - Small Tabular, Large Tabular and Images. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples. View details