Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10464 publications
    Procurement Auctions via Approximate Submodular Optimization
    Amin Karbasi
    Grigoris Velegkas
    Forty-second International Conference on Machine Learning (2025)
    Preview abstract We study the problem of procurement auctions, in which an auctioneer seeks to acquire services from a group of strategic sellers with private costs. The quality of the services is measured through some submodular function that is known to the auctioneer. Our goal is to design computationally efficient procurement auctions that (approximately) maximize the difference between the quality of the acquired services and the total cost of the sellers, in a way that is incentive compatible (IC) and individually rational (IR) for the sellers, and generates non-negative surplus (NAS) for the auctioneer. Leveraging recent results from the literature of non-positive submodular function maximization, we design computationally efficient frameworks that transform submodular function optimization algorithms to mechanisms that are IC and IR for the sellers, NAS for the auctioneer, and approximation-preserving. Our frameworks are general and work both in the offline setting where the auctioneer can observe the bids and the services of all the sellers simultaneously, and in the online setting where the sellers arrive in an adversarial order and the auctioneer has to make an irrevocable decision whether to purchase their service or not. We further investigate whether it is possible to convert state-of-the-art submodular optimization algorithms into a descending auction. We focus on the adversarial setting, meaning that the schedule of the descending prices is determined by an adversary. We show that a submodular optimization algorithm satisfying bi-criteria $(\alpha, 1)$-approximation in welfare can be effectively converted to a descending auction in the adversarial setting if and only if $\alpha \leq \frac{1}{2}$. Our result highlights the importance of a carefully designed schedule of descending prices to effectively convert a submodular optimization algorithm satisfying bi-criteria $(\alpha, 1)$-approximation in welfare with $\alpha > \frac{1}{2}$ to a descending auction. We further establish a connection between descending auctions and online submodular optimization algorithms. We demonstrate the practical applications of our frameworks by instantiating them with different state-of-the-art submodular optimization algorithms and comparing their welfare performance through empirical experiments on publicly available datasets that consist of thousands of sellers. View details
    Preview abstract We revisit the fundamental question of formally defining what constitutes a reconstruction attack. While often clear from the context, our exploration reveals that a precise definition is much more nuanced than it appears, to the extent that a single all-encompassing definition may not exist. Thus, we employ a different strategy and aim to "sandwich" the concept of reconstruction attacks by addressing two complementing questions: (i) What conditions guarantee that a given system is protected against such attacks? (ii) Under what circumstances does a given attack clearly indicate that a system is not protected? More specifically, * We introduce a new definitional paradigm -- Narcissus Resiliency -- to formulate a security definition for protection against reconstruction attacks. This paradigm has a self-referential nature that enables it to circumvent shortcomings of previously studied notions of security. Furthermore, as a side-effect, we demonstrate that Narcissus resiliency captures as special cases multiple well-studied concepts including differential privacy and other security notions of one-way functions and encryption schemes. * We formulate a link between reconstruction attacks and Kolmogorov complexity. This allows us to put forward a criterion for evaluating when such attacks are convincingly successful. View details
    LLM-based Lossless Text Simplification and its Effect on User Comprehension and Cognitive Load
    Theo Guidroz
    Diego Ardila
    Jimmy Li
    Adam Mansour
    Paul Jhun
    Nina Gonzalez
    Xiang Ji
    Mike Sanchez
    Miguel Ángel Garrido
    Faruk Ahmed
    Divyansh Choudhary
    Jay Hartford
    Georgina Xu
    Henry Serrano
    Yifan Wang
    Jeff Shaffer
    Eric (Yifan) Cao
    Sho Fujiwara
    Peggy Bui
    arXiv (2025)
    Preview abstract Information on the web, such as scientific publications and Wikipedia, often surpasses users' reading level. To help address this, we used a self-refinement approach to develop an LLM capability for minimally lossy text simplification. To validate our approach, we conducted a randomized study involving 4563 participants and 31 texts spanning 6 broad subject areas: PubMed (biomedical scientific articles), biology, law, finance, literature/philosophy, and aerospace/computer science. Participants were randomized to viewing original or simplified texts in a subject area, and answered multiple-choice questions (MCQs) that tested their comprehension of the text. The participants were also asked to provide qualitative feedback such as task difficulty. Our results indicate that participants who read the simplified text answered more MCQs correctly than their counterparts who read the original text (3.9% absolute increase, p<0.05). This gain was most striking with PubMed (14.6%), while more moderate gains were observed for the finance (5.5%), aerospace/computer science (3.8%), and legal (3.5%) domains. Notably, the results were robust to whether participants could refer back to the text while answering MCQs. The absolute accuracy decreased by up to ~9% for both original and simplified setups where participants could not refer back to the text, but the ~4% overall improvement persisted. Finally, participants' self-reported perceived ease based on a simplified NASA Task Load Index was greater for those who read the simplified text (absolute change on a 5-point scale 0.33, p<0.05). This randomized study, involving an order of magnitude more participants than prior works, demonstrates the potential of LLMs to make complex information easier to understand. Our work aims to enable a broader audience to better learn and make use of expert knowledge available on the web, improving information accessibility. View details
    Score-based Causal Representation Learning: Linear and General Transformations
    Abhishek Kumar
    Emre Acarturk
    Ali Tajer
    Burak Varici
    Journal of Machine Learning Research (JMLR) (2025)
    Preview abstract This paper addresses intervention-based causal representation learning (CRL) under a general nonparametric latent causal model and an unknown transformation that maps the latent variables to the observed variables. Linear and general transformations are investigated. The paper addresses both the identifiability and achievability aspects. Identifiability refers to determining algorithm-agnostic conditions that ensure recovering the true latent causal variables and the latent causal graph underlying them. Achievability refers to the algorithmic aspects and addresses designing algorithms that achieve identifiability guarantees. By drawing novel connections between score functions (i.e., the gradients of the logarithm of density functions) and CRL, this paper designs a score-based class of algorithms that ensures both identifiability and achievability. First, the paper focuses on linear transformations and shows that one stochastic hard intervention per node suffices to guarantee identifiability. It also provides partial identifiability guarantees for soft interventions, including identifiability up to ancestors for general causal models and perfect latent graph recovery for sufficiently non-linear causal models. Secondly, it focuses on general transformations and shows that two stochastic hard interventions per node suffice for identifiability. Notably, one does not need to know which pair of interventional environments have the same node intervened. View details
    Differentiable Approximations for Distance Queries
    David M. Mount
    Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
    Preview abstract The widespread use of gradient-based optimization has motivated the adaptation of various classical algorithms into differentiable solvers compatible with learning pipelines. In this paper, we investigate the enhancement of traditional geometric query problems such that the result consists of both the geometric function as well as its gradient. Specifically, we study the fundamental problem of distance queries against a set of points P in R^d, which also underlies various similarity measures for learning algorithms. The main result of this paper is a multiplicative (1+epsilon)-approximation of the Euclidean distance to P which is differentiable at all points in R^d \ P with asymptotically optimal bounds on the norms of its gradient and Hessian, from a data structure with storage and query time matching state-of-the-art results for approximate nearest-neighbor searching. The approximation is realized as a regularized distance through a partition-of-unity framework, which efficiently blends multiple local approximations, over a suitably defined covering of space, into a smooth global approximation. In order to obtain the local distance approximations in a manner that facilitates blending, we develop a new approximate Voronoi diagram based on a simple point-location data structure, simplifying away both the lifting transformation and ray shooting. View details
    On the relationship of speed limit and CO2 emissions in urban traffic
    Tamás Tettamanti
    Balázs Varga
    Ori Rottenstreich
    Transportation Research Interdisciplinary Perspectives, 32 (2025)
    Preview abstract The paper analyzes the relationship between urban speed limits and vehicle emissions. There is an ongoing trend of reducing urban speed limits for the sake of increasing road safety. However, the impact of this policy on emissions is still unclear. It can be mixed depending on the proportion of dynamic and steady-state driving. While cruising emissions are higher at lower speeds, lower speeds entail less acceleration in urban traffic. Based on our investigation, one network topology feature (road length) and two traffic-related parameters (traffic volume and turning ratio) have been suggested for analysis as the factors most relevant to vehicle emissions. Their correlation with potential emission reduction was evaluated using high-fidelity traffic simulation based on traffic scenarios validated with real traffic data. Random forest regression was used to support the optimal selection of zones for speed limit reduction. Traffic simulations on large urban networks show that emission reductions of over 10% can be achieved in the case of a well-chosen speed limit policy. View details
    PROTECT: A Framework to Foster Digital Resilience for Youth Navigating Technology-Facilitated Abuse
    Diana Freed
    Natalie Bazarova
    Dan Cosley
    Patrick Gage Kelley
    Social Sciences Journal, 14(6) (2025)
    Preview abstract Youth are increasingly exposed to a broad range of technology-facilitated abuse that challenges their safety and well-being. Building on previous work that examined youth help-seeking behaviors, coping strategies, threats they encounter, and the social support systems around them, we articulate a framework called PROTECT (Problem recognition, Reaching out, Organizing support, Training, Engaging experts, Continuous support, and Tackling safety measures), which integrates existing models of support, help-seeking, and digital skills to offer a high-level, structured approach for adults who serve as a support system as youth navigate technology-facilitated abuse. The framework unpacks social and contextual dynamics that influence help-seeking behaviors, providing a foundation for educators, advocates, health professionals, developers and other adult stakeholders to design and develop trauma-informed, timely interventions to promote resilience. View details
    Neural Speech and Audio Coding
    Minje Kim
    IEEE Signal Processing Magazine, 41 (2025), pp. 85-93
    Preview abstract This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs and discusses the limitations of purely data-driven approaches, which often require inefficiently large architectures to match the performance of model-based methods. The study presents hybrid systems as a viable solution, offering significant improvements to the performance of conventional codecs through meticulously chosen design enhancements. Specifically, it introduces a neural network-based signal enhancer designed to post-process existing codecs’ output, along with the autoencoder-based end-to-end models and LPCNet—hybrid systems that combine linear predictive coding (LPC) with neural networks. Furthermore, the paper delves into predictive models operating within custom feature spaces (TF-Codec) or predefined transform domains (MDCTNet) and examines the use of psychoacoustically calibrated loss functions to train end-to-end neural audio codecs. Through these investigations, the paper demonstrates the potential of hybrid systems to advance the field of speech and audio coding by bridging the gap between traditional model-based approaches and modern data-driven techniques. View details
    Beyond the Crawl: Unmasking Browser Fingerprinting in Real User Interactions
    Muthu Selva Annamalai, Meenatchi Sundaram
    Emiliano De Cristofaro
    WWW (2025)
    Preview abstract Browser fingerprinting is an online tracking technique that is being increasingly adopted for profiling and ad targeting purposes. While prior work has analyzed the prevalence and impact of browser fingerprinting on the Web, it has traditionally relied on large-scale automated crawls. Naturally, these cannot replicate real human interactions, e.g., solve CAPTCHAs, evade bot detectors, or operate behind login pages and paywalls. This prompts the question as to whether or not the fingerprinting ecosystem is appreciably different in real-world browsing sessions. In this paper, we begin to address this question by designing and conducting a user study aimed at collecting actual telemetry data from real browsing sessions of 30 users. We find that almost half of the fingerprinting websites identified from real user browsing sessions are missed by equivalent automated crawls. This is mainly because automated crawls cannot identify and visit authentication pages, are blocked by bot detectors, and/or fail to perform user interactions that specifically trigger browser fingerprinting scripts. We also find new fingerprinting vectors that are consistently present in fingerprinting scripts captured by real user browsing sessions yet missing from automated crawls. Finally, we assess the feasibility of collecting fingerprinting training data in a privacy-preserving way. We conclude that private models built on real user browsing sessions can detect browser fingerprinting more effectively than models trained on automated crawls alone, while simultaneously providing strong privacy guarantees to users. View details
    Data-Driven Mechanism Design: Jointly Eliciting Preferences and Information
    Dirk Bergemann
    Marek Bojko
    Paul Duetting
    Haifeng Xu
    EC '25: Proceedings of the 26th ACM Conference on Economics and Computation (2025), pp. 507
    Preview abstract We study mechanism design when agents have private preferences and private information about a common payoff-relevant state. We show that standard message-driven mechanisms cannot implement socially efficient allocations when agents have multidimensional types, even under favorable conditions. To overcome this limitation, we propose data-driven mechanisms that leverage additional post-allocation information, modeled as an estimator of the payoff-relevant state. Our data-driven mechanisms extend the classic Vickrey-Clarke-Groves class. We show that they achieve exact implementation in posterior equilibrium when the state is either fully revealed or the utility is affine in an unbiased estimator. We also show that they achieve approximate implementation with a consistent estimator, converging to exact implementation as the estimator converges, and present bounds on the convergence rate. We demonstrate applications to digital advertising auctions and large language model (LLM)-based mechanisms, where user engagement naturally reveals relevant information. View details
    Benchmarking and improving algorithms for attributing satellite-observed contrails to flights
    Vincent Rudolf Meijer
    Rémi Chevallier
    Allie Duncan
    Kyle McConnaughay
    Atmospheric Measurement Techniques, 18 (2025), pp. 3495-3532
    Preview abstract Condensation trail (contrail) cirrus clouds cause a substantial fraction of aviation's climate impact. One proposed method for the mitigation of this impact involves modifying flight paths to avoid particular regions of the atmosphere that are conducive to the formation of persistent contrails, which can transform into contrail cirrus. Determining the success of such avoidance maneuvers can be achieved by ascertaining which flight formed each nearby contrail observed in satellite imagery. The same process can be used to assess the skill of contrail forecast models. The problem of contrail-to-flight attribution is complicated by several factors, such as the time required for a contrail to become visible in satellite imagery, high air traffic densities, and errors in wind data. Recent work has introduced automated algorithms for solving the attribution problem, but it lacks an evaluation against ground-truth data. In this work, we present a method for producing synthetic contrail detections with predetermined contrail-to-flight attributions that can be used to evaluate – or “benchmark” – and improve such attribution algorithms. The resulting performance metrics can be employed to understand the implications of using these observational data in downstream tasks, such as forecast model evaluation and the analysis of contrail avoidance trials, although the metrics do not directly quantify real-world performance. We also introduce a novel, highly scalable contrail-to-flight attribution algorithm that leverages the characteristic compounding of error induced by simulating contrail advection using numerical weather models. The benchmark shows an improvement of approximately 25 % in precision versus previous contrail-to-flight attribution algorithms, without compromising recall. View details
    A Scalable Framework for Evaluating Health Language Models
    Neil Mallinar
    Tony Faranesh
    Brent Winslow
    Nova Hammerquist
    Ben Graef
    Cathy Speed
    Mark Malhotra
    Shwetak Patel
    Xavi Prieto
    Daniel McDuff
    Ahmed Metwally
    (2025)
    Preview abstract Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors, is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health. View details
    Preview abstract Estimating Origin-Destination (OD) travel demand is vital for effective urban planning and traffic management. Developing universally applicable OD estimation methodologies is significantly challenged by the pervasive scarcity of high-fidelity traffic data and the difficulty in obtaining city-specific prior OD estimates (or seed ODs), which are often a prerequisite for traditional approaches. Our proposed method directly estimates OD travel demand by systematically leveraging aggregated, anonymized statistics from Google Maps Traffic Trends, obviating the need for conventional census or city-provided OD data. The OD demand is estimated by formulating a single-level, one-dimensional, continuous nonlinear optimization problem with nonlinear equality and bound constraints to replicate highway path travel times. The method achieves efficiency and scalability by employing a differentiable analytical macroscopic network model. This model by design is computationally lightweight, distinguished by its parsimonious parameterization that requires minimal calibration effort and its capacity for instantaneous evaluation. These attributes ensure the method's broad applicability and practical utility across diverse cities globally. Using segment sensor counts from Los Angeles and San Diego highway networks, we validate our proposed approach, demonstrating a two-thirds to three-quarters improvement in the fit to segment count data over a baseline. Beyond validation, we establish the method's scalability and robust performance in replicating path travel times across diverse highway networks, including Seattle, Orlando, Denver, Philadelphia, and Boston. In these expanded evaluations, our method not only aligns with simulation-based benchmarks but also achieves an average 13% improvement in its ability to fit travel time data compared to the baseline during afternoon peak hours. View details
    Beyond Digital Literacy: Building Youth Digital Resilience Through Existing “Information Sensibility” Practices
    Mia Hassoun
    Ian Beacock
    Todd Carmody
    Patrick Gage Kelley
    Beth Goldberg
    Devika Kumar
    Laura Murray
    Rebekah Park
    Behzad Sarmadi
    Social Sciences Journal, 14(4) (2025)
    Preview abstract Youth media consumption and disordered eating practices have historically been subjects of moral panics, often resulting in protective, deficit-based interventions like content removal. We argue for interventions which instead equip youth to evaluate and manage risks in their online environments, building upon their existing “information sensibility” practices. Drawing upon ethnographic research and intervention testing with 77 participants in the US and India, we analyze how youth (aged 13–26), including those with diverse political perspectives and those recovering from disordered eating (DE), engage with online news and health information. Participants generally algorithmically encountered (rather than searched for) information online, and their engagement was shaped more by social motivations—like belonging—than truth seeking. Participants interpreted online information collaboratively, relying on social cues and peer validation within their online communities. They demonstrated preference for personal testimonies and relatable sources, particularly those with similar social identities. We propose resilience-building interventions that build upon these youth online information practices by: (1) leveraging peer networks, promoting critical information engagement through collaborative learning and peer-to-peer support within online communities; (2) developing social media sensibility, equipping youth to critically evaluate information sources in situ; (3) providing pathways offline, connecting youth to desired in-person communities; and (4) encouraging probabilistic thinking. View details
    Preview abstract Mainstream artificial neural network models, such as Deep Neural Networks (DNNs), are computation-heavy and energy-hungry. Weightless Neural Networks (WNNs) are natively built with RAM-based neurons and represent an entirely distinct type of neural network computing compared to DNNs. WNNs are extremely low-latency and low-energy, and suitable for efficient, accurate edge inference. The WNN approach derives an implicit inspiration from the decoding process observed in the dendritic trees of biological neurons, making neurons based on Random Access Memories (RAMs) and/or Lookup Tables (LUTs) ready-to-deploy neuromorphic digital circuits. Since FPGAs are abundant in LUTs, LUT-based WNNs are a natural fit for implementing edge inference in FPGAs. WNNs have been demonstrated to be energy-efficient AI models, both in software and in hardware. For instance, the most recent DWN – Differential Weightless Neural Network – model demonstrates up to 135× reduction in energy costs in FPGA implementations compared to other multiplication-free approaches, such as binary neural networks (BNNs) and DiffLogicNet, up to 9% higher accuracy in deployments on constrained devices, and up to 42.8× reduction in circuit area for ultra-low-cost chip implementations. This tutorial will help participants understand how WNNs work and why WNNs were underdogs for such a long time, introduce the most recent members of the WNN family, such as BTHOWeN, LogicWiSARD, COIN, ULEEN and DWN, and contrast them with BNNs and LogicNets. View details