Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

We consider a setting where we have a ground set ℳ together with real-valued set functions f₁, …, fₙ, and the goal is to partition ℳ into two sets S₁, S₂ such that |fᵢ(S₁) − fᵢ(S₂)| is small for every i. Many results in discrepancy theory can be stated in this form with additive functions fᵢ. In this work, we initiate the study of the unstructured case where fᵢ is not assumed to be additive. We show that even without the additivity assumption, the upper bound remains O(√(n log n)). Our result has implications for the fair allocation of indivisible goods. In particular, we show that a consensus halving up to O(√(n log n)) goods always exists for n agents with monotone utilities. Previously, only an O(n) bound was known for this setting.
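The partition objective above can be made concrete with a small brute-force sketch. This is not the paper's technique: the O(√(n log n)) result is an existence bound, and exhaustive search is exponential in |ℳ|. The sketch only shows how the discrepancy of a partition is measured when the fᵢ are arbitrary, possibly non-additive, set functions. The names `max_discrepancy` and `best_partition` are illustrative.

```python
from itertools import product
from typing import Callable, FrozenSet, Hashable, Iterable, Sequence, Tuple

# A set function maps a subset of the ground set to a real value; it need not be additive.
SetFunction = Callable[[FrozenSet[Hashable]], float]


def max_discrepancy(s1: FrozenSet[Hashable], s2: FrozenSet[Hashable],
                    fs: Sequence[SetFunction]) -> float:
    """Largest imbalance |f_i(S1) - f_i(S2)| over all functions f_i."""
    return max(abs(f(s1) - f(s2)) for f in fs)


def best_partition(ground: Iterable[Hashable], fs: Sequence[SetFunction]
                   ) -> Tuple[FrozenSet[Hashable], FrozenSet[Hashable], float]:
    """Try every 2-coloring of the ground set (exponential; tiny inputs only)."""
    items = list(ground)
    best_s1: FrozenSet[Hashable] = frozenset()
    best_s2: FrozenSet[Hashable] = frozenset(items)
    best_val = float("inf")
    for sides in product((0, 1), repeat=len(items)):
        s1 = frozenset(x for x, side in zip(items, sides) if side == 0)
        s2 = frozenset(x for x, side in zip(items, sides) if side == 1)
        val = max_discrepancy(s1, s2, fs)
        if val < best_val:
            best_s1, best_s2, best_val = s1, s2, val
    return best_s1, best_s2, best_val


# f1 is additive (cardinality); f2 is not (it only pays off when "a" and "b" are together).
f1: SetFunction = lambda s: float(len(s))
f2: SetFunction = lambda s: 1.0 if {"a", "b"} <= s else 0.0
S1, S2, disc = best_partition(["a", "b", "c", "d"], [f1, f2])
print(sorted(S1), sorted(S2), disc)
```

Here a perfectly balanced split exists (for example {a, c} / {b, d}), so the search reports discrepancy 0, whereas the split {a, b} / {c, d} balances the additive f₁ but not the non-additive f₂.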
Visual Planning: Let’s Think Only with Images
Han Zhou
Caiqi Zhang
Anna Korhonen
Chengzu Li
Yi Xu
Ivan Vulic
International Conference on Learning Representations (ICLR) (2026)
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have significantly enhanced machine reasoning across diverse tasks. However, these models predominantly rely on language as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial, geometric, or physical dynamics. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of textual mediation. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We then introduce a novel two-stage reinforcement learning framework empowered by GRPO for post-training large vision models, resulting in substantial improvements in planning accuracy and generalization across both seen and novel scenarios, validated in representative visual navigation tasks, FrozenLake and Maze. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
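GRPO (Group Relative Policy Optimization) replaces a learned value baseline with a group-relative one: several outputs are sampled for the same input, and each output's reward is normalized by the mean and standard deviation of the rewards in its group. The sketch below shows only that advantage computation, assuming scalar per-output rewards; it does not reproduce the paper's two-stage framework or its reward design. Implementations differ on population versus sample standard deviation; population is assumed here, and the binary "reached the goal" reward in the example is hypothetical.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group: (r_i - mean) / (std + eps).

    eps keeps the division finite when every output in the group received the same
    reward; all advantages are then 0 and the group contributes no learning signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption; some code uses sample std)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled plans: reward 1.0 if the plan reaches the goal, else 0.0.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Successful plans receive positive advantages and failed ones negative advantages of equal size, so the policy is pushed toward whatever distinguished the better samples within the same group.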
We introduce TimesX, a new context-enriched time series forecasting benchmark. TimesX contains a wide selection of high-quality real-world time series paired with diverse textual contexts produced by an automated generation pipeline, which helps address three main issues of existing benchmarks: (1) poor generalization due to low data volume and reliance on synthetic data, (2) restricted forms of context, and (3) an inability to mitigate data leakage. We conduct a thorough empirical study of current multimodal solutions on TimesX. Our results suggest that most multimodal solutions that work well on existing benchmarks may fail on TimesX. In contrast, simple ensemble methods that leverage the rich textual context can outperform strong unimodal baselines and other multimodal baselines. Below is the version submitted to ITP: We create a real-world multimodal time series forecasting benchmark that spans diverse domains and regions. Each time series is annotated with various kinds of context, such as metadata, date and holiday information, and dynamic events related to the series. This is significantly more advanced than other available benchmarks, which rely either on static metadata alone or on synthetic examples, and it forms a test bed for multimodal forecasting. We also present baseline results showing that ensembles of publicly available LLMs and time series foundation models can achieve non-trivial performance on this benchmark.
A growing body of qualitative research has identified contextual risk factors that elevate people's chances of experiencing digital-safety attacks. However, the lack of quantitative data on the population-level distribution of these risk factors prevents policymakers and tech companies from developing targeted, evidence-based interventions to improve digital safety. To address this gap, we surveyed 5,001 adults in the United States to analyze: (1) the frequency of and relationship between digital-safety attacks (e.g., scams, harassment, account hacking), and (2) how these attacks align with 10 contextual risk factors. Nearly half of our respondents identify as resource-constrained, which significantly correlates with a higher likelihood of experiencing four common attacks. We also present qualitative insights to expand our understanding of the factors beyond the existing literature (e.g., "prominence" included high-visibility roles in local communities). This study provides the first large-scale quantitative analysis correlating digital-safety attacks with contextual risk factors and demographics.
The management of a hybrid workforce comprising human and autonomous computational agents may be challenged by the use of separate systems for human capital and software assets, which can create a governance gap. A system can provide a unified framework for managing a hybrid workforce. For example, the system may utilize a labor service mesh to analyze and route tasks to either a human intent tier or an agentic execution tier. A potential principle of the system is structural symmetry, where computational agents can be assigned digital identities and managed through a lifecycle process that may parallel human resource functions, such as onboarding, performance evaluation, and structured offboarding. This integrated approach can facilitate a unified system of record and governance model for an organization's intelligence capacity.
Progressive Photorealistic Simplification
Adi Rosenthal
Yedid Hoshen
Arik Shamir
2026
Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select–Remove–Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.
ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding
Sunny Rajagopalan
Alireza Golestaneh
Shubhra Chandra
Min Zhou
Jonathan Vronsky
Songbai Yan
2026
We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF reduces false positives by 90% while maintaining 99.8% precision on abuse detection tasks. The architecture's effectiveness stems from its novel combination of multi-modal transformations, an intersample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.
Some artificial intelligence provisioning models that function as tools for human users or rely on labor arbitrage can present challenges for organizations, such as managing personnel rather than task outcomes and introducing data security risks. An architecture is described for an outcome-based synthetic labor market in which autonomous computational agents can be compensated based on verified task completion. The framework can leverage trusted execution environments to create secure hardware enclaves for processing sensitive data, which can render the data cryptographically inaccessible to a host system or agent provider. This approach can facilitate a secure, transactional market for autonomous professional execution, which may enable a shift from managing labor resources to procuring verified outcomes from a pool of specialized agents.
Modern user interfaces are complex composites, with elements originating from various sources, such as the operating system, apps, a web browser, or websites. Many security and privacy models implicitly depend on users correctly identifying an element's source, a concept we term "surface attribution." Through two large-scale vignette-based surveys (N=4,400 and N=3,057), we present the first empirical measurement of this ability. We find that users struggle, correctly attributing UI source only 55% of the time on desktop and 53% on mobile. Familiarity and strong brand cues significantly improve accuracy, whereas UI positioning, a long-held security design concept especially for browsers, has minimal impact. Furthermore, simply adding a "Security & Privacy" brand cue to Android permission prompts failed to improve attribution. These findings demonstrate a fundamental gap in users' mental models, indicating that relying on them to distinguish trusted UI is a fragile security paradigm.
Agentic Coding Needs Proactivity, Not Just Autonomy
Georgios Evangelopoulos
(2026) (to appear)
Coding agents are rapidly changing the landscape of software development, moving from inline completion to autonomous systems that edit repositories, open pull requests, respond to issues, and run scheduled or webhook-triggered routines across the development life cycle. The next generation is increasingly described as proactive and long-horizon: agents should notice relevant changes before the developer asks, connect signals across tools, decide when to interrupt, and carry preferences across sessions. Yet the field lacks a precise account of what proactivity means for software development, how it differs from autonomy, what acceptance criteria proactive long-horizon tasks should satisfy, and which metrics determine whether unsolicited agent behavior is useful rather than merely active. We argue that proactive coding agents should be evaluated by the quality and improvement of their insight policy: the policy that decides what matters next, what evidence supports it, whether to surface it, and how to adapt after feedback. We re-anchor this view in mixed-initiative interaction, introduce a three-level taxonomy (Reactive, Scheduled, and Situation Aware), compare contemporary coding agents against five operational criteria, and sketch an active user simulation protocol with three evaluation targets: Insight Decision Quality (IDQ), Context Grounding Score (CGS), and Learning Lift (LL).
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action-Spaces
Haitong Ma
Ofir Nabati
Na Li
Shie Mannor
Guy Tennenholtz
Proceedings of the 43rd International Conference on Machine Learning (ICML-26), Seoul, South Korea (2026)
Reinforcement learning (RL) algorithms have achieved superhuman performance on many sequential decision-making tasks, but often struggle in domains with large, combinatorial action spaces. To address this, we introduce a practical and stable algorithm for training discrete diffusion models to represent policies in such environments. We formulate a policy mirror descent algorithm that enhances training stability by reframing policy optimization as an inference problem, which naturally aligns with the learning objective of discrete diffusion models. Through extensive experiments on a suite of challenging benchmark tasks, we demonstrate that our approach achieves significant improvements over existing methods in both performance and sample efficiency. This work opens a promising new direction for applying discrete diffusion models in RL to tackle long-standing challenges in large-scale combinatorial action spaces.
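For intuition, the classic KL-regularized policy mirror descent step on a small, explicitly enumerated action set is π_{k+1}(a) ∝ π_k(a)·exp(η·Q(a)). The paper's contribution is making such updates practical when the action space is combinatorial and cannot be enumerated, by representing the policy with a discrete diffusion model; the sketch below does not do that. The tabular form, the step size η, the tuple-encoded actions, and the Q-values are illustrative assumptions.

```python
import math
from typing import Dict, Hashable, Mapping


def pmd_step(policy: Mapping[Hashable, float], q: Mapping[Hashable, float],
             eta: float) -> Dict[Hashable, float]:
    """One KL-regularized mirror descent step: new pi(a) proportional to pi(a) * exp(eta * Q(a))."""
    # Work in log space and subtract the max logit for numerical stability.
    logits = {a: math.log(p) + eta * q[a] for a, p in policy.items() if p > 0.0}
    shift = max(logits.values())
    weights = {a: math.exp(l - shift) for a, l in logits.items()}
    total = sum(weights.values())
    # Zero-probability actions stay at zero: a multiplicative update cannot move mass onto them.
    return {a: weights.get(a, 0.0) / total for a in policy}


# Hypothetical combinatorial actions: which of two items to select, encoded as bit tuples.
pi = {(0, 1): 0.5, (1, 0): 0.5}
q_values = {(0, 1): 1.0, (1, 0): 0.0}
print(pmd_step(pi, q_values, eta=math.log(2)))
```

With η = ln 2 the better action's weight doubles relative to the other, moving the policy from (1/2, 1/2) to (2/3, 1/3); a smaller η keeps each update closer to the previous policy, which is the stability property mirror descent is used for.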
What does your wearable know about the festive season?
Justin Phillips
Katarina Vukosavljević
Abram Schönfeldt
YongSuk Cho
Conor Heneghan
Robert Harle
(2026)
As we reach the end of the year and people look forward to spending quality time with loved ones, here at Fitbit, we wonder what our Pixel watches and Fitbit trackers can tell us about how we are spending the festive season. We looked at the data of 11.8 million of our users all over the world between January 2022 and July 2025. Here are the key stats we wanted to share with you!
The rapid expansion of the Internet of Things (IoT) and smart home ecosystems has led to a fragmented landscape of user data management across consumer electronics (CE) such as Smart TVs, gaming consoles, and set-top boxes. Current onboarding processes on these devices are characterized by high friction due to manual data entry and opaque data-sharing practices. This paper introduces the User Data Sharing System (UDSS), a platform-agnostic framework designed to facilitate secure, privacy-first exchange of Personally Identifiable Information (PII) between device platforms and third-party applications. Our system implements a Contextual Scope Enforcement (CSE) mechanism that programmatically restricts data exposure based on user intent, specifically distinguishing between Sign-In and Sign-Up workflows. Unlike cloud-anchored identity standards such as FIDO2/WebAuthn, UDSS is designed for shared, device-centric CE environments where persistent user-to-device binding cannot be assumed. We further propose a tiered access model that balances developer needs with regulatory compliance (GDPR/CCPA). A proof-of-concept implementation on a reference ARMv8 Linux-based middleware demonstrates that UDSS reduces user onboarding latency by 65% and measurably reduces PII over-exposure risk through protocol-enforced data minimization. This framework provides a standardized approach to identity management in the heterogeneous CE market.
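A minimal sketch of the idea behind Contextual Scope Enforcement, assuming a static table that maps each user intent to the profile fields a third-party app may receive. The field names, the `Intent` enum, the `release_fields` function, and the choice to reject (rather than silently trim) an over-broad request are hypothetical illustrations, not the UDSS protocol.

```python
from enum import Enum
from typing import Dict, FrozenSet, Iterable, Mapping


class Intent(Enum):
    SIGN_IN = "sign_in"
    SIGN_UP = "sign_up"


# Hypothetical scope table: Sign-In needs only an account identifier,
# while Sign-Up may also prefill a few registration fields.
ALLOWED_FIELDS: Mapping[Intent, FrozenSet[str]] = {
    Intent.SIGN_IN: frozenset({"email"}),
    Intent.SIGN_UP: frozenset({"email", "display_name", "country"}),
}


def release_fields(profile: Mapping[str, str], intent: Intent,
                   requested: Iterable[str]) -> Dict[str, str]:
    """Return only the requested fields, refusing any field outside the intent's scope."""
    wanted = list(requested)
    out_of_scope = sorted(set(wanted) - ALLOWED_FIELDS[intent])
    if out_of_scope:
        raise PermissionError(f"{intent.value} may not access: {', '.join(out_of_scope)}")
    # Data minimization: fields the user never stored are simply omitted.
    return {name: profile[name] for name in wanted if name in profile}


profile = {"email": "ana@example.com", "display_name": "Ana", "country": "PT", "phone": "555-0100"}
print(release_fields(profile, Intent.SIGN_IN, ["email"]))
```

Failing loudly on an out-of-scope request makes over-asking visible to the integrating developer; a tiered access model could instead widen `ALLOWED_FIELDS` per developer tier, but that mapping is likewise an assumption here.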
Managing compiler build errors that can arise during infrastructure upgrades in large, polyglot codebases may be challenging, as manual remediation can be slow and some automated tools may not support modern language syntax. A system can provide automated error remediation by ingesting compiler diagnostics and analyzing source code using an Abstract Syntax Tree (AST). A recursive scope resolution algorithm, for example, can traverse the AST to identify a specific, narrowly scoped code block at which to apply an error suppression. Alternatively, when lexical scope resolution is not required, this algorithmic complexity can be bypassed, and the system can place error suppressions directly at the error's exact coordinates. The system may then generate and apply language-specific patches, such as structured comments for JavaScript source files or line-scoped comments for TypeScript source files, for example by using a transactional rewrite engine. This approach can provide a scalable method for automated code remediation, which may facilitate infrastructure upgrades by reducing the need for manual intervention.
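The coordinate-based path (no scope resolution) can be sketched for the TypeScript case: given the 1-based line numbers reported by compiler diagnostics, insert a line-scoped `// @ts-expect-error` comment, which TypeScript applies to the line that follows it, above each one. This is a plain-text sketch under that assumption, not the described system's transactional rewrite engine; the function name and the optional `reason` text are hypothetical. JavaScript files would need a structured comment instead (for example a Closure Compiler JSDoc `@suppress` annotation, which attaches to an enclosing declaration rather than a single line), which is where AST scope resolution becomes necessary.

```python
from typing import Iterable


def add_ts_expect_error(source: str, error_lines: Iterable[int], reason: str = "") -> str:
    """Insert `// @ts-expect-error` above each reported (1-based) line, matching its indentation."""
    lines = source.split("\n")
    # Insert bottom-up so earlier insertions do not shift line numbers still to be processed;
    # set() avoids stacking two comments when several diagnostics hit the same line.
    for line_no in sorted(set(error_lines), reverse=True):
        if not 1 <= line_no <= len(lines):
            raise ValueError(f"diagnostic line {line_no} is outside the file")
        target = lines[line_no - 1]
        indent = target[: len(target) - len(target.lstrip())]
        suffix = f" {reason}" if reason else ""
        lines.insert(line_no - 1, f"{indent}// @ts-expect-error{suffix}")
    return "\n".join(lines)


src = "const a = 1;\n  let b: number = 'x';\n"
print(add_ts_expect_error(src, [2], reason="TODO: fix after upgrade"))
```

`@ts-expect-error` is used here rather than `@ts-ignore` because TypeScript reports an unused `@ts-expect-error` directive once the underlying error is fixed, so stale suppressions surface instead of lingering.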
As the ECMAScript specification evolves, industrial-scale JavaScript compilers face the challenge of supporting modern language syntax while maintaining compatibility for diverse execution environments. Traditionally, compilers solve this by running transpilation passes in a monolithic pipeline, where passes are selected strictly based on a target language level. This results in significant computational waste, as compilers perform expensive Abstract Syntax Tree (AST) traversals to lower features that may not exist in the actual input source code. We present a static analysis improvement that conditionally executes transpiler passes by accurately tracking, and dynamically maintaining throughout transpilation, the exact set of language features present in the compilation unit. It is implemented in the production Google Closure Compiler. By populating and maintaining a FeatureSet for every JavaScript script, it dynamically skips unnecessary lowering passes. We detail the architectural safeguards, including strategic pass ordering and dynamic validation of the transpiled code for feature correctness. Evaluated on large-scale production applications, this improvement considerably reduced compilation time and saved compute and memory.
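The pass-skipping idea can be modeled in a few lines: each lowering pass declares which features it rewrites away and which it may introduce, and the pipeline consults the script's current feature set before running it. This is a simplified Python model under stated assumptions (the feature names, the `LoweringPass` type, and its `lowers`/`introduces` fields are hypothetical), not Closure Compiler's Java implementation. The `introduces` field shows why pass ordering matters: a pass that emits a new feature must run before the pass that lowers it.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet, List, Sequence, Set


@dataclass(frozen=True)
class LoweringPass:
    name: str
    lowers: FrozenSet[str]                   # features this pass rewrites away
    introduces: FrozenSet[str] = frozenset()  # features its output may contain


def run_pipeline(script_features: AbstractSet[str], target_supports: AbstractSet[str],
                 passes: Sequence[LoweringPass]) -> List[str]:
    """Run passes in order, skipping any with nothing to lower; return names of passes run."""
    features: Set[str] = set(script_features)
    executed: List[str] = []
    for p in passes:
        # Work exists only for features present in this script that the target cannot run.
        work = (features & p.lowers) - set(target_supports)
        if not work:
            continue  # skip the AST traversal entirely
        executed.append(p.name)
        # Keep the feature set accurate after the rewrite so later passes decide correctly.
        features = (features - work) | p.introduces
    return executed


pipeline = [
    LoweringPass("lower_async", frozenset({"async_functions"}), frozenset({"generators"})),
    LoweringPass("lower_generators", frozenset({"generators"})),
    LoweringPass("lower_optional_chaining", frozenset({"optional_chaining"})),
]
print(run_pipeline({"async_functions", "let_const"}, target_supports=set(), passes=pipeline))
```

For this script only the first two passes run; the optional-chaining pass is skipped because the feature never appears. If `lower_generators` were ordered before `lower_async`, the generators emitted by `lower_async` would be left unlowered, which is plausibly the kind of error the pass-ordering and validation safeguards are meant to catch.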