Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 11130 publications
Agentic AI Infrastructure in Practice: Learn These Key Hurdles to Deploy Production AI Agents Efficiently
https://swisscognitive.ch/ (2026)
Preview abstract
The emergence of Agentic AI—autonomous systems capable of reasoning, decision-making, and multi-step execution—represents a paradigm shift in enterprise technology. Moving beyond simple generative tasks, these agents offer the potential to solve long-standing industry pain points, with over 90% of enterprises planning integration within the next three years. However, the transition from successful proof-of-concept (PoC) to a resilient, production-grade system presents significant hurdles.
This article categorizes these challenges into three primary domains:
Technical and Engineering Hurdles: Issues such as "entangled workflows" that complicate debugging, the struggle to maintain output quality and mitigate hallucinations, and the unpredictability caused by shifting underlying models or data sources.
People, Process, and Ecosystem Hurdles: The high operational costs and unclear ROI of large models, the necessity of a new "Agent Ops" skillset, the complexity of integrating agents with disparate enterprise systems, and a rapidly evolving regulatory landscape.
The Pace of Change and Security risks: The technical debt incurred by shifting software frameworks and the expanded attack surface created by autonomous agents.
The article concludes that successful deployment requires a shift from informal "vibe-testing" to rigorous engineering discipline. By adopting code-first frameworks, establishing robust evaluation metrics (KPIs), and prioritizing functional deployment over theoretical optimization, organizations can effectively manage the lifecycle of Agentic AI and realize its transformative business value.
View details
Preview abstract
Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL?
In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy.
We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data.
We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL.
Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL.
In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL.
View details
A Framework for Interactive Machine Learning and Enhanced Conversational Systems
Jerry Young
Richard Abisla
Sanjay Batra
Mikki Phan
Nature, Springer-Verlag (2026)
Preview abstract
Conversational systems are increasingly prevalent, yet current versions often fail to support the full range of human speech, including variations in speed, rhythm, syntax, grammar, articulation, and resonance. This reduces their utility for individuals with dysarthria, apraxia, dysphonia, and other language and speech-related disabilities. Building on research that emphasizes the need for specialized datasets and model training tools, our study uses a scaffolded approach to understand the ideal model training and voice recording process. Our findings highlight two distinct user flows for improving model training and provide six guidelines for future conversational system-related co-design frameworks. This study offers important insights on creating more effective conversational systems by emphasizing the need to integrate interactive machine learning into training strategies.
View details
Preview abstract
We study the problem of allocating access point bandwidth to users of a wireless network in the presence of adversarial jamming. Specifically, we consider a setting in which the network designer acts first and allocates access point bandwidth to the users of the network, before an adversary applies a jamming strategy to reduce the bandwidth of a subset (or all) of the access points. We consider a strong adversary who has complete information and can optimize the jamming strategy, subject to power budget constraints. In turn, the network designer must allocate the resources in anticipation of the adversary's actions.
We explain that our model gives rise to a special network interdiction model, which differs from the standard setting in two ways: The first is that the interdictor is given the benefit of responding, rather than leading the game. The second is that the interdiction is fractional and performed at the node level of the network. The interdiction then propagates to all edges incident to the access point.
In terms of technical results, we provide an allocation algorithm that is based on linear programming duality and show that the algorithm can solve the problem optimally, assuming knowledge of the adversary's budget constraints. We conduct experiments on synthetic data to show the extent to which the algorithm improves the total utilized bandwidth over the algorithm that optimizes bandwidth allocation while being oblivious to the adversary's existence.
View details
Preview abstract
Responsive user interfaces enable dynamically adjusting user interfaces based on device-specific aspects such as screen size, aspect ratio, display resolution, etc. However, traditional responsive design fails to account for different types of constraints of a user and task criticality of the task being performed via the UI. Misalignment between the UI design, user context and task criticality can lead to user error. This disclosure describes techniques, implemented with user permission, for dynamically modifying the layout, information density, and/or interactive physics of a user interface based on a dual-factor analysis of user cognitive state and task criticality. The user's cognitive state can be inferred from behavioral telematics. Task criticality can be inferred from semantic analysis. The information density and other parameters of a user interface are automatically adjusted based on such analyses. Such adjustments include applying or relaxing restrictions on interactivity and adjusting visual prominence of various UI elements to adjust the information density of the user interface. The adjustments can also include adjusting friction as appropriate, hiding certain aspects of the user interface, or other types of adjustments.
View details
ARM MTE Performance in Practice
Preview
Taehyun Noh
Yingchen Wang
Tal Garfinkel
Mahesh Madhav
Mattan Erez
Shravan Narayan
Usenix Security (2026)
Unveiling the Global Landscape of Android Security Updates
Haiyun Deng
Abbas Acar
Esteban Luques
Harun Oz
Ahmet Aris
Selcuk Uluagac
IEEE Transactions on Dependable and Secure Computing (2026)
Preview abstract
Android is the world’s leading mobile operating
system, with over three billion active devices. Detecting vulnerabilities and ensuring timely patch deployment are critical to
maintaining security. The Android Open Source Project (AOSP)
has enhanced the transparency of security updates through Security Patch Levels. However, challenges related to update speed
and availability persist. In 2022, Google reported that half of the
zero-day vulnerabilities discovered in the wild were variations of
vulnerabilities that had already been patched. Recent research
mainly highlights delays in update distribution, often attributing
them to fragmentation and focusing primarily on flagship devices
or limited time-frames. Our approach takes a device-centric
perspective to investigate Android update patterns, analyzing
567K security update records from 2014 to 2024, covering 904
distinct devices from six key Original Equipment Manufacturers
(OEMs) across 98 countries. Our extensive analysis revealed
notable differences in update release timing across OEMs, device types, and regions. Our study also examines documented
vulnerabilities and weaknesses, while assessing OEM compliance
with Android security guidelines. Our study shows that ∼89.7%
of vulnerabilities on unpatched Android devices are exploitable
without user interaction and with low attack complexity. We
also identified delays linked to fragmentation and OEM-specific
challenges, and provide actionable insights for improvement.
View details
Who Controls the Curriculum for AI? The Limits of Participatory Design for Educational AI
Michael Madaio
Learning Under Algorithmic Conditions, University of Minnesota Press (2026)
Preview abstract
Participatory design is a long-standing effort to shift control over technology design from technologists to users and communities impacted by technologies. For educational AI, this means involving students, families, teachers, and other stakeholders in shaping the design of AI systems. While promising, in this article, I situate the recent calls for participatory design of educational AI systems within a different historical tradition—that of contests over local control of educational curricula. I argue that approaches that attempt to steer the design and development of educational AI through participatory methods may inadvertently reproduce the history of political contestation of educational curricula, in ways that may privilege the most powerful communities, rather than those inequitably impacted. What might it look like to treat participatory AI design as a site for political contestation? How might these approaches avoid reproducing the same majoritarian tendencies that led to educational inequities in the first place?
View details
Preview abstract
This article delves into how Google Site Reliability Engineers (SREs) leverage Gemini 3 and the Gemini CLI to aggressively reduce Mean Time to Mitigation (MTTM) during real-world outages. By focusing on the SRE motto of "Eliminate Toil," the article walks through a simulated incident, demonstrating how an agentic CLI acts as a human-in-the-loop copilot across the entire incident lifecycle: from initial paging and investigation, through safe, tool-driven mitigation and root cause analysis, to automated postmortem generation and action item filing. This direct integration of Gemini's reasoning capabilities with operational data and internal tools creates a virtuous cycle where past incident learnings continuously inform and improve future solutions.
View details
Silicon-Level Sovereignty: Root of Trust in AI Accelerators (Digital Trust & Policy)
https://www.dotmagazine.online (2026)
Preview abstract
As artificial intelligence (AI) transitions from experimental pilot programs to mission-critical enterprise operations, traditional software-based security frameworks are proving insufficient against sophisticated infrastructure-level threats. This article introduces the concept of Silicon-Level Sovereignty, a first-principles approach to digital trust that anchors security in the physical hardware rather than the software stack.
We examine the technical architecture of Hardware Root of Trust (RoT), specifically focusing on the roles of Trusted Platform Modules (TPMs) and Secure Enclaves in modern AI accelerators such as GPUs and TPUs. By leveraging cryptographic remote attestation, organizations can move from a model of assumed software integrity to one of verifiable hardware-level proof.
The discussion provides a comparative analysis of industry-leading implementations, including NVIDIA’s Hopper architecture [1, 2], Google’s Titan-backed TPU v5p [3, 4], and Microsoft’s Azure Boost Cerberus system [5, 6], alongside the cluster-scale trust challenges presented by ultra-large systems like xAI’s Colossus [7].
The article concludes that Silicon-Level Sovereignty is no longer an optional security feature but a foundational requirement for establishing the integrity, privacy, and multi-tenant isolation necessary for high-stakes AI workloads.
View details
VISTA: A Test-Time Self-Improving Video Generation Agent
Hootan Nakhost
Xuan Long Do
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (to appear) (2026)
Preview abstract
Despite rapid advances in text-to-video (T2V) synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. To address this, we introduce VISTA, a novel multi-agent system that autonomously refines prompts to improve video generation. VISTA operates in an iterative loop, first decomposing a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. To rigorously evaluate our proposed approach, we introduce MovieGen-Bench, a new benchmark of diverse single- and multi-scene video generation tasks. Experiments show that while prior methods yield inconsistent gains, VISTA consistently improves video quality, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 68% of comparisons.
View details
Preview abstract
Modern user interfaces are complex composites, with elements originating from various sources, such as the operating system, apps, a web browser, or websites. Many security and privacy models implicitly depend on users correctly identifying an element's source, a concept we term ''surface attribution.'' Through two large-scale vignette-based surveys (N=4,400 and N=3,057), we present the first empirical measurement of this ability.
We find that users struggle, correctly attributing UI source only 55% of the time on desktop and 53% on mobile. Familiarity and strong brand cues significantly improve accuracy, whereas UI positioning, a long-held security design concept especially for browsers, has minimal impact. Furthermore, simply adding a ''Security & Privacy'' brand cue to Android permission prompts failed to improve attribution. These findings demonstrate a fundamental gap in users' mental models, indicating that relying on them to distinguish trusted UI is a fragile security paradigm.
View details
Preview abstract
AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such migrations have relied on rule-based systems and human intervention. With the advent of powerful large language models (LLMs), AI-driven agentic frameworks offer a promising alternative—but their effectiveness remains underexplored. In this paper, we introduce FreshBrew, a novel benchmark for evaluating AI-based agentic frameworks on project-level Java migrations. We benchmark several such frameworks, powered by state-of-the-art LLMs, and compare their performance against established rule-based tools. Our evaluation of AI agents on this benchmark of 228 repositories shows that the top-performing model, Gemini 2.5 Flash, can successfully migrate 56.5% of projects to JDK 17. Our empirical analysis reveals novel insights into the critical strengths and limitations of current agentic approaches, offering actionable insights into their real-world applicability. By releasing FreshBrew publicly upon acceptance, we aim to facilitate rigorous, reproducible evaluation and catalyze progress in AI-driven codebase modernization.
View details
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Jihwan Jeong
The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026)
Preview abstract
LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development.
View details
Preview abstract
For many practical applications of quantum computing, the slowest and most costly steps involve coherently accessing classical data. We help address this challenge by applying mass production techniques, which can sometimes allow us to perform operations many times in parallel for a cost that is comparable to a single execution[1-3]. We combine existing mass-production results with modern approaches for loading classical data using ``quantum read-only memory.'' We show that quantum mass production techniques offer no benefit when we consider a cost model that focuses purely on the number of non-Clifford gates. However, analyzing the constant factors in a more nuanced cost model, we find that it may be possible to obtain a reduction in cost of an order or magnitude or more for a variety reasonably-sized fault-tolerant quantum algorithms. We present several applications of quantum mass-production techniques beyond naive parallelization, including a strategy for reducing the cost of serial calls to the same data loading step.
View details