Publications
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.
Sort By
1 - 15 of 11197 publications
Preview abstract
Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
View details
Preview abstract
Semantic data models express high-level business concepts and metrics, capturing the business logic needed to query a database correctly. Most data modeling solutions are built as layers above SQL query engines, with bespoke query languages or APIs. The layered approach means that semantic models can’t be used directly in SQL queries. This paper focuses on an open problem in this space – can we define semantic models in SQL, and make them naturally queryable in SQL?
In parallel, graph query is becoming increasingly popular, including in SQL. SQL/PGQ extends SQL with an embedded subset of the GQL graph query language, adding property graph views and making graph traversal queries easy.
We explore a surprising connection: semantic data models are graphs, and defining graphs is a data modeling problem. In both domains, users start by defining a graph model, and need query language support to easily traverse edges in the graph, which means doing joins in the underlying data.
We propose some useful SQL extensions that make it easier to use higher-level data model abstractions in queries. Users can define a “semantic data graph” view of their data, encapsulating the complex business logic required to query the underlying tables correctly. Then they can query that semantic graph model easily with SQL.
Our SQL extensions are useful independently, simplifying many queries – particularly, queries with joins. We make declared foreign key relationships usable for joins at query time – a feature that seems obvious but is notably missing in standard SQL.
In combination, these extensions provide a practical approach to extend SQL incrementally, bringing semantic modeling and graph query together with the relational model and SQL.
View details
Preview abstract
Deep-learning methods have boosted the analytical power of Raman spectroscopy, yet they still require large, task-specific, labeled datasets and often fail to transfer across application domains. The study explores pre-trained encoders as a solution. Pre-trained encoders have significantly impacted Natural Language Processing and Computer Vision with their ability to learn transferable representations that can be applied to a variety of datasets, significantly reducing the amount of time and data required to create capable models. The following work puts forward a new approach that applies these benefits to Raman Spectroscopy. The proposed approach, RSPTE (Raman Spectroscopy Pre-Trained Encoder), is designed to learn generalizable spectral representations without labels. RSPTE employs a novel domain adaptation strategy using unsupervised Barlow Twins decorrelation objectives to learn fundamental spectral patterns from multi-domain Raman Spectroscopy datasets containing samples from medicine, biology, and mineralogy. Transferability is demonstrated through evaluation on several models created by fine-tuning RSPTE for different application domains: Medicine (detection of Melanoma and COVID), Biology (Pathogen Identification), and Agriculture. As an example, using only 20% of the dataset, models trained with RSPTE achieve accuracies ranging 50%–86% (depending on the dataset used) while without RSPTE the range is 9%–57%. Using the full dataset, accuracies with RSPTE range 81%–97%, and without pretraining 51%–97%. Current methods and state-of-the-art models in Raman Spectroscopy are compared to RSPTE for context, and RSPTE exhibits competitive results, especially with less data as well. These results provide evidence that the proposed RSPTE model can effectively learn and transfer generalizable spectral features across different domains, achieving accurate results with less data in less time (both data collection time and training time).
View details
Preview abstract
We introduce AASE (Activation-based AI Safety Enforcement), a framework for post-perception safety monitoring in large language models. Unlike pre-perception approaches that analyze input or output text, AASE monitors the model's internal activation patterns—what the model "understands" rather than what text it processes or generates—enabling detection of safety-relevant states before harmful outputs are produced. The framework comprises three techniques: Activation Fingerprinting (AF) for harmful content detection, Agent Action Gating (AAG) for prompt injection defense, and Activation Policy Compliance (APC) for enterprise policy enforcement. We introduce paired contrastive training to isolate safety-relevant signals from confounding factors such as topic and style, addressing signal entanglement in polysemantic activations. Validation across 7 models from 3 architecture families shows strong class separation: Gemma-2-9B achieves AUC 1.00 with 7.2σ separation across all probes; AAG achieves AUC ≥0.88 across all models on the InjecAgent benchmark; APC achieves 0.97-1.00 AUC across three enterprise policies. Model size correlates with probe quality—Gemma-2-9B (7.2σ separation) outperforms Gemma-2-2B (4.3σ). All techniques survive INT4 quantization with minimal separation degradation. AASE is 9× faster than Llama Guard 3 (33ms vs 306ms) with higher TPR (88% vs 50%) at a tunable threshold that trades FPR for detection sensitivity, adding only 0.002ms probe overhead to existing inference.
View details
Usability Hasn’t Peaked: Exploring How Expressive Design Overcomes the Usability Plateau
Alyssa Sheehan
Bianca Gallardo
Ying Wang
Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26), April 13–17, 2026, Barcelona, Spain (2026)
Preview abstract
Critics have argued that mobile usability has largely been optimized, and that only incremental gains are possible. We set out to explore if the newest generation of design systems, which promote greater flexibility and a return to design basics, could produce substantially more usable designs while maintaining or increasing aesthetic judgments. Through a study with 48 diverse participants completing tasks in 10 different applications, we found that in designs created following Material 3 Expressive guidelines, users fixated on the correct screen element for a task 33% faster, completed tasks 20% faster, and rated experiences more positively compared to versions designed using the previous Material design system. These improvements in performance and aesthetic ratings challenge the premise of a usability plateau and show that mobile usability has not peaked. We illustrate specific opportunities to make mobile experiences more usable by returning to design fundamentals while highlighting risks of added flexibility.
View details
Agentic AI Infrastructure in Practice: Learn These Key Hurdles to Deploy Production AI Agents Efficiently
https://swisscognitive.ch/ (2026)
Preview abstract
The emergence of Agentic AI—autonomous systems capable of reasoning, decision-making, and multi-step execution—represents a paradigm shift in enterprise technology. Moving beyond simple generative tasks, these agents offer the potential to solve long-standing industry pain points, with over 90% of enterprises planning integration within the next three years. However, the transition from successful proof-of-concept (PoC) to a resilient, production-grade system presents significant hurdles.
This article categorizes these challenges into three primary domains:
Technical and Engineering Hurdles: Issues such as "entangled workflows" that complicate debugging, the struggle to maintain output quality and mitigate hallucinations, and the unpredictability caused by shifting underlying models or data sources.
People, Process, and Ecosystem Hurdles: The high operational costs and unclear ROI of large models, the necessity of a new "Agent Ops" skillset, the complexity of integrating agents with disparate enterprise systems, and a rapidly evolving regulatory landscape.
The Pace of Change and Security risks: The technical debt incurred by shifting software frameworks and the expanded attack surface created by autonomous agents.
The article concludes that successful deployment requires a shift from informal "vibe-testing" to rigorous engineering discipline. By adopting code-first frameworks, establishing robust evaluation metrics (KPIs), and prioritizing functional deployment over theoretical optimization, organizations can effectively manage the lifecycle of Agentic AI and realize its transformative business value.
View details
Neural general circulation models for modeling precipitation
Stephan Hoyer
Dmitrii Kochkov
Janni Yuval
Ian Langmore
Science Advances (2026)
Preview abstract
Climate models struggle to accurately simulate precipitation, particularly extremes and the diurnal cycle. While hybrid models combining machine learning and physics have emerged with the premise of improving precipitation simulations, none have proven sufficiently skillful or stable enough to outperform existing models in simulating precipitation.
Here, we present the first hybrid model that is trained directly on precipitation observations. The model runs at 2.8 degrees resolution and is built on the differentiable NeuralGCM framework. This model is stable for decadal simulations and demonstrates significant improvements over existing GCMs, ERA5 reanalysis, and a Global Cloud-Resolving Model in simulating precipitation. Our approach yields reduced biases, a more realistic precipitation distribution, improved representation of extremes, and a more accurate diurnal cycle.
Furthermore, it outperforms the ECMWF ensemble for mid-range weather forecasting.
This advance paves the way for more reliable simulations of current climate and for the ability to fully utilize the abundance of existing observations to further improve GCMs.
View details
Preview abstract
How many T gates are needed to approximate an arbitrary n-qubit quantum state to within
a given precision ϵ? Improving prior work of Low, Kliuchnikov and Schaeffer, we show that the
optimal asymptotic scaling is Θ(sqrt{2^n log(1/ε)} + log(1/ε)) if we allow an unlimited number of ancilla qubits. We also show that this is the optimal T-count for implementing an arbitrary
diagonal n-qubit unitary to within error ϵ. We describe an application to batched synthesis of
single-qubit unitaries: we can approximate a tensor product of m = O(log log(1/ϵ)) arbitrary
single-qubit unitaries to within error ϵ with the same asymptotic T-count as is required to
approximate just one single-qubit unitary.
View details
An Empirical Study of Tablet Ergonomics: The Interplay of Temperature, Orientation, and Use Behaviors
Carmen Van Ommen
Mikki Phan
Arun Raghupathy
Daniel Huynh
Barbara Chaparro
Ergonomics in Design: The Quarterly of Human Factors Applications (2026)
Preview abstract
To balance computational performance with thermal comfort, this study explores a consolidated hotspot architecture at the top center of a tablet. We tested hotspot (39°C, 43°C, 45°C, 47°C) and ambient temperatures (25°C, 35°C) with 60 participants, measuring perception, action likelihood, and expectation. The hotspot was observed away from high contact areas, with 43°C identified as the threshold for significant discomfort. Discomfort increased with portrait mode use and higher device and ambient temperatures, while active use duration influenced acceptability. The findings underscore the importance of thermal mapping and contextual sensing, with direct applications for software throttling thresholds of coated aluminum enclosures.
View details
The Ontic-Epistemic Distinction: Implications for General Intelligence
Master's Thesis (2026) (to appear)
Preview abstract
The current pursuit of robust machine intelligence is largely predicated on a substrate independent, computational functionalist view of cognition, where sufficiently complex computational processing is expected to eventually yield generalized reasoning. This paper explores the ontological distinctions between these computational frameworks and biological cognition, specifically how these differences impact the capacity for semantic understanding. By analyzing phenomena such as the "reversal curse" where models fail to generalize the symmetry in identity relations (A=B implies B=A), and performance on novel reasoning benchmarks (e.g., ARC-AGI), this paper examines whether current model limitations are transient artifacts of scale or indicative of a distinct architectural category. Integrating Stevan Harnad’s “symbol grounding problem” with Evan Thompson’s biological model of “intrinsic normativity,” I investigate whether robust general intelligence might require sense-making: a process distinct from information processing, whereby an agent’s internal states are causally coupled with its environment via survival or system-wide stakes which grounds symbols in meaning. Current Large Language Models (LLMs) appear to lack this intrinsic normativity, and consequently may operate primarily as epistemic instruments rather than ontic agents. By introducing the concept of “ontic grounding”, this paper presents a potential framework for distinguishing between the simulation of reasoning and true understanding, which could have implications for AI safety and governance.
View details
Phoenix: Rowhammer Attacks on DDR5 with Self-Correcting Synchronization
Preview
Michele Marazzi
Kaveh Razavi
Salman Qazi
Diego Meyer
Patrick Jattke
IEEE Security & Privacy (S&P) (2026)
Preview abstract
Source-to-source compilers may perform inefficiently by executing transpilation passes on scripts that do not contain the specific language features a pass is designed to transform, potentially leading to redundant processing. A compiler can analyze a script to generate a per-script feature map, for example, by identifying language features in its abstract syntax tree (AST). Before executing a transpilation pass, the compiler can check this map and may bypass the pass for that script if the specific feature targeted by the pass is not present. This feature map can also be dynamically updated throughout the compilation process as other passes transform the code. This method of conditional pass execution based on content-aware analysis may reduce redundant AST traversals, which could decrease overall compilation time and computational resource consumption.
View details
"It didn’t feel right but I needed a job so desperately": Understanding People's Emotions & Help Needs During Financial Scams
G. Jake Chanenson
Tara Matthews
Jessica McClearn
Sarah Meiklejohn
Mia Hassoun
2026
Preview abstract
Online financial scams represent a long-standing and serious threat for which people seek help. We present a study to understand people’s in situ motivations for engaging with scams and the help needs they express before, during, and after encountering a scam. We identify the main emotions scammers exploited (e.g., fear, hope) and characterize how they did so. We examine factors—such as financial insecurity and legal precarity—which elevate people’s risk of engaging with specific scams and experiencing harm. We indicate when people sought help and describe their help-seeking needs and emotions at different stages of the scam. We discuss how these needs could be met through the design of contextually-specific prevention, diagnostic, mitigation, and recovery interventions.
View details
The Perfection Paradox: From Architect to Curator in AI-Assisted API Design
JJ Geewax
David R Karger
Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26), ACM, Barcelona, Spain, TBD
Preview abstract
Enterprise API design is often bottlenecked by the tension between rapid feature delivery and the rigorous maintenance of usability standards. We present an industrial case study evaluating an AI-assisted design workflow trained on API Improvement Proposals(AIPs). Through a controlled study with 16 industry experts, we compared AI-generated API specifications against human-authored ones. While quantitative results indicated AI superiority in 10 of 11 usability dimensions and an 87% reduction in authoring time, qualitative analysis revealed a paradox: experts frequently misidentified AI work as human (19% accuracy) yet described the designs as unsettlingly “perfect.” We characterize this as a “Perfection Paradox”—where hyper-consistency signals a lack of pragmatic human judgment. We discuss the implications of this perfection paradox, proposing a shift in the human designer’s role from the “drafter” of specifications to the “curator” of AI-generated patterns.
View details
Preview abstract
Enterprise service centers, particularly in domains like People Operations, are critical hubs of organizational knowledge work. They face a persistent difficulty in disseminating the tacit, case-specific expertise of senior agents, which can lead to inconsistent service and slower onboarding for new hires. While existing Knowledge Management (KM) and Case-Based Reasoning (CBR) systems have improved the retrieval of historically similar cases, they inadvertently shift the cognitive burden of synthesizing this information to the time-constrained agent. This paper introduces the Dynamic Case Precedent (DCP) architecture, a novel socio-technical framework designed to address this gap. The DCP architecture moves beyond simple precedent recommendation to automated precedent synthesis. It achieves this by integrating a semantic retrieval model with the large-context reasoning capabilities of a generative Large Language Model (LLM). We propose a three-pillar framework—(1) Contextual Similarity Indexing, (2) Generative Insight Synthesis, and (3) Human-in-the-Loop Refinement. By analyzing multiple relevant historical cases to generate a concise summary of resolution patterns, the DCP architecture aims to reduce agent cognitive load, accelerate proficiency, and improve service consistency. This conceptual framework offers a new model for human-AI collaboration, framing the AI not as a mere information tool, but as an active partner in sensemaking.
View details