Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 11318 publications
    Preview abstract This disclosure describes systems and methods for a multi-agent framework that can automate and scale cognitive work. The framework can, for example, use a cognitive assembly line of specialized computational agents to perform tasks such as research and drafting. A beneficial component could be an adversarial review panel (ARP), which is a multi-agent review system where distinct agent personas critique a generated draft from varied perspectives. The structured feedback from the ARP can be used to automatically iterate on and refine the work product. This approach can improve the intellectual rigor of generated content and reduce the time required for production, which may allow human operators to focus on activities such as strategic oversight and final validation. View details
    Approximate vs Precise: An experiment in what impacts user choice when apps request location access
    Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA ’26), April 13–17, 2026, Barcelona, Spain (2026)
    Preview abstract User location data is highly sensitive, yet commonly requested by mobile apps for both core functionality and monetization. To improve user privacy, the major mobile platforms, Android and iOS, made changes so that when apps request precise location access, users can choose to share only their approximate location. However, the platforms have diverging interfaces: Android offers a side-by-side choice and iOS offers a corner toggle. This study evaluates which factors impact users’ choices when apps request location access via a randomized controlled experiment with 2579 US Android users. We tested the impact of app type, whether a reason for the request was provided, and the quality and content of the reason, including monetization. We do not find that the reasons have an effect. Instead, we find users’ choices are impacted by app type and user demographics. We find that when users are given a side-by-side choice to allow approximate versus precise location access, they make reasonable choices. Of users who allowed access, the vast majority (90.7%) chose precise for a rideshare app, whereas a majority (71.3%) chose approximate for a local news app. Concerningly, the majority also allowed location access to a wallpaper app, and older users were significantly more likely to allow apps precise location access. We conclude by discussing implications for app platforms and future work. View details
    Beyond Vector Similarity: Hierarchical Context-Aware Graph RAG vs Standard RAG in Enterprise Code Migration
    Suddhasatwa Bhaumik
    Nilesh Jaiswal
    Arjit Shukla
    Divya Malhotra
    Aniket Agrawal
    Saurabh Garg
    Suchit Puri
    Google Cloud India, Google, S. No, AP81, 83, N Main Rd, near Hard Rock Cafe, Koregaon Park Annexe, Mundhwa, Pune, Maharashtra 411036 (2026)
    Preview abstract As enterprises modernize legacy systems (e.g., monolithic Java architectures to Python microservices), Large Language Models (LLMs) have become instrumental in automated code translation. However, traditional vector-based Retrieval-Augmented Generation (Standard RAG) struggles with topological relationships, fetching isolated text chunks that frequently sever inheritance chains and lead to high compilation failure rates. This paper presents a comparative analysis between Standard RAG and a novel Hierarchical Context-Resident Graph (HCRG) methodology. Our pipeline utilizes tree-sitter for polyglot Abstract Syntax Tree (AST) extraction, mapping architectural edges into a Google Cloud Spanner Property Graph, and serializing this structure into a Gemini (on Vertex AI) Context Cache to enable topological, parent-first code translation. By shifting evaluation from naive text-overlap to a custom 7-metric framework measuring Software Engineering (SE) utility, empirical evaluations on the spring-petclinic-genai repository demonstrate significant structural improvements. Graph RAG decisively mitigates dependency loss, dropping the API hallucination rate from 56.4% to 16.2%. Furthermore, it improves Dependency Resolution Quality (DRQ) from 34.8% to 65.9% and enhances Parent-Child Consistency (PCC) from 26.7% to 45.5%. Interestingly, traditional lexical metrics fail to capture this divergence; both methodologies achieved an identical 91% average CodeBLEU score, effectively masking Standard RAG’s structural failures behind syntactically plausible but broken code. However, the results indicate that Graph RAG is not strictly superior across all dimensions. Providing the LLM with dense, global structural context introduces new vulnerabilities: Graph RAG suffers a severe degradation in Cyclomatic Complexity Consistency (dropping from Standard RAG’s 71.6% to 46.7%) due to defensive over-engineering by the LLM, alongside a slight drop in Docstring Preservation (67.0% down to 61.0%) caused by prompt attention dilution. Ultimately, this research validates that while Graph RAG trades an increase in code complexity for critical reductions in API hallucinations, it offers a substantially more viable and architecturally sound path for automated enterprise codebase modernisation. View details
    Preview abstract This whitepaper seeks to elucidate implications that the capabilities of developing quantum architectures have on blockchain vulnerabilities and mitigation strategies. First, we provide new resource estimates for breaking the 256-bit Elliptic Curve Discrete Logarithm Problem, the core of modern blockchain cryptography. We demonstrate that Shor's algorithm for this problem can execute with either <1200 logical qubits and <90 million Toffoli gates or <1450 logical qubits and <70 million Toffoli gates. In the interest of responsible disclosure, we use a zero-knowledge proof to validate these results without disclosing attack vectors. On superconducting architectures with 1e-3 physical error rates and planar connectivity, those circuits can execute in minutes using fewer than half a million physical qubits. We introduce a critical distinction between fast-clock (such as superconducting and photonic) and slow-clock (such as neutral atom and ion trap) architectures. Our analysis reveals that the first fast-clock CRQCs would enable on-spend attacks on public mempool transactions of some cryptocurrencies. We survey major cryptocurrency vulnerabilities through this lens, identifying systemic risks associated with advanced features in some blockchains such as smart contracts, Proof-of-Stake consensus, and Data Availability Sampling, as well as the enduring concern of abandoned assets. We argue that technical solutions would benefit from accompanying public policy and discuss various frameworks of digital salvage to regulate the recovery or destruction of dormant assets while preventing adversarial seizure. We also discuss implications for other digital assets and tokenization as well as challenges and successful examples of the ongoing transition to Post-Quantum Cryptography (PQC). Finally, we urge all vulnerable cryptocurrency communities to join the ongoing migration to PQC without delay. View details
    Preview abstract In modern Kubernetes environments, eBPF (Extended Berkeley Packet Filter) has become the de facto standard for high-performance dataplane enforcement. However, this architecture introduces a complex distributed state problem: the asynchronous synchronization between the Kubernetes control plane (Intent) and the kernel-space BPF maps (Reality). A critical failure mode, termed “Silent Divergence,” occurs when the control plane believes a network policy or identity is applied, but the underlying kernel state is missing or corrupted. In this “Gray Failure” state, standard observability tools, including logs, liveness probes, and agent status checks, report health while the network silently drops traffic. This paper introduces eBPF-Auditor, a specialized consistency verification framework. Unlike standard agents that rely on event-based reconciliation, eBPF-Auditor performs a periodic “Two-Way State Audit” that mathematically verifies the intersection of Kubernetes Intent and BPF Reality. We demonstrate through fault injection and benchmarks on 5,000 pods that this approach detects state drift with 100% accuracy and negligible, sub-millisecond overhead, making it a viable solution for high-frequency runtime verification in production hyperscale clusters. View details
    Preview abstract Biological neurons come in many shapes. High-fidelity generative modeling of their varied morphologies is challenging yet underexplored in neuroscience, and crucial for the subfield of connectomics. We introduce MoGen (Neuronal Morphology Generation), a flow matching model to generate high-resolution 3D point clouds of mouse cortex axon and dendrite fragments. This is enabled by an adaptation that injects local geometric context into a scalable latent transformer backbone, allowing for the generation of high-fidelity, realistic samples. To assess MoGen's generation quality, we propose a dedicated evaluation suite with interpretable geometric and topological features tailored to neuronal structures that we validate in a user study. MoGen's practical utility is showcased through controllable generation for visualization via smooth interpolation and a direct downstream application: we augment the training set of a shape plausibility classifier from a production connectomics neuron reconstruction pipeline with millions of generated samples, thereby improving classifier accuracy and reducing the number of remaining split and merge errors by 4.4%. We estimate this can reduce manual proofreading labor by over 157 person-years for reconstruction of a full mouse brain. View details
    Who Controls the Curriculum for AI? The Limits of Participatory Design for Educational AI
    Michael Madaio
    Learning Under Algorithmic Conditions, University of Minnesota Press (2026)
    Preview abstract Participatory design is a long-standing effort to shift control over technology design from technologists to users and communities impacted by technologies. For educational AI, this means involving students, families, teachers, and other stakeholders in shaping the design of AI systems. While promising, in this article, I situate the recent calls for participatory design of educational AI systems within a different historical tradition—that of contests over local control of educational curricula. I argue that approaches that attempt to steer the design and development of educational AI through participatory methods may inadvertently reproduce the history of political contestation of educational curricula, in ways that may privilege the most powerful communities, rather than those inequitably impacted. What might it look like to treat participatory AI design as a site for political contestation? How might these approaches avoid reproducing the same majoritarian tendencies that led to educational inequities in the first place? View details
    Preview abstract Artificial intelligence is rapidly evolving, marked by the emergence of Large Language Model (LLM) agents – systems capable of complex reasoning, planning, and interaction with digital and physical environments. These agents, powered by advancements in LLMs, demonstrate remarkable capabilities across diverse domains, including finance, healthcare, web navigation, software development, and daily task assistance. Unlike traditional AI systems, LLM agents can perceive their surroundings, formulate multi-step plans, utilize external tools and APIs, access memory or knowledge bases, and execute actions to achieve specified goals. This ability to act upon the world, however, introduces significant safety and security challenges. The safety paradigms developed for traditional LLMs, primarily focused on mitigating harmful textual outputs (e.g., toxicity, bias), are insufficient for safeguarding LLM agents. Agents interacting with dynamic environments and executing actions present a broader attack surface and new categories of risk. These include performing unsafe operations, violating privacy constraints through improper data handling or access control failures, deviating from user objectives (task misalignment), and susceptibility to novel manipulation techniques like indirect prompt injection and memory poisoning. Ensuring the trustworthy operation of these powerful agents is paramount, especially as they are integrated into high-stakes applications. To address this critical challenge, we introduce VeriGuard, a novel framework designed to enhance the safety and reliability of LLM agents by interactively verifying their policies and actions. VeriGuard integrates a verification module that intercepts code-based actions proposed by the agent. In the first step, VeriGuard generates and verifies the policies, which are rigorously checked against a set of predefined safety and security specifications. Then each action is verified to ensure that it aligns with the agent specification. This interactive verification loop ensures that the agent's behavior remains within safe operational bounds, effectively preventing the execution of harmful or unintended operations. By verifying each step, VeriGuard provides a robust safeguard, substantially improving the trustworthiness of LLM agents in complex, real-world environments. View details
    Preview abstract A growing body of qualitative research has identified contextual risk factors that elevate people’s chances of experiencing digital-safety attacks. However, the lack of quantitative data on the population-level distribution of these risk factors prevents policymakers and tech companies from developing targeted, evidence-based interventions to improve digital safety. To address this gap, we surveyed 5,001 adults in the United States to analyze: (1) the frequency of and relationship between digital-safety attacks (e.g., scams, harassment, account hacking), and (2) how these attacks align with 10 contextual risk factors. Nearly half of our respondents identify as resource constrained, which significantly correlates with a higher likelihood of experiencing four common attacks. We also present qualitative insights to expand our understanding of the factors beyond the existing literature (e.g., “prominence” included high-visibility roles in local communities). This study provides the first large-scale quantitative analysis correlating digital-safety attacks with contextual risk factors and demographics. View details
    Preview abstract Optimizing large language model (LLM) training and serving on large-scale distributed systems with hundreds or thousands of accelerators is a persistent challenge because of fast-evolving LLMs, the strong domain expertise required, and the differing optimization goals of different workloads. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming, or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce PROMPTS, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning. It automates the diagnosis of performance bottlenecks by synthesizing profiler data and leverages a knowledge base to propose optimized sharding configurations with detailed justifications. Across eight real-world production workloads, PROMPTS demonstrated remarkable efficiency and accuracy, delivering performance improvements of up to 434%. These workloads spanned diverse model architectures, hardware platforms, computational scales, and various stages of the machine learning lifecycle (pre-training, serving, and post-training). In every case, the configuration adopted by human engineers was identified within the agent's top three proposals from a single invocation. Furthermore, the agent's top-ranked recommendation was the one ultimately adopted in 87.5% of cases, showcasing its ability not only to find optimized solutions but also to prioritize them correctly. Our work establishes PROMPTS as a scalable, extensible, and explainable methodology for AI-assisted performance engineering in large-scale ML systems. View details
    Preview abstract We introduce ALPS (Activation-based Length Prediction for Scheduling), a method for predicting LLM generation length from prefill activations before any tokens are generated. Unlike existing approaches that require model fine-tuning or complex entropy-weighted pooling, ALPS uses a simple linear probe on the last-token activation at intermediate layers. We discover that generation length is encoded in prefill representations: a ridge regression probe achieves R-squared > 0.85 across three model families. Validation across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B demonstrates: (1) intermediate layers generally perform well, with some architectural variation; (2) simple last-token extraction outperforms complex pooling strategies; (3) activations improve substantially over surface-feature baselines (24 percentage points over input length plus lexical features). The best models achieve R-squared = 0.943 (Gemma), R-squared = 0.880 (Llama), and R-squared = 0.857 (Qwen) with MAE of 38-80 tokens. All test prompts terminated naturally (100% EOS), eliminating truncation confounds. While our evaluation uses 200 curated prompts—sufficient for demonstrating the phenomenon but requiring broader validation—cross-validation confirms generalization beyond training data. ALPS enables practical applications including budget-constrained inference, request scheduling, and resource allocation. The probe adds negligible overhead (~16KB direction vector, single dot product), making ALPS practical for production deployment. View details
    Preview abstract Large Language Models (LLMs) such as ChatGPT can infer personal attributes from seemingly innocuous text, raising privacy risks beyond memorized data leakage. While prior work has demonstrated these risks, little is known about how users estimate and respond to them. We conducted a survey with 240 U.S. participants who judged text snippets for inference risks, reported concern levels, and attempted rewrites to block inference. We compared their rewrites with those generated by ChatGPT and Rescriber, a state-of-the-art sanitization tool. Results show that participants struggled to anticipate inference, performing only slightly better than chance. User rewrites were effective in just 28% of cases, which is better than Rescriber but worse than ChatGPT. We examined our participants’ rewriting strategies and observed that while paraphrasing was the most common strategy, it was also the least effective; abstraction and adding ambiguity were more successful. Our work highlights the importance of inference-aware design in LLM interactions. View details
    Preview abstract The management of a hybrid workforce comprising human and autonomous computational agents may be challenged by the use of separate systems for human capital and software assets, which can create a governance gap. A system can provide a unified framework for managing a hybrid workforce. For example, the system may utilize a labor service mesh to analyze and route tasks to either a human intent tier or an agentic execution tier. A potential principle of the system is structural symmetry, where computational agents can be assigned digital identities and managed through a lifecycle process that may parallel human resource functions, such as onboarding, performance evaluation, and structured offboarding. This integrated approach can facilitate a unified system of record and governance model for an organization's intelligence capacity. View details
    Bi-level Hierarchical Neural Contextual Bandits for Online Recommendation
    Yunzhe Qi
    Yikun Ban
    Allan Stewart
    Chuanwei Ruan
    Jiachuan He
    Shishir Kumar Prasad
    Haixun Wang
    Jingrui He
    Transactions on Machine Learning Research (2026)
    Preview abstract Contextual bandit algorithms aim to identify the optimal choice among a set of candidate arms, based on their contextual information. Among others, the neural contextual bandit algorithms have demonstrated generally superior performance compared to traditional linear and kernel-based methods. Nevertheless, neural methods are not inherently suitable to handle a large number of candidate arms due to their high computational cost when performing neural exploration. Motivated by the widespread availability of arm category information (e.g., movie genres, retailer types), we formulate contextual bandits into a bi-level recommendation problem based on the accessible arm category information, and propose a novel neural bandit framework, named H2N-Bandit, which utilizes a bi-level hierarchical neural structure to mitigate the substantial computational cost found in conventional neural bandit methods. To demonstrate its effectiveness, we provide the regret bound for H2N-Bandit under the over-parameterized neural bandit settings. Furthermore, to illustrate its efficiency, we conduct extensive experiments on multiple real-world public data sets with various specifications, showing that H2N-Bandit can significantly reduce the computational cost over existing non-linear methods while achieving better or comparable performances against state-of-the-art baselines. View details
    A Framework for Interactive Machine Learning and Enhanced Conversational Systems
    Jerry Young
    Richard Abisla
    Sanjay Batra
    Mikki Phan
    Nature, Springer-Verlag (2026)
    Preview abstract Conversational systems are increasingly prevalent, yet current versions often fail to support the full range of human speech, including variations in speed, rhythm, syntax, grammar, articulation, and resonance. This reduces their utility for individuals with dysarthria, apraxia, dysphonia, and other language and speech-related disabilities. Building on research that emphasizes the need for specialized datasets and model training tools, our study uses a scaffolded approach to understand the ideal model training and voice recording process. Our findings highlight two distinct user flows for improving model training and provide six guidelines for future conversational system-related co-design frameworks. This study offers important insights on creating more effective conversational systems by emphasizing the need to integrate interactive machine learning into training strategies. View details