Publications

Our teams aspire to make discoveries that impact everyone, and core to our approach is sharing our research and tools to fuel progress in the field.

1 - 15 of 10129 publications
    Motivated by recent advances in large language models for NLP, we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of datasets matches the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time series dataset, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
    Generalized Power Attacks against Crypto Hardware using Long-Range Deep Learning
    Karel Král
    Marina Zhang
    Transactions on Cryptographic Hardware and Embedded Systems (TCHES), IACR (2024)
    To make cryptographic processors more resilient against side-channel attacks, engineers have developed various countermeasures. However, the effectiveness of these countermeasures is often uncertain, as it depends on the complex interplay between software and hardware. Assessing a countermeasure’s effectiveness using profiling techniques or machine learning so far requires significant expertise and effort to be adapted to new targets, which makes those assessments expensive. We argue that including cost-effective automated attacks will help chip design teams to quickly evaluate their countermeasures during the development phase, paving the way to more secure chips. In this paper, we lay the foundations toward such an automated system by proposing GPAM, the first deep-learning system for power side-channel analysis that generalizes across multiple cryptographic algorithms, implementations, and side-channel countermeasures without the need for manual tuning or trace preprocessing. We demonstrate GPAM’s capability by successfully attacking four hardened hardware-accelerated elliptic-curve digital-signature implementations. We showcase GPAM’s ability to generalize across multiple algorithms by attacking a protected AES implementation and achieving comparable performance to state-of-the-art attacks, but without manual trace curation and within a limited budget. We release our data and models as an open-source contribution to allow the community to independently replicate our results and build on them.
    PRewrite: Prompt Rewriting with Reinforcement Learning
    Qiaozhu Mei
    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024) (to appear)
    Prompt engineering is critical for the development of LLM-based applications. However, it is usually done manually in a "trial and error" fashion that can be time-consuming, ineffective, and sub-optimal. Even for the prompts which seemingly work well, there is always a lingering question: can the prompts be made better with further modifications? To address these problems, we investigate automated prompt engineering in this paper. Specifically, we propose PRewrite, an automated method to rewrite an under-optimized prompt to a more effective prompt. We instantiate the prompt rewriter using an LLM. The rewriter LLM is trained using reinforcement learning to optimize the performance on a given downstream task. We conduct experiments on diverse benchmark datasets, which demonstrate the effectiveness of PRewrite.
    DySLIM: Dynamics Stable Learning by Invariant Measure for Chaotic Systems
    Yair Schiff
    Jeff Parker
    Volodymyr Kuleshov
    International Conference on Machine Learning (ICML) (2024)
    Learning dynamics from dissipative chaotic systems is notoriously difficult due to their inherent instability, as formalized by their positive Lyapunov exponents, which exponentially amplify errors in the learned dynamics. However, many of these systems exhibit ergodicity and an attractor: a compact and highly complex manifold, to which trajectories converge in finite-time, that supports an invariant measure, i.e., a probability distribution that is invariant under the action of the dynamics, which dictates the long-term statistical behavior of the system. In this work, we leverage this structure to propose a new framework that targets learning the invariant measure as well as the dynamics, in contrast with typical methods that only target the misfit between trajectories, which often leads to divergence as the trajectories’ length increases. We use our framework to propose a tractable and sample efficient objective that can be used with any existing learning objectives. Our Dynamics Stable Learning by Invariant Measure (DySLIM) objective enables model training that achieves better point-wise tracking and long-term statistical accuracy relative to other learning objectives. By targeting the distribution with a scalable regularization term, we hope that this approach can be extended to more complex systems exhibiting slowly-variant distributions, such as weather and climate models. Code to reproduce our experiments is available here: https://github.com/google-research/swirl-dynamics/tree/main/swirl_dynamics/projects/ergodic.
    Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
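The abstract above does not spell out how the contextual bias is estimated, so the following is only a rough sketch of one plausible reading: treat the per-class mean of the log-probabilities over a batch as the bias and subtract it from every sample. The function names (`batch_calibrate`, `predict`) and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def batch_calibrate(log_probs):
    """Subtract each class's batch-mean log-probability from every row.

    Assumption: the contextual bias is approximated by the per-class mean
    over the batch. This is an illustrative choice, not the paper's code.
    """
    num_samples = len(log_probs)
    num_classes = len(log_probs[0])
    bias = [
        sum(row[c] for row in log_probs) / num_samples
        for c in range(num_classes)
    ]
    return [[row[c] - bias[c] for c in range(num_classes)] for row in log_probs]


def predict(scores):
    """Return the index of the highest-scoring class for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy batch from a hypothetical model that systematically favours class 0.
batch = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.7), math.log(0.3)],
    [math.log(0.6), math.log(0.4)],
]

raw_predictions = predict(batch)
calibrated_predictions = predict(batch_calibrate(batch))
print(raw_predictions)         # [0, 0, 0]
print(calibrated_predictions)  # [0, 1, 1]
```

In this toy batch the raw scores always pick class 0, while the calibrated scores split the batch, which is the kind of bias correction the abstract describes. The sketch needs no labels and no extra model calls, in line with the "zero-shot, inference-only" description.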
    On the Benefits of Traffic “Reprofiling” The Multiple Hops Case – Part I
    Henry Sariowan
    Jiaming Qiu
    Jiayi Song
    Roch Guerin
    IEEE/ACM Transactions on Networking (2024)
    This paper considers networks where user traffic is regulated through deterministic traffic profiles, e.g. token buckets, and requires guaranteed hard delay bounds. The network’s goal is to minimize the resources it needs to meet those bounds. The paper explores how reprofiling, i.e. proactively modifying how user traffic enters the network, can be of benefit. Reprofiling produces “smoother” flows but introduces an up-front access delay that forces tighter network delays. The paper explores this trade-off and demonstrates that, unlike what holds in the single-hop case, reprofiling can be of benefit even when “optimal” sophisticated schedulers are available at each hop.
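The trade-off described above can be illustrated with a small sketch. It assumes a textbook token-bucket profile (rate, burst) and the simplest form of reprofiling, where only the burst is reduced and the rate is unchanged; the excess burst then has to wait at the network edge and drains at the token rate. The class and function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBucket:
    """Deterministic traffic profile: long-term rate and burst size."""

    rate: float   # tokens (e.g. bytes) replenished per second
    burst: float  # bucket depth, i.e. the largest instantaneous burst

    def max_traffic(self, interval: float) -> float:
        """Upper bound on traffic the profile admits in any window of this length."""
        return self.burst + self.rate * interval


def reprofiling_delay(original: TokenBucket, reprofiled: TokenBucket) -> float:
    """Worst-case access delay from shaping `original` down to `reprofiled`.

    Assumption: both profiles share the same rate and the reprofiled burst
    is no larger than the original, so the excess burst drains at `rate`.
    """
    if reprofiled.rate != original.rate:
        raise ValueError("this sketch only handles equal rates")
    if reprofiled.burst > original.burst:
        raise ValueError("reprofiling should not enlarge the burst")
    return (original.burst - reprofiled.burst) / original.rate


flow = TokenBucket(rate=1000.0, burst=5000.0)
smoother = TokenBucket(rate=1000.0, burst=2000.0)
delay = reprofiling_delay(flow, smoother)
print(delay)  # 3.0
```

Here the smoother flow presents a smaller burst to the network, but up to 3 seconds of the end-to-end delay budget are spent before the traffic enters it, which is why the remaining per-hop delays must be tighter.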
    Factual and Personalized Recommendation Language Modeling with Reinforcement Learning
    Jihwan Jeong
    Mohammad Ghavamzadeh
    Proceedings of the First Conference on Language Modeling (COLM-24), Philadelphia (2024)
    Recommender systems (RSs) play a central role in connecting users to products, content and services by matching candidate items to users based on their preferences. While existing RSs often rely on implicit user feedback on recommended items (e.g., clicks, watches, ratings), conversational recommender systems interact with users to provide tailored recommendations in natural language. In this work, we aim to develop a recommender language model (LM) that is capable of generating compelling endorsement presentations of relevant items to users, to better explain the details of the items, to connect the items with users’ preferences, and to enhance the likelihood of users accepting recommendations. Specifically, such an LLM-based recommender can understand users’ preferences from users’ RS embeddings summarizing feedback history, and output corresponding responses that not only are factually-grounded, but also explain whether these items satisfy users’ preferences in a convincing manner. The pivotal question is how one can gauge the performance of such an LLM recommender. Equipped with a joint reward function that measures factual consistency, convincingness, and personalization, not only can we evaluate the efficacies of different recommender LMs, but we can also utilize this metric as a form of AI feedback to fine-tune our LLM agent via reinforcement learning (RL). Building upon the MovieLens movie recommendation benchmark, we developed a novel conversational recommender delivering personalized movie narratives to users. This work lays the groundwork for recommendation systems that prioritize individualized user experiences without compromising on transparency and integrity.
    HESS Opinions: Never train a Long Short-Term Memory (LSTM) network on a single basin
    Frederik Kratzert
    Martin Gauch
    Daniel Klotz
    Hydrology and Earth System Sciences (2024)
    Machine learning (ML) has played an increasing role in the hydrological sciences. In particular, Long Short-Term Memory (LSTM) networks are popular for rainfall–runoff modeling. A large majority of studies that use this type of model do not follow best practices, and there is one mistake in particular that is common: training deep learning models on small, homogeneous data sets, typically data from only a single hydrological basin. In this position paper, we show that LSTM rainfall–runoff models are best when trained with data from a large number of basins.
    Bridging the Preference Gap between Retrievers and LLMs
    Zixuan Ke
    Qiaozhu Mei
    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024) (to appear)
    Large Language Models (LLMs) have demonstrated superior results across a wide range of tasks, and Retrieval-augmented Generation (RAG) is an effective way to enhance the performance by locating relevant information and placing it into the context window of the LLM. However, the relationship between retrievers and LLMs in RAG is still under-investigated. Most existing work treats the retriever and the LLM as independent components and leaves a gap between retrieving human-"friendly" information and assembling an LLM-"friendly" context. In this work, we examine a novel bridge mechanism. We validate the ranking and selection assumptions of retrievers in the context of RAG and propose a framework that chains together supervised and reinforcement learning to train a bridge model that optimizes the connection between the retriever and the LLM. Empirical results demonstrate the effectiveness of our method in both question-answering and personalized generation tasks.
    Understanding metric-related pitfalls in image analysis validation
    Annika Reinke
    Lena Maier-Hein
    Paul Jager
    Shravya Shetty
    Understanding Metrics Workgroup
    Nature Methods (2024)
    Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
    In this paper we study users' opinions about the privacy of their mobile health apps. We look at what they write in app reviews in the 'Health & Fitness' category on the Google Play store. We identified 2832 apps in this category (based on 1K minimum installs). Using NLP/LLM analyses, we find that 76% of these apps have at least some privacy reviews. In total this yields over 164,000 reviews about privacy, from over 150 countries and in 25 languages. Our analysis identifies top themes and offers an approximation of how widespread these issues are around the world. We show that the top 4 themes (Data Sharing and Exposure, Permission Requests, Location Tracking and Data Collection) are issues of concern in over 70 countries. Our automatically generated thematic summaries reveal interesting aspects that deserve further research around user suspicions (unneeded data collection), user requests (more fine-grained control over data collection and data access), as well as user behavior (uninstalling apps).
    50 Shades of Support: A Device-Centric Analysis of Android Security Updates
    Abbas Acar
    Esteban Luques
    Harun Oz
    Ahmet Aris
    Selcuk Uluagac
    Network and Distributed System Security (NDSS) Symposium (2024)
    Android is by far the most popular OS with over three billion active mobile devices. As in any software, uncovering vulnerabilities on Android devices and applying timely patches are both critical. Android Open Source Project (AOSP) has initiated efforts to improve the traceability of security updates through Security Patch Levels (SPLs) assigned to devices. While this initiative provided better traceability for the vulnerabilities, it has not entirely resolved the issues related to the timeliness and availability of security updates for end users. Recent studies on Android security updates have focused on the issue of delay during the security update roll-out, largely attributing this to factors related to fragmentation. However, these studies fail to capture the entire Android ecosystem as they primarily examine flagship devices or do not paint a comprehensive picture of the Android devices’ lifecycle due to the datasets spanning over a short timeframe. To address this gap in the literature, we utilize a device-centric approach to analyze the security update behavior of Android devices. Our approach aims to understand the security update distribution behavior of OEMs (e.g., Samsung) by using a representative set of devices from each OEM and characterize the complete lifecycle of an average Android device. We obtained 367K official security update records from public sources, spanning from 2014 to 2023. Our dataset contains 599 unique devices from four major OEMs that are used in 97 countries and are associated with 109 carriers. We identify significant differences in the roll-out of security updates across different OEMs, device models/types, and geographical regions across the world. Our findings show that the reasons for the delay in the roll-out of security updates are not limited to fragmentation but also involve OEM-specific factors. Our analysis also uncovers certain key issues that can be readily addressed as well as exemplary practices that can be immediately adopted by OEMs in practice.
    On the Benefits of Traffic “Reprofiling” The Single Hop Case
    Henry Sariowan
    Jiaming Qiu
    Jiayi Song
    Roch Guerin
    IEEE/ACM Transactions on Networking (2024)
    Datacenters have become a significant source of traffic, much of which is carried over private networks. The operators of those networks commonly have access to detailed traffic profiles and performance goals, which they seek to meet as efficiently as possible. Of interest are solutions that guarantee latency while minimizing network bandwidth. The paper explores a basic building block towards realizing such solutions, namely, a single hop configuration. The main results are in the form of optimal solutions for meeting local deadlines under schedulers of varying complexity and therefore cost. The results demonstrate how judiciously modifying flows’ traffic profiles, i.e., reprofiling them, can help simple schedulers reduce the bandwidth they require, often performing nearly as well as more complex ones.
    Using an LLM to Help With Code Understanding
    Daye Nam
    Vincent Hellendoorn
    Bogdan Vasilescu
    Brad A. Myers
    ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (2024)
    Understanding code is challenging, especially when working in new and complex development environments. Code comments and documentation can help, but are typically scarce or hard to navigate. Large language models (LLMs) are revolutionizing the process of writing code. Can they do the same for helping understand it? In this study, we provide a first investigation of an LLM-based conversational UI built directly in the IDE that is geared towards code understanding. Our IDE plugin queries OpenAI's GPT-3.5-turbo model with four high-level requests without the user having to write explicit prompts: to explain a highlighted section of code, provide details of API calls used in the code, explain key domain-specific terms, and provide usage examples for an API. The plugin also allows for open-ended prompts, which are automatically contextualized to the LLM with the program being edited. We evaluate this system in a user study with 32 participants, which confirms that using our plugin can aid task completion more than web search. We additionally provide a thorough analysis of the ways developers use, and perceive the usefulness of, our system, among others finding that the usage and benefits differ between students and professionals. We conclude that in-IDE prompt-less interaction with LLMs is a promising future direction for tool builders.
    Geographical accessibility to emergency obstetric care in urban Nigeria using closer-to-reality travel time estimates
    Aduragbemi Banke-Thomas
    Kerry L. M. Wong
    Tope Olubodun
    Peter M. Macharia
    Narayanan Sundararajan
    Yash Shah
    Mansi Kansal
    Swapnil Vispute
    Olakunmi Ogunyemi
    Uchenna Gwacham-Anisiobi
    Jia Wang
    Ibukun-Oluwa Omolade Abejirinde
    Prestige Tatenda Makanga
    Ngozi Azodoh
    Charles Nzelu, PhD
    Charlotte Stanton
    Bosede B. Afolabi
    Lenka Beňová
    Lancet Global Health (2024)
    Background: Better accessibility of emergency obstetric care (CEmOC) facilities can significantly reduce maternal and perinatal deaths. However, pregnant women living in urban settings face additional complex challenges travelling to facilities. We estimated geographical accessibility and coverage to the nearest, second nearest, and third nearest public and private CEmOC facilities in the 15 largest Nigerian cities. Methods: We mapped city boundaries, verified and geocoded functional CEmOC facilities, and assembled population distribution for women of childbearing age (WoCBA). We used Google Maps Platform’s internal Directions Application Programming Interface (API) to derive driving times to public, private, or either facility-type. Median travel time (MTT) and percentage of WoCBA able to reach care were summarised for eight traffic scenarios (peak and non-peak hours on weekdays and weekends) by city and within-city (wards) under different travel time thresholds (<15, <30, <60 min). Findings: City-level MTT to the nearest CEmOC facility ranged from 18 min (Maiduguri) to 46 min (Kaduna). Within cities, MTT varied by location, with informal settlements and peripheral areas being the worst off. The percentages of WoCBA within 60 min of their nearest public CEmOC were nearly universal, whilst the percentages of WoCBA within 30 min of their nearest public CEmOC ranged from 33% in Aba to over 95% in Ilorin and Maiduguri. During peak traffic times, the median number of public CEmOC facilities reachable by WoCBA under 30 min was zero in eight of 15 cities. Interpretation: This approach provides more context-specific, finer, and policy-relevant evidence to support improving CEmOC service accessibility in urban Africa.