Google Research

A Brief Guide to Running ML Systems in Production



This article discusses principles and best practices for DevOps and SRE practitioners who deploy and operate ML systems. It draws on our experience running production services over the past 15 years and on discussions with Google engineers working on diverse ML systems. We use specific incidents to illustrate where ML-based systems did not behave as developers of traditional systems would expect, and we examine the outcomes in light of the recommended practices.
