A Brief Guide to Running ML Systems in Production

Carlos Villavieja
(2020)

Abstract

This article discusses principles and best practices for DevOps and SRE practitioners who deploy and operate ML systems. It draws on our experience running production services over the past 15 years and on discussions with Google engineers working on diverse ML systems. We use specific incidents to illustrate where ML-based systems did not behave as developers of traditional systems would expect, and we examine the outcomes in light of the recommended practices.