
Chapter 1B "Data Management Principles" _Reliable Machine Learning: Applying SRE Principles to ML in Production_

Cathy Chen
Kranti Parisa
Niall Richard Murphy
Todd Underwood
Reliable Machine Learning: Applying SRE Principles to ML in Production, O'Reilly (2022)

Abstract

Machine learning is rapidly becoming a vital tool for many organizations today. It’s used to increase revenue, optimise decision making, understand (and influence) customer behaviour, and solve problems across a very wide set of domains, in some cases at performance levels significantly superior to human ones. Machine learning touches billions of people multiple times a day. Yet, industry-wide, the state of how organizations implement ML is, frankly, very poor. There isn’t even a framework describing how best to do it; most people are just making it up as they go along. The consequences are many, including poorer-quality outcomes for both user and organization, lost revenue opportunities, and legal exposure. Even worse, data, which is key to the success of ML, has become both a vitally important asset and a critical liability, and organizations have not internalized how to manage it. For all these reasons and more, the industry needs a framework: a way of understanding the issues around running actual, reliable, production-quality ML systems, and a collection of practical and conceptual approaches to “reliable ML for everyone”. It is natural to reach for the conceptual framework provided by the Site Reliability Engineering (SRE) discipline to provide that understanding. Applying SRE approaches to production systems helps them to be reliable, to scale well, and to be well monitored, well managed, and useful for customers; analogously, SRE approaches for machine learning (including the Dickerson hierarchy, SLOs and SLIs, effective data handling, and so on) help to accomplish the same ends. Yet SRE approaches are not the whole story. We provide guidance for model developers, data scientists, and business owners to perform the nuts and bolts of their day-to-day jobs while also keeping the bigger picture in mind.
In other words, this book applies an SRE mindset to machine learning and shows how to run an effective, efficient, and reliable ML system, whether you are a small startup or a planet-spanning megacorp. It describes what to do whether you are starting from a completely blank slate or already operate at significant scale. It covers operational approaches, data-centric ways of thinking about production systems, and ethical guidelines, which are increasingly important in today’s world.