ML for Operations

Steven Ross
Todd Underwood
;login:, vol. 44 (2020), unk-unk

Abstract

Even in an age of cloud services, maintaining applications in production is full of hard and tedious work. This is unrewarding labor, or toil, that we collectively would like to automate. The worst of this toil is manual, repetitive, tactical, devoid of enduring value, and scales linearly as a service grows. Think of work such as manually building, testing, and deploying binaries, configuring memory limits, and responding to false-positive pages. This toil takes time from activities that are more interesting and produce more enduring value, but it persists because it takes just enough human judgement that it is difficult to find simple, workable heuristics to replace those humans. Machine learning (ML) is often proposed as the solution to automate this unpleasant work, and many believe that it will provide near-magical solutions to these problems. This article is for developers and systems engineers with production responsibilities who are lured by the siren song of magical operations that ML seems to sing. Assuming no prior detailed expertise in ML, we provide an overview of how ML works and where it doesn’t, production considerations for using it, and an assessment of using ML to solve various operations problems. We also list a number of ideas that appear plausible but, in fact, are not workable.