Online Template Induction for Machine-Generated Emails
Abstract
Most consumer email in the world is machine-generated communication from a business to a human. Understanding the underlying templates that are used to instantiate these emails is a key step to enabling a variety of intelligent experiences. In this paper, we present the first description of the template-induction problem in an online setting for a planet-scale email system. While previous work has addressed the problem of discovering these templates using an offline batch job (perhaps architected as a MapReduce), discovering these templates online has several advantages. We present the design of an online template induction system and describe the design choices we had to make. The resulting system handles online template induction over a stream of several billion emails a day. With the new system, incoming email can be identified as belonging to a known template within minutes of that template being discovered, compared to several days' worth of delay with the previous batch approach. Further, the online system has a resource consumption footprint that is 10x smaller than that of the batch approach. We also report on a surprising lesson we learned: conventional stream processing systems did not present a good framework on which to build this system. We hope that the lessons from this system help designers of future stream processing systems accommodate a broader range of applications like online template induction.