Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design
Abstract
This paper presents a case study of migrating a privacy-safe information extraction system for Gmail from a traditional rule-based architecture to a machine-learned Software 2.0 architecture. The key idea is to use the extractions from the existing rule-based system as training data for ML models that in turn replace all the machinery of the rule-based system. The resulting system a) delivers better precision and recall, b) is significantly smaller in terms of lines of code, c) has been easier to maintain and improve, and d) has opened up the possibility of leveraging ML advances to build a cross-language extraction system even though our original training data was only in English. We describe the challenges encountered during this migration in generating and managing training data and in evaluating models, and report on the many traditional ``Software 1.0'' components we built to address them.