Google Research

Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design

Proceedings of the 10th Annual Conference on Innovative Data Systems Research (2020)

Abstract

This paper presents a case study of migrating a privacy-safe information extraction system for Gmail from a traditional rule-based architecture to a machine-learned Software 2.0 architecture. The key idea is to use the extractions from the existing rule-based system as training data to learn ML models that in turn replace all the machinery for the rule-based system. The resulting system a) delivers better precision and recall, b) is significantly smaller in terms of lines of code, c) has been easier to maintain and improve, and d) has opened up the possibility of leveraging ML advances to build a cross-language extraction system even though our original training data was only in English. We describe challenges encountered during this migration around generation and management of training data, evaluation of models, and report on many traditional ``Software 1.0'' components we built to address them.

Learn more about how we do research

We maintain a portfolio of research projects, providing individuals and teams the freedom to emphasize specific types of work