Text Normalization Infrastructure that Scales to Hundreds of Language Varieties

Mason Chua; Daan van Esch; Noah Coccaro; Eunjoon Cho; Sujeet Bhandari; Libin Jia

Proceedings of the 11th edition of the Language Resources and Evaluation Conference (2018)

Abstract

We describe the automated multi-language text normalization infrastructure that prepares textual data to train language models used in Google's keyboards and speech recognition systems, across hundreds of language varieties. Training corpora are sourced from various types of data sets, and the text is then normalized using a sequence of hand-written grammars and learned models. These systems need to scale to hundreds or thousands of language varieties in order to meet product needs. Frequent data refreshes, privacy considerations and simultaneous updates across such a high number of languages make manual inspection of the normalized training data infeasible, while there is ample opportunity for data normalization issues. By tracking metrics about the data and how it was processed, we are able to catch internal data processing issues and external data corruption issues that can be hard to notice using standard extrinsic evaluation methods. Showing the importance of paying attention to data normalization behavior in large-scale pipelines, these metrics have highlighted issues in Google's real-world speech recognition system that have caused significant, but latent, quality degradation.
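The idea of tracking metrics about the data and how it was processed can be illustrated with a minimal sketch. It is not the system described in the paper: the two normalization steps, the metric names, the `flag_shifts` helper, and the 50% tolerance are all hypothetical, chosen only to show how per-step counters from two data refreshes might be compared without anyone reading the normalized text itself.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical stand-ins for normalization steps. A production pipeline
# would use hand-written grammars and learned models instead.
def collapse_whitespace(line: str) -> str:
    return " ".join(line.split())

def lowercase(line: str) -> str:
    return line.lower()

STEPS: list[tuple[str, Callable[[str], str]]] = [
    ("collapse_whitespace", collapse_whitespace),
    ("lowercase", lowercase),
]

def normalize_corpus(lines: Iterable[str]) -> tuple[list[str], Counter]:
    """Apply each step in order and count what happened to the data."""
    metrics: Counter = Counter()
    output = []
    for line in lines:
        metrics["lines_in"] += 1
        for name, step in STEPS:
            new_line = step(line)
            if new_line != line:
                metrics[f"changed_by:{name}"] += 1
            line = new_line
        if not line:
            # Lines that normalize to nothing are dropped, and counted.
            metrics["lines_dropped_empty"] += 1
            continue
        metrics["lines_out"] += 1
        output.append(line)
    return output, metrics

def flag_shifts(previous: Counter, current: Counter,
                tolerance: float = 0.5) -> list[str]:
    """Return metrics whose per-input-line rate moved by more than
    `tolerance`, relative to the larger of the two rates."""
    flagged = []
    for name in sorted(set(previous) | set(current)):
        if name == "lines_in":
            continue
        prev_rate = previous[name] / max(previous["lines_in"], 1)
        cur_rate = current[name] / max(current["lines_in"], 1)
        baseline = max(prev_rate, cur_rate)
        if baseline > 0 and abs(cur_rate - prev_rate) / baseline > tolerance:
            flagged.append(name)
    return flagged
```

Under these assumptions, a refresh in which the share of dropped lines jumps from 2% to 40% would be flagged by name, even though nobody inspected the normalized text and downstream model quality had not yet been measured.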

Research Areas

Natural language processing
