Regularizing Word Segmentation by Creating Misspellings

Bhuvana Ramabhadran; Hainan Xu; Jesse Emond; Kartik Audhkhasi; Yinghui Huang

Interspeech 2021 (to appear)

Abstract

This work focuses on improving subword segmentation algorithms for end-to-end speech recognition models and makes two major contributions. First, we propose a novel word segmentation algorithm. The algorithm uses the same vocabulary file generated by a regular wordpiece model, is easily extensible, supports a variety of regularization techniques in the segmentation space, and outperforms the regular wordpiece model. Second, we propose a number of novel regularization methods that introduce randomness into the tokenization algorithm, which bring further gains in speech recognition performance. A noteworthy discovery from this work is that creating artificial misspellings in words results in the best performance among all the methods, which could inspire future research on strategies in this area.
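
The abstract does not describe the segmentation algorithm or the misspelling procedure in detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a greedy longest-match segmenter over a fixed vocabulary and a hypothetical noise model that substitutes one random character per word with a given probability. The names `greedy_segment`, `misspell`, `segment_with_misspelling`, the `prob` parameter, and the toy vocabulary are all illustrative assumptions, and wordpiece boundary markers are omitted for brevity.

```python
import random


def greedy_segment(word, vocab):
    """Greedy longest-match-first segmentation (assumed stand-in segmenter).

    Pieces not found in the vocabulary fall back to single characters.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def misspell(word, rng, prob=0.1, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """With probability `prob`, substitute one random character.

    Hypothetical noise model; the substituted character may happen to
    equal the original one.
    """
    if not word or rng.random() >= prob:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(alphabet) + word[i + 1:]


def segment_with_misspelling(word, vocab, rng, prob=0.1):
    """Segment a possibly misspelled copy of `word` (training-time regularization)."""
    return greedy_segment(misspell(word, rng, prob), vocab)


if __name__ == "__main__":
    toy_vocab = {"speech", "spe", "ech", "rec", "ognition"}
    rng = random.Random(0)
    print(greedy_segment("speech", toy_vocab))
    print(segment_with_misspelling("speech", toy_vocab, rng, prob=1.0))
```

In a sketch like this, the noise would be applied only to training transcripts, so the model sees varied segmentations of the same word while the vocabulary file stays unchanged.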

Research Areas

Machine intelligence
