Google Research

Regularizing Word Segmentation by Creating Misspellings

Interspeech 2021 (to appear)

Abstract

This work focuses on improving subword segmentation algorithms for end-to-end speech recognition models and makes two major contributions. First, we propose a novel word segmentation algorithm that uses the same vocabulary file generated by a regular wordpiece model, is easily extensible, supports a variety of regularization techniques in the segmentation space, and outperforms the regular wordpiece model. Second, we propose several novel regularization methods that introduce randomness into the tokenization algorithm and bring further gains in speech recognition performance. A noteworthy finding of this work is that creating artificial misspellings in words yields the best performance among all the methods, which could inspire future research on strategies in this area.
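To make the misspelling-based regularization idea concrete, the sketch below shows one plausible way to perturb words before wordpiece segmentation. It assumes a simple per-character noise model (random substitution, deletion, or insertion) followed by greedy longest-match wordpiece tokenization; the function names, perturbation rate, and toy vocabulary are illustrative assumptions, not the algorithm described in the paper.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def misspell(word, rate=0.15, alphabet=ALPHABET):
    """Create an artificial misspelling by randomly substituting, deleting,
    or inserting characters. Hypothetical noise model for regularization."""
    out = []
    for ch in word:
        if random.random() < rate:
            op = random.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(random.choice(alphabet))
            elif op == "ins":
                out.append(ch)
                out.append(random.choice(alphabet))
            # "del": drop the character entirely
        else:
            out.append(ch)
    return "".join(out) or word  # never return an empty string

def greedy_wordpiece(word, vocab, unk="<unk>"):
    """Greedy longest-match-first wordpiece segmentation over a fixed vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

if __name__ == "__main__":
    # Toy vocabulary: a few multi-character pieces plus single-character fallbacks.
    vocab = {"mis", "##spell", "##ing"} | set(ALPHABET) | {"##" + c for c in ALPHABET}
    random.seed(0)
    clean = "misspelling"
    noisy = misspell(clean)
    print(clean, "->", greedy_wordpiece(clean, vocab))
    print(noisy, "->", greedy_wordpiece(noisy, vocab))
```

Because the corrupted word rarely matches the longer vocabulary pieces, its segmentation shifts toward smaller subwords, so the model sees varied tokenizations of the same underlying word during training.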
