Distributed Acoustic Modeling with Back-off N-grams
Abstract
The paper proposes an approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data and the model size (as measured by the number of parameters in the model) to approximately 100 times larger than current sizes used in ASR. Unseen phonetic contexts are handled with the familiar back-off technique from language modeling, chosen for its implementation simplicity. The new acoustic model is estimated and stored using the MapReduce distributed computing infrastructure. Speech recognition experiments are carried out in an N-best rescoring framework for Google Voice Search. 87,000 hours of training data are obtained in an unsupervised fashion by filtering utterances in Voice Search logs on ASR confidence. The resulting models are trained using maximum likelihood and contain 20-40 million Gaussians. They achieve relative reductions in WER of 11% and 6% over first-pass models trained using maximum likelihood and boosted MMI, respectively.
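To make the back-off idea concrete, the following is a minimal sketch, in the same spirit as backing off from a trigram to a bigram: when a wide phonetic context was unseen in training, the lookup drops the outermost context phones until a trained model is found. The keying scheme, table layout, and function name are illustrative assumptions, not the paper's exact implementation.

```python
def backoff_lookup(models, center_phone, left, right):
    """Return the model for the longest phonetic context seen in training.

    models: dict mapping (center, left_tuple, right_tuple) -> GMM
    left/right: tuples of context phones, nearest phone first.
    Backs off to shorter contexts (dropping the outermost phones)
    until a trained entry is found; the context-independent model
    for the center phone is assumed to always exist.
    """
    width = max(len(left), len(right))
    for w in range(width, -1, -1):
        key = (center_phone, left[:w], right[:w])
        if key in models:
            return models[key]
    raise KeyError(f"no model for center phone {center_phone!r}")
```

For example, given models keyed on ('ae', ('k',), ('t',)) and ('ae', (), ()), a query with the wider context ('k', 's') / ('t',) would fall through the unseen two-phone key and return the one-phone-context model.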
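The MapReduce estimation step can likewise be sketched as accumulating per-context sufficient statistics in the map phase and estimating parameters in the reduce phase. The sketch below assumes a single diagonal Gaussian per context key for brevity (the paper's models are Gaussian mixtures), and all function names and data shapes are hypothetical.

```python
from collections import defaultdict
import numpy as np

def map_utterance(aligned_frames):
    """Map step: emit (context_key, sufficient statistics) per frame.

    aligned_frames: iterable of (context_key, feature_vector) pairs,
    e.g. from a forced alignment of a confidence-filtered utterance.
    """
    for key, x in aligned_frames:
        yield key, (1, x, x * x)  # frame count, first and second moments

def reduce_key(key, stats_iter):
    """Reduce step: sum statistics for one key, then ML-estimate mean/var."""
    n, s1, s2 = 0, 0.0, 0.0
    for c, m1, m2 in stats_iter:
        n, s1, s2 = n + c, s1 + m1, s2 + m2
    mean = s1 / n
    var = s2 / n - mean * mean
    return key, (mean, var)

# Toy local run standing in for the distributed shuffle:
frames = [(("ae", ("k",), ("t",)), np.array([1.0, 2.0])),
          (("ae", ("k",), ("t",)), np.array([3.0, 4.0]))]
grouped = defaultdict(list)
for key, stat in map_utterance(frames):
    grouped[key].append(stat)
estimates = dict(reduce_key(k, v) for k, v in grouped.items())
```

Because the statistics are additive, the shuffle can partition keys across machines arbitrarily, which is what lets estimation scale to tens of millions of Gaussians.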