Occam's Adaptation: A Comparison of Interpolation of Bases Adaptation Methods for Multi-Dialect Acoustic Modeling with LSTMs
Abstract
Multidialectal languages can pose challenges for acoustic modeling. Past research has shown that, given a large training corpus but no explicit modeling of inter-dialect variability, training individual per-dialect models yields better performance than training a single model on the combined data [1, 2]. In this work, we were motivated by the idea that adaptation techniques can allow a model to learn dialect-independent features and, in turn, to leverage the larger training corpus obtained by pooling data across dialects. Our goal was thus to create a single multidialect acoustic model that would rival the performance of the dialect-specific models. Working in the context of deep Long Short-Term Memory (LSTM) acoustic models trained on up to 40K hours of speech, we explored several methods for training and incorporating dialect-specific information into the model, including 12 variants of interpolation-of-bases techniques related to Cluster Adaptive Training (CAT) [3] and Factorized Hidden Layer (FHL) [4] techniques. We found that, with our model topology and large training corpus, simply appending the dialect-specific information to the feature vector resulted in a more accurate model than any of the more complex interpolation-of-bases techniques, while requiring less model complexity and fewer parameters. This simple adaptation yielded a single unified model for all dialects that, in most cases, outperformed individual models trained per dialect.
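To make the contrast between the two families of adaptation concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that the dialect-specific information is a one-hot dialect code (the abstract does not specify the encoding) and that an interpolation-of-bases layer forms its weight matrix as a dialect-weighted sum of basis matrices, in the general spirit of CAT. The dialect labels, function names, and dimensions are all hypothetical.

```python
# Illustrative sketch only: names, dialect labels, and sizes are assumptions.
DIALECTS = ["dialect_a", "dialect_b", "dialect_c"]  # hypothetical labels


def one_hot(dialect, dialects=DIALECTS):
    """Encode a dialect label as a one-hot vector (assumed encoding)."""
    return [1.0 if d == dialect else 0.0 for d in dialects]


def append_dialect(frames, dialect):
    """Feature-append adaptation: concatenate the same dialect code
    onto every acoustic feature frame of an utterance."""
    code = one_hot(dialect)
    return [list(frame) + code for frame in frames]


def interpolate_bases(bases, weights):
    """Interpolation-of-bases adaptation: build one weight matrix as the
    weighted sum of several basis matrices, one weight per basis."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [
        [sum(w * b[r][c] for w, b in zip(weights, bases)) for c in range(cols)]
        for r in range(rows)
    ]


# Two 2-dimensional frames from an utterance in "dialect_b".
frames = [[0.1, 0.2], [0.3, 0.4]]
augmented = append_dialect(frames, "dialect_b")

# Two 2x2 basis matrices mixed with equal interpolation weights.
bases = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
mixed = interpolate_bases(bases, [0.5, 0.5])
```

In this sketch the feature-append variant changes only the width of the input layer, whereas the interpolation variant stores one full matrix per basis for each adapted layer, which illustrates why the former needs fewer parameters.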