Multiframe Deep Neural Networks for Acoustic Modeling
Abstract
Deep neural networks have been shown to perform very well
as acoustic models for automatic speech recognition. Compared
to Gaussian mixtures, however, they tend to be very
expensive computationally, making them challenging to use
in real-time applications. One key advantage of such neural
networks is their ability to learn from very long observation
windows of up to 400 ms. Given this very long temporal
context, it is tempting to wonder whether one can run neural
networks at a lower frame rate than the typical 10 ms, and
whether there might be computational benefits to doing so.
This paper describes a method of tying the neural network parameters
over time that achieves performance comparable to the typical
frame-synchronous model while reducing the computational cost
of the neural network activations by up to 4X.
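
As a rough illustration of where such savings could come from, the sketch below counts multiply-accumulates (MACs) per 10 ms frame for a fully connected network. It is not the method of this paper: it assumes, purely for illustration, that the hidden layers are evaluated once per group of consecutive frames and that one output layer is evaluated for each frame in the group. The function names, layer sizes, and group size are all hypothetical.

```python
def dense_macs(layer_sizes):
    """Multiply-accumulates for one pass through a stack of dense layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def macs_per_frame(input_dim, hidden_sizes, num_outputs, frames_per_eval=1):
    """Average MACs spent per 10 ms frame.

    Assumption (not taken from the paper): the hidden layers run once per
    group of `frames_per_eval` frames, and a separate output layer runs
    for every frame in that group.
    """
    hidden = dense_macs([input_dim, *hidden_sizes])
    output = hidden_sizes[-1] * num_outputs
    return (hidden + frames_per_eval * output) / frames_per_eval


# Hypothetical sizes, chosen only to make the arithmetic concrete.
INPUT_DIM = 400
HIDDEN = [1000, 1000, 1000, 1000]
OUTPUTS = 2000

baseline = macs_per_frame(INPUT_DIM, HIDDEN, OUTPUTS, frames_per_eval=1)
multiframe = macs_per_frame(INPUT_DIM, HIDDEN, OUTPUTS, frames_per_eval=4)

print(f"frame-synchronous: {baseline:,.0f} MACs/frame")
print(f"multiframe (K=4):  {multiframe:,.0f} MACs/frame")
print(f"reduction:         {baseline / multiframe:.2f}x")
```

Under these assumptions only the shared hidden layers are amortized over the group, so the overall reduction approaches the group size only when the hidden layers dominate the total cost.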