WEST: Word Encoded Sequence Transducers

Ananda Theertha Suresh
Mitchel Weintraub
ICASSP 2019, IEEE(2019)
Google Scholar


Modeling tasks that use a large vocabulary require two words-to-vector maps, one for the embedding layer and one for the softmax layer. A majority of model parameters for such modeling tasks are in the embedding and the softmax layers, while only a small fraction of the parameters are used to the core of the model e.g., recurrent structures such as LSTM. When training models on small to medium corpus size, these models are subject to over-tting as well as large storage and memory footprint requirements. We propose to compress the embedding and softmax matrices by imposing structure into the parameter space. The embedding and softmax matrices are factored as the product of a sparse matrix and a structured dense matrix. Without compromizing performance, we achieve a significant compression rate for the embedding layer and a moderate compression rate for the softmax layer. The factoring of the embedding and softmax matrix before training allows us to jointly train these matrix values to optimize the training objective. Being able to compress the embedding and softmax layers allows us to uses this saved memory for increased recurrent unit size, which results in improved performance at an uncompressed memory footprint. We report results of this compression technique on standard datasets and a state of the art on-device automatic speech recognition system.