WEST: Word Encoded Sequence Transducers
Abstract
Modeling tasks that use a large vocabulary require two word-to-vector maps, one for the embedding layer and one for the softmax layer. For such tasks, a majority of the model parameters reside in the embedding and softmax layers, while only a small fraction is used in the core of the model, e.g., recurrent structures such as LSTMs. When trained on small- to medium-sized corpora, these models are prone to overfitting and have large storage and memory footprints. We propose to compress the embedding and softmax matrices by imposing structure on the parameter space: each matrix is factored as the product of a sparse matrix and a structured dense matrix. Without compromising performance, we achieve a significant compression rate for the embedding layer and a moderate compression rate for the softmax layer. Factoring the embedding and softmax matrices before training allows us to train the factors jointly to optimize the training objective. Compressing the embedding and softmax layers also frees memory that can be spent on larger recurrent units, which improves performance at an uncompressed memory footprint. We report results of this compression technique on standard datasets and a state-of-the-art on-device automatic speech recognition system.
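To make the factorization concrete, the following is a minimal PyTorch sketch of an embedding layer replaced by the product of a sparse factor and a smaller dense factor, trained jointly with the rest of the model. The particular sparsity pattern (a fixed random support per word) and the unstructured dense factor are illustrative assumptions for this sketch, not the structured factors proposed in the paper.

```python
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """Illustrative factored embedding: E ~ S @ D, with S sparse (|V| x k)
    and D a smaller dense factor (k x d). The fixed random sparsity pattern
    and the plain dense D are placeholder choices for this sketch."""

    def __init__(self, vocab_size, embed_dim, inner_dim, nnz_per_row=4):
        super().__init__()
        # Fixed sparse support: each word's row of S keeps only a few nonzeros.
        support = torch.randint(0, inner_dim, (vocab_size, nnz_per_row))
        self.register_buffer("support", support)                      # (V, nnz)
        # Trainable nonzero values of S and the dense factor D.
        self.sparse_vals = nn.Parameter(0.1 * torch.randn(vocab_size, nnz_per_row))
        self.dense = nn.Parameter(0.1 * torch.randn(inner_dim, embed_dim))

    def forward(self, token_ids):
        # Gather the rows of D selected by each token's sparse support,
        # then combine them with the trainable sparse values: (S @ D)[token_ids].
        rows = self.dense[self.support[token_ids]]          # (..., nnz, d)
        vals = self.sparse_vals[token_ids].unsqueeze(-1)    # (..., nnz, 1)
        return (vals * rows).sum(dim=-2)                    # (..., d)

# Parameter count drops from V*d to roughly V*nnz + k*d.
emb = FactoredEmbedding(vocab_size=10000, embed_dim=512, inner_dim=256)
vecs = emb(torch.tensor([[1, 42, 7]]))   # shape (1, 3, 512)
```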