Unary Data Structures for Language Models
Abstract
Language models are important components of speech recognition and machine translation systems.
Trained on billions of words and consisting of billions of parameters, language models are often the
single largest component of these systems. Many techniques have been proposed to reduce the
storage requirements of language models. A technique based on pointer-free compact storage of
ordinal trees achieves compression competitive with the best proposed systems, while retaining the full
finite-state structure and avoiding computationally expensive block compression schemes and
lossy quantization.
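To illustrate what a pointer-free compact encoding of an ordinal tree can look like, the sketch below uses a LOUDS-style unary degree sequence: each node's out-degree d is written as d one-bits followed by a zero-bit, in breadth-first order, so the tree topology costs roughly two bits per node and requires no child pointers. This is only a minimal illustration of the general family of encodings, not the structure described in this paper, and the linear-scan rank/select routines stand in for the compressed bit-vector indexes a real implementation would use.

```cpp
// Minimal LOUDS-style unary encoding of an ordinal tree (illustrative sketch).
#include <cstdint>
#include <iostream>
#include <queue>
#include <vector>

class LoudsTree {
 public:
  // children[i] lists the children of node i; node 0 is the root.
  explicit LoudsTree(const std::vector<std::vector<int>>& children) {
    bits_ = {1, 0};  // Virtual super-root pointing at the real root.
    std::queue<int> bfs;
    bfs.push(0);
    while (!bfs.empty()) {
      int node = bfs.front();
      bfs.pop();
      // Unary degree code: one 1-bit per child, terminated by a 0-bit.
      for (int child : children[node]) {
        bits_.push_back(1);
        bfs.push(child);
      }
      bits_.push_back(0);
    }
  }

  // Nodes are addressed by their 1-based breadth-first number; the root is 1.
  int Degree(int x) const {
    int pos = Select0(x) + 1, d = 0;
    while (bits_[pos + d - 1] == 1) ++d;  // Length of the unary run of 1s.
    return d;
  }
  int Child(int x, int i) const {  // i is 1-based; returns the child's BFS number.
    return Rank1(Select0(x) + i);
  }
  int Parent(int x) const {  // Returns 0 (the super-root) when x is the root.
    return Rank0(Select1(x));
  }
  size_t SizeInBits() const { return bits_.size(); }

 private:
  // O(n) scans; real implementations use sampled rank/select directories.
  int Rank1(int pos) const {  // Number of 1s in positions 1..pos.
    int r = 0;
    for (int p = 1; p <= pos; ++p) r += bits_[p - 1];
    return r;
  }
  int Rank0(int pos) const { return pos - Rank1(pos); }
  int Select1(int j) const {  // Position of the j-th 1-bit.
    for (int p = 1, seen = 0; p <= static_cast<int>(bits_.size()); ++p)
      if (bits_[p - 1] == 1 && ++seen == j) return p;
    return -1;
  }
  int Select0(int j) const {  // Position of the j-th 0-bit.
    for (int p = 1, seen = 0; p <= static_cast<int>(bits_.size()); ++p)
      if (bits_[p - 1] == 0 && ++seen == j) return p;
    return -1;
  }

  std::vector<uint8_t> bits_;
};

int main() {
  // Tree: node 0 is the root with children 1 and 2; node 1 has one child, 3.
  std::vector<std::vector<int>> children = {{1, 2}, {3}, {}, {}};
  LoudsTree tree(children);
  std::cout << "bits used:    " << tree.SizeInBits() << "\n";  // 2n + 1 = 9
  std::cout << "degree(root): " << tree.Degree(1) << "\n";     // 2
  std::cout << "child(1, 2):  " << tree.Child(1, 2) << "\n";   // 3
  std::cout << "parent(4):    " << tree.Parent(4) << "\n";     // 2
  return 0;
}
```

In an n-gram model, the same idea lets the trie of contexts be stored as a bit sequence plus flat arrays of labels and values, with navigation done entirely through rank and select rather than stored pointers.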