Distributed representation and estimation of WFST-based n-gram models

Cyril Allauzen; Michael Riley; Brian Roark

Distributed representation and estimation of WFST-based n-gram models

Cyril Allauzen

Michael Riley

Brian Roark

Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (StatFSM) (2016), pp. 32-41

Download Google Scholar

Abstract

We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be merged into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would have otherwise been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the OpenGrm library.

Research Areas

Natural language processing

Explore our many areas of focus

Building a collaborative ecosystem

Shaping the future together

Translating discovery into real-world impact

Distributed representation and estimation of WFST-based n-gram models

Abstract

Research Areas

Meet the teams driving innovation

Google AI

Google Cloud

Google DeepMind

Google Labs