Stanford University’s Chinese-to-English Statistical Machine Translation System for the 2008 NIST Evaluation
Abstract
This document describes Stanford University’s first entry into a NIST MT evaluation. Our entry to the 2008 evaluation focused mainly on establishing a competent baseline with a phrase-based system similar to those of Och and Ney (2004) and Koehn et al. (2007). In a three-week effort prior to the evaluation, we concentrated on scaling up our system to exploit nearly all Chinese-English parallel data permissible under the constrained track, incorporating competitive language models into the decoder using Gigaword and Google n-grams, evaluating Chinese word segmentation models, and incorporating a document classifier as a pre-processing stage to the decoder.
This document is organized as follows. In Section 2, we describe the linguistic resources used for our submission. In Section 3, we present the four main components of our translation system: a phrase-based translation system, a Chinese word segmenter, a text categorizer, and a truecaser. Finally, we discuss our results in Section 4.