Stanford University’s Chinese-to-English Statistical Machine Translation System for the 2008 NIST Evaluation

Michel Galley
Jenny R. Finkel
Christopher D. Manning
The 2008 NIST Open Machine Translation Evaluation Meeting (2008)

Abstract

This document describes Stanford University’s first entry into a NIST MT evaluation. Our entry to the 2008 evaluation focused mainly on establishing a competent baseline with a phrase-based system similar to those of Och and Ney (2004) and Koehn et al. (2007). In a three-week effort prior to the evaluation, we concentrated on scaling up our system to exploit nearly all Chinese-English parallel data permissible under the constrained track, incorporating competitive language models into the decoder using Gigaword and Google n-grams, evaluating Chinese word segmentation models, and incorporating a document classifier as a pre-processing stage to the decoder. This document is organized as follows. In Section 2, we describe the linguistic resources used for our submission. In Section 3, we present the four main components of our translation system: a phrase-based translation system, a Chinese word segmenter, a text categorizer, and a truecaser. Finally, we discuss our results in Section 4.
