Cross-lingual text clustering in a large system
Abstract
A multilingual world needs systems that can cluster text written
in multiple languages into the same thread or topic. A practical
approach is to first translate the text into a common language, such
as English, and then cluster the translated text. While this approach
seems sensible, its performance and pitfalls have not been well
studied. The reference architecture used for this study is a news
system that has continuously indexed news articles for many years in
over 19 languages. Analysis of these documents and their clusters
shows that clustering quality depends on the translator’s ability to
normalize proper noun usage, the geographic focus of the text, and
the document topic.
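
The translate-then-cluster approach described above can be illustrated with a minimal sketch. It is not the studied system's implementation: the word-level lexicon `TOY_LEXICON`, the helper names `translate_to_english`, `jaccard`, and `cluster`, the Jaccard similarity measure, the greedy single-pass clustering, and the 0.5 threshold are all assumptions chosen only to keep the example self-contained.

```python
import re

# Hypothetical toy "translator": a word-level lookup table standing in for a
# real machine translation system (an assumption for illustration only).
TOY_LEXICON = {
    "es": {"el": "the", "presidente": "president",
           "visita": "visits", "londres": "london"},
    "de": {"der": "the", "präsident": "president",
           "besucht": "visits", "london": "london"},
}


def translate_to_english(text, lang):
    """Tokenize, lowercase, and map tokens to English; unknown tokens pass through."""
    tokens = re.findall(r"\w+", text.lower())
    if lang == "en":
        return tokens
    table = TOY_LEXICON.get(lang, {})
    return [table.get(tok, tok) for tok in tokens]


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering on translated tokens.

    Each document joins the first cluster whose first (seed) document is
    at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of (doc_id, tokens)
    for doc_id, text, lang in docs:
        tokens = translate_to_english(text, lang)
        for members in clusters:
            if jaccard(tokens, members[0][1]) >= threshold:
                members.append((doc_id, tokens))
                break
        else:
            clusters.append([(doc_id, tokens)])
    return [[doc_id for doc_id, _ in members] for members in clusters]


docs = [
    ("en-1", "The president visits London", "en"),
    ("es-1", "El presidente visita Londres", "es"),
    ("de-1", "Der Präsident besucht London", "de"),
    ("en-2", "Stock markets fall sharply", "en"),
]
result = cluster(docs)
print(result)
```

In this toy setting, the three articles about the same event fall into one cluster only because the lexicon maps "Londres" to "London". If that entry were missing, the Spanish article's similarity to the English one would drop from 1.0 to 0.6, which hints at why proper noun normalization by the translator affects clustering quality.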