Cross-lingual text clustering in a large system

Nicole R. Schneider
Hanan Samet
7th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2023), 2023 (to appear)

Abstract

The multilingual world needs systems that can cluster text written in different languages into the same thread or topic. A practical approach is to first translate the text into a common language, such as English, and then cluster the translated text. While this approach seems sensible, its performance and pitfalls have not been well studied. The reference architecture used for the study is a news system that has continuously indexed news articles over many years in over 19 languages. Through the analysis of these documents and their clusters, the clustering quality is shown to depend on the translator’s ability to normalize proper noun usage, the geographic focus of the text, and the document topic.
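
The translate-then-cluster idea described in the abstract can be illustrated with a minimal sketch. This is not the system studied in the paper: the word lookup table `TOY_LEXICON` is a hypothetical stand-in for a real machine translation system, and the greedy single-pass clustering with a cosine `threshold` is an assumed, simplified clustering method chosen only to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a machine translation system: a tiny
# word-for-word lookup table into English (assumption for illustration).
TOY_LEXICON = {
    "elecciones": "elections", "en": "in", "francia": "france",
    "wahlen": "elections", "frankreich": "france",
    "terremoto": "earthquake", "japón": "japan",
}


def translate_to_english(text):
    """Lowercase, tokenize, and map each token through the toy lexicon."""
    tokens = re.findall(r"\w+", text.lower())
    return [TOY_LEXICON.get(token, token) for token in tokens]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.5):
    """Translate each document, then assign it to the most similar
    existing cluster, or start a new cluster if none is similar enough."""
    clusters = []  # each entry: {"centroid": Counter, "members": [doc index]}
    for index, doc in enumerate(docs):
        vector = Counter(translate_to_english(doc))
        best = max(clusters,
                   key=lambda c: cosine(vector, c["centroid"]),
                   default=None)
        if best is not None and cosine(vector, best["centroid"]) >= threshold:
            best["centroid"].update(vector)  # accumulate term counts
            best["members"].append(index)
        else:
            clusters.append({"centroid": Counter(vector), "members": [index]})
    return [c["members"] for c in clusters]


docs = [
    "Elecciones en Francia",   # Spanish
    "Wahlen in Frankreich",    # German
    "Elections in France",     # English
    "Terremoto en Japón",      # Spanish
    "Earthquake in Japan",     # English
]
print(cluster(docs))
```

In this toy setting, a proper noun that the lexicon leaves untranslated (for example, if `"frankreich"` were missing) no longer matches its English counterpart and lowers the similarity between documents on the same topic, which is a small-scale version of the proper noun normalization issue the abstract mentions.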