Cross-lingual text clustering in a large system

Nicole R. Schneider; Jagan Sankaranarayanan; Hanan Samet

2023 7th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2023) (2023) (to appear)

Abstract

The multilingual world needs systems that can cluster text written
in multiple languages into the same thread or topic. A practical
approach for clustering text in different languages is to first translate
into a common language, such as English, and then cluster it
post-translation. While this approach seems sensible, the performance
and pitfalls of this approach have not been well studied. The
reference architecture used for the study is a news system that
has continuously indexed news articles over many years in over
19 languages. Through the analysis of these documents and their
clusters, the clustering quality is shown to be dependent on the
translator’s ability to normalize proper noun usage, the geographic
focus of the text, and the document topic.
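
The translate-then-cluster approach described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the word-level glossary `TOY_GLOSSARY` is a hypothetical stand-in for a real machine translation system, and the bag-of-words cosine similarity with greedy single-pass clustering is a simple placeholder, not the clustering method used in the news system studied.

```python
import math
from collections import Counter

# Hypothetical stand-in for a machine translation system: a tiny word-level
# glossary. A real pipeline would call an actual translator at this step.
TOY_GLOSSARY = {
    "es": {"terremoto": "earthquake", "en": "in", "japón": "japan", "fuerte": "strong"},
    "de": {"erdbeben": "earthquake", "starkes": "strong"},
}


def translate_to_english(text, lang):
    """Return English tokens; tokens missing from the glossary pass through unchanged."""
    tokens = text.lower().split()
    if lang == "en":
        return tokens
    glossary = TOY_GLOSSARY.get(lang, {})
    return [glossary.get(tok, tok) for tok in tokens]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[tok] for tok, count in a.items() if tok in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering over translated documents.

    Each document joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [doc index]}
    for idx, (text, lang) in enumerate(docs):
        vec = Counter(translate_to_english(text, lang))
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": vec.copy(), "members": [idx]})
        else:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(idx)
    return [c["members"] for c in clusters]


docs = [
    ("strong earthquake in japan", "en"),
    ("fuerte terremoto en japón", "es"),
    ("starkes erdbeben in japan", "de"),
    ("election results announced today", "en"),
]
print(cluster(docs))
```

In this toy setup the three earthquake reports land in one cluster only because the glossary maps "japón" to the same token as "japan". If the translation step left the place name untranslated, the token overlap between documents would shrink, which is one way to picture the dependence on proper noun normalization that the abstract reports.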
