Bayesian text segmentation for index term identification and keyphrase extraction

David Newman
Jey Han Lau
Timothy Baldwin
Proceedings of International Conference on Computational Linguistics (COLING) (2012)

Abstract

Automatically extracting terminology and index terms from scientific literature is useful for a variety of digital library, indexing and search applications. The task is non-trivial: it is complicated by domain-specific terminology and by the steady introduction of new terms. Correctly identifying nested terminology adds to the challenge. We present a Dirichlet Process (DP) model of word segmentation in which multiword segments are either retrieved from a cache or newly generated. We show how this DP-Segmentation model can be used to extract nested terminology, outperforming previous methods for this problem.
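
To make the cache-or-generate idea concrete, here is a minimal sketch of the standard DP (Chinese restaurant process) predictive probability for a single segment. It is an illustration under stated assumptions, not the authors' implementation: the base distribution `base_prob`, the concentration parameter `alpha`, the vocabulary size, and the toy cache counts are all hypothetical, since the abstract does not specify them.

```python
from collections import Counter

def base_prob(segment, vocab_size=10_000, p_stop=0.5):
    """Hypothetical base distribution P0 over segments (an assumption):
    geometric prior on segment length, uniform choice of each word."""
    n = len(segment)
    length_prob = p_stop * (1 - p_stop) ** (n - 1)
    return length_prob * (1.0 / vocab_size) ** n

def segment_prob(segment, cache, alpha=1.0):
    """Standard DP predictive probability of a segment given cached counts.

    Returns (total, from_cache, newly_generated), where
      from_cache      = count(segment) / (N + alpha)
      newly_generated = alpha * P0(segment) / (N + alpha)
    and N is the number of segments generated so far.
    """
    n_total = sum(cache.values())
    from_cache = cache[segment] / (n_total + alpha)
    newly_generated = alpha * base_prob(segment) / (n_total + alpha)
    return from_cache + newly_generated, from_cache, newly_generated

# Toy cache of previously generated multiword segments (illustrative counts).
cache = Counter({
    ("dirichlet", "process"): 3,
    ("word", "segmentation"): 1,
})

p_seen, seen_cache, seen_new = segment_prob(("dirichlet", "process"), cache)
p_unseen, unseen_cache, unseen_new = segment_prob(("index", "term"), cache)

print(f"seen segment:   {p_seen:.6g}")
print(f"unseen segment: {p_unseen:.6g}")
```

In this sketch a segment that is already in the cache gets most of its probability from its count, while an unseen segment can only be produced through the base distribution. That asymmetry is what lets a model of this kind favour reusing multiword units it has seen before.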