Towards Automatic Corpus Preparation for a German Broadcast News Transcription System

Hermann Ney
Int. Conf. on Spoken Language Processing (2002), pp. 733-736

Abstract

When setting up a speech recognition system for a new domain, considerable manual effort is spent on corpus preparation, i.e., data acquisition, cutting and segmentation of the audio material, generation of pronunciation lexica, and the definition of suitable training and test sets. In this paper we describe several methods that help automate and thus speed up this procedure. For this purpose, we assume that only a preliminary, partially incorrect textual transcription is available. The effectiveness of the proposed methods is demonstrated through the development of a transcription system for the recognition of German broadcast news.