Towards Automatic Corpus Preparation for a German Broadcast News Transcription System

Wolfgang Macherey; Hermann Ney

Int. Conf. on Spoken Language Processing (2002), pp. 733-736

Abstract

When setting up a speech recognition system for a new domain, considerable manual effort goes into corpus preparation, i.e., data acquisition, cutting and segmentation of the audio material, generation of pronunciation lexica, and the definition of suitable training and test sets. In this paper, we describe several methods that help to automate and thus speed up this procedure. For this purpose, we assume that only a preliminary, partially incorrect textual transcription is available. The effectiveness of the proposed methods is demonstrated through the development of a transcription system for the recognition of German broadcast news.

Research Areas

Natural language processing
