A Study of Raters' Sensitivity to Inter-sentence Pause Durations in American English Speech

Paul Owoicho
Speech Prosody 2024 (SP2024) (2024) (to appear)

Abstract

Inter-sentence pauses are the silences that occur between sentences in a paragraph or a dialogue.
They are an important aspect of long-form speech prosody, as they can affect the naturalness, intelligibility, and effectiveness of communication.
However, how users perceive inter-sentence pauses in long-form speech synthesis is not well understood. Previous work often evaluates pause modelling in conjunction with other prosodic features, making it hard to study explicitly how raters perceive differences in inter-sentence pause lengths.
In this paper, we investigate how sensitive raters are to changes in the durations of inter-sentence pauses in long-form speech. Using multiple text-to-speech (TTS) datasets that cover different content types, domains, and settings, we compare ground truth audio samples with renditions whose pause durations have been manipulated.
This experimental design is intended to let us draw conclusions about the utility that can be expected from similar evaluations when they are applied to synthesized long-form speech.
We find that, using standard evaluation methodologies, raters are not sensitive to variations in pause lengths unless these deviate greatly from the norms or expectations of the speech context.