Sequence-to-sequence Neural Network model with 2D attention for learning Japanese pitch accents

Antoine Bruguier
Arkady Arkhangorodsky
Interspeech, 2018

Many Japanese text-to-speech (TTS) systems use word-level pitch accents as one of their prosodic features. Pitch accents are often predicted from text by combining a pronunciation dictionary that includes lexical pitch accents with a statistical model of word accent sandhi. However, relying on human transcribers to build the dictionary and the training data for the model is tedious and expensive. This paper proposes a neural pitch accent recognition model. The model combines information from audio and its transcription (a word sequence in hiragana characters) via two-dimensional attention and outputs word-level pitch accents. Experimental results show a lower word pitch accent prediction error rate than that obtained with text alone. This reduces the workload of human annotators building a pronunciation dictionary. Because the approach is general, it can also be applied to pronunciation learning in other languages.
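
To make the idea of attending over audio and transcription together more concrete, the following is a minimal sketch of one possible reading of "two-dimensional attention": a single softmax taken over every (audio frame, hiragana character) pair. The abstract does not specify the scoring function or the layer layout, so everything here is an assumption made for illustration, including the trilinear score, the function names `softmax_2d` and `attend_2d`, and the toy vectors. It should not be read as the architecture used in the paper.

```python
import math


def softmax_2d(scores):
    """Normalize a 2D grid of scores so that all cells together sum to 1."""
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def attend_2d(query, audio, text):
    """Attend jointly over (audio frame, text character) pairs.

    query: d floats, a hypothetical decoder state
    audio: T vectors of d floats, hypothetical audio encoder outputs
    text:  N vectors of d floats, hypothetical hiragana character embeddings
    Returns a T x N weight grid and a context vector of length 2 * d.
    """
    # Assumed trilinear score: sum_k query[k] * audio[t][k] * text[n][k].
    scores = [
        [sum(q * a * c for q, a, c in zip(query, frame, char)) for char in text]
        for frame in audio
    ]
    weights = softmax_2d(scores)

    # Context: weighted sum of the audio vector and text vector of each pair,
    # stored side by side (audio part first, text part second).
    d = len(query)
    context = [0.0] * (2 * d)
    for t, frame in enumerate(audio):
        for n, char in enumerate(text):
            w = weights[t][n]
            for k in range(d):
                context[k] += w * frame[k]
                context[d + k] += w * char[k]
    return weights, context


# Toy inputs: 3 audio frames, 2 characters, feature dimension 2.
query = [1.0, 0.5]
audio = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.8]]
text = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend_2d(query, audio, text)
```

In a full model, a context vector of this kind would typically feed a decoder step that emits one pitch accent label per word. The score here multiplies the audio and text features together so that the weight of a frame depends on which character it is paired with; a purely additive score would factor into two independent one-dimensional attentions.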
