Google Research

Sequence-to-sequence Neural Network model with 2D attention for learning Japanese pitch accents

  • Antoine Bruguier
  • Heiga Zen
  • Arkady Arkhangorodsky
Interspeech, vol. 2018 (2018)


Many Japanese text-to-speech (TTS) systems use word-level pitch accents as one of their prosodic features. To predict pitch accents from text, they often combine a pronunciation dictionary that includes lexical pitch accents with a statistical model of word accent sandhi. However, using human transcribers to build the dictionary and the training data for the model is tedious and expensive. This paper proposes a neural pitch accent recognition model. The model combines information from audio and its transcription (a word sequence in hiragana characters) via two-dimensional attention and outputs word-level pitch accents. Experimental results show a lower word pitch accent prediction error rate than prediction from text only. The model thus reduces the workload of human annotators when building a pronunciation dictionary. Because the approach is general, it can also be used for pronunciation learning in other languages.
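The abstract does not spell out how the two-dimensional attention is computed, so the following is only an illustrative sketch of one possible reading: a single softmax taken jointly over every (audio frame, hiragana character) pair, rather than over one sequence at a time. The function name `attention_2d`, the trilinear scoring rule, the concatenated context vector, and the toy inputs are all assumptions made for illustration and are not taken from the paper, whose actual architecture may differ.

```python
import math


def attention_2d(query, audio, text):
    """Joint softmax attention over every (audio frame, character) pair.

    All inputs are hypothetical stand-ins, not the paper's actual tensors:
      query: list of d floats (e.g. a decoder state for one word)
      audio: T vectors of d floats (e.g. audio encoder outputs)
      text:  N vectors of d floats (e.g. hiragana encoder outputs)
    Returns (weights, context): a T x N grid of weights that sums to 1,
    and a context vector of length 2 * d.
    """
    d = len(query)

    # Assumed trilinear score: sum_k query[k] * audio[t][k] * text[n][k].
    scores = [
        [sum(q * a * c for q, a, c in zip(query, frame, char)) for char in text]
        for frame in audio
    ]

    # One softmax over the whole T x N grid (shifted by the max for stability).
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]

    # Context: weighted sum of each pair's audio vector and character vector,
    # stored as [audio part, text part].
    context = [0.0] * (2 * d)
    for t, frame in enumerate(audio):
        for n, char in enumerate(text):
            w = weights[t][n]
            for k in range(d):
                context[k] += w * frame[k]
                context[d + k] += w * char[k]
    return weights, context


# Toy example: 3 audio frames, 2 characters, encoding size 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_2d([1.0, 1.0], audio, text)
```

In this sketch the score multiplies the audio and character encodings together, so the weight grid does not factor into two independent one-dimensional attentions. That is the property the sketch is meant to show: a frame receives high weight only in combination with a character it matches.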
