From audio to semantics: Approaches to end-to-end spoken language understanding

Galen Chuang
Pedro Jose Moreno Mengibar
Delia Qu
Spoken Language Technology Workshop (SLT), 2018 IEEE


Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or the top N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio-to-semantics understanding as a sequence-to-sequence problem. We propose and compare several encoder-decoder approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that keeping an intermediate text representation while jointly optimizing the full system improves prediction accuracy.
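
To make the sequence-to-sequence framing concrete, the sketch below shows one way a set of intents and arguments could be flattened into a target token sequence for a decoder, with or without an intermediate transcript. It is an illustration only: the `SemanticFrame` class, the tag format (`<intent:...>`, `<arg:...>`, `<sep>`), and the example utterance are assumptions made here and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SemanticFrame:
    """Hypothetical container for one intent and its arguments."""
    intent: str
    arguments: Dict[str, str] = field(default_factory=dict)


def serialize_frame(frame: SemanticFrame) -> List[str]:
    """Flatten a frame into tokens a decoder could emit one at a time."""
    tokens = [f"<intent:{frame.intent}>"]
    for name, value in frame.arguments.items():
        tokens.append(f"<arg:{name}>")
        tokens.extend(value.split())
        tokens.append(f"</arg:{name}>")
    return tokens


def build_target(frame: SemanticFrame,
                 transcript: Optional[str] = None) -> List[str]:
    """Build a decoder target sequence.

    Without a transcript, the target holds the semantics only.
    With a transcript, the text comes first and acts as an
    intermediate representation ahead of the semantics
    (an assumed layout, chosen here for illustration).
    """
    semantic_tokens = serialize_frame(frame)
    if transcript is None:
        return semantic_tokens
    return transcript.split() + ["<sep>"] + semantic_tokens


if __name__ == "__main__":
    frame = SemanticFrame("set_alarm", {"time": "7 am"})
    print(build_target(frame))
    print(build_target(frame, transcript="wake me at 7 am"))
```

In both cases the model input would be audio features and the output a single token sequence, so one training loss covers recognition and understanding together. The second layout keeps the text visible in the target, which is one simple way to picture an intermediate text representation.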
