- Parisa Haghani
- Arun Narayanan
- Michiel Adriaan Unico Bacchiani
- Galen Chuang
- Neeraj Gaur
- Pedro Jose Moreno Mengibar
- Delia Qu
- Rohit Prabhavalkar
- Austin Waters
Abstract
Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio to semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves prediction accuracy.
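To make the joint-optimization idea concrete, here is a minimal sketch of a model that shares an audio encoder between a text (transcription) head and a semantic (intent) head, so that both losses backpropagate into the same encoder. This is not the paper's implementation: the attention-based decoder is simplified to per-frame token classification, and all module names, dimensions, and the multitask loss weight are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of a jointly optimized
# audio-to-semantics model with an intermediate text representation.
import torch
import torch.nn as nn


class JointAudioToSemantics(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=1000, num_intents=20):
        super().__init__()
        # Shared encoder over acoustic feature frames.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # Text head: the intermediate text (grapheme/wordpiece) representation.
        self.text_head = nn.Linear(hidden, vocab_size)
        # Semantic head: intent classification from the final encoder state.
        self.intent_head = nn.Linear(hidden, num_intents)

    def forward(self, audio_feats):
        enc_out, (h_n, _) = self.encoder(audio_feats)
        text_logits = self.text_head(enc_out)      # per-frame token scores
        intent_logits = self.intent_head(h_n[-1])  # utterance-level intent scores
        return text_logits, intent_logits


# Joint training step: both losses flow into the shared encoder (end-to-end).
model = JointAudioToSemantics()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(4, 120, 80)            # 4 utterances, 120 frames of features
tokens = torch.randint(0, 1000, (4, 120))  # toy frame-aligned token targets
intents = torch.randint(0, 20, (4,))       # toy intent labels

text_logits, intent_logits = model(feats)
loss = nn.CrossEntropyLoss()(text_logits.reshape(-1, 1000), tokens.reshape(-1)) \
       + 0.5 * nn.CrossEntropyLoss()(intent_logits, intents)  # assumed 0.5 weight
opt.zero_grad()
loss.backward()
opt.step()
```

In a pipelined baseline, by contrast, the recognizer and the understanding module would each be trained against their own objective in isolation; the sketch above differs only in that the two loss terms are summed and optimized together.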