Low Latency Speech Recognition using End-to-End Prefetching
Abstract
Latency is a crucial metric for streaming speech recognition systems. In this paper, we reduce latency by fetching responses early based on partial recognition results, a technique we refer to as prefetching. Specifically, prefetching works by submitting partial recognition results for subsequent processing, such as obtaining assistant server responses or second-pass rescoring, before the recognition result is finalized. If the partial result matches the final recognition result, the early fetched response can be delivered to the user instantly. This effectively speeds up the system by saving the execution latency that typically occurs after recognition is completed.
Prefetching can be triggered multiple times for a single query, but this leads to multiple rounds of downstream processing and increases the computation cost. It is hence desirable to fetch the result sooner while limiting the number of prefetches. To achieve the best trade-off between latency and computation cost, we investigated a series of prefetching decision models, including decoder-silence-based prefetching, acoustic-silence-based prefetching, and end-to-end prefetching.
In this paper, we demonstrate that the proposed prefetching mechanism reduces latency by 200 ms for a system that consists of a streaming first-pass model using a recurrent neural network transducer (RNN-T) and a non-streaming second-pass rescoring model using Listen, Attend and Spell (LAS) [1]. We observe that end-to-end prefetching provides the best trade-off between cost and latency, and is 100 ms faster than silence-based prefetching at a fixed prefetch rate.
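The prefetching mechanism described above can be illustrated with a minimal sketch. It is not the system evaluated in this paper: the names `PrefetchSession`, `run_query`, `should_prefetch`, and `fetch` are hypothetical, the downstream step is a plain function call standing in for server responses or second-pass rescoring, and the calls are synchronous, whereas a real system would issue prefetches asynchronously while decoding continues.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PrefetchSession:
    """Holds early-fetched responses for one query (hypothetical helper)."""
    fetch: Callable[[str], str]  # stand-in for downstream processing
    cache: Dict[str, str] = field(default_factory=dict)
    num_prefetches: int = 0

    def prefetch(self, partial: str) -> None:
        # Submit a partial result downstream before recognition is finalized.
        if partial not in self.cache:
            self.cache[partial] = self.fetch(partial)
            self.num_prefetches += 1

    def finalize(self, final: str) -> Tuple[str, bool]:
        # If a prefetched partial matches the final result, reuse its response;
        # otherwise pay the downstream latency after finalization.
        if final in self.cache:
            return self.cache[final], True
        return self.fetch(final), False


def run_query(
    partials: List[str],
    final: str,
    should_prefetch: Callable[[int, str], bool],
    fetch: Callable[[str], str],
) -> Tuple[str, bool, int]:
    """Replay one query and return (response, prefetch_hit, num_prefetches)."""
    session = PrefetchSession(fetch=fetch)
    for step, partial in enumerate(partials):
        # The decision model (silence based or end-to-end) is abstracted
        # here as a boolean callback; this is an assumption of the sketch.
        if should_prefetch(step, partial):
            session.prefetch(partial)
    response, hit = session.finalize(final)
    return response, hit, session.num_prefetches


if __name__ == "__main__":
    partials = ["play", "play some", "play some music"]
    # Prefetching on every partial result: a hit, but three downstream calls.
    print(run_query(partials, "play some music", lambda s, p: True, str.upper))
    # Prefetching only on the last partial result: a hit with one call.
    print(run_query(partials, "play some music", lambda s, p: s == 2, str.upper))
```

The two calls in the usage example show the trade-off in miniature: an aggressive decision rule and a selective one both produce a matching prefetch, but the selective rule triggers fewer rounds of downstream processing.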