SMALL FOOTPRINT MULTI-CHANNEL KEYWORD SPOTTING
Abstract
Noise robustness remains a challenging problem in on-device keyword spotting.
Using multiple microphones can improve robustness. While this increases
accuracy, it inevitably raises computational complexity and tends to
require more memory. In this
paper, we propose a new neural-network-based architecture that takes multiple
microphone signals as inputs. It achieves better accuracy while incurring
only a minimal increase in model size. Compared with
a single-channel baseline that runs in parallel on each channel, the
proposed architecture reduces the false reject (FR) rate by a relative 27.2%
and 31.8% on dual-microphone clean and noisy test sets, respectively,
at a rate of 0.1 false accepts (FA) per hour.