

Jilong Wu
Yiteng Huang
Hyun Jin Park
Niranjan Subrahmanya
Patrick Violette
Odyssey 2020 The Speaker and Language Recognition Workshop (2020)


Noise robustness remains a challenging problem in on-device keyword spotting. Using multiple microphones can improve it, but while this increases accuracy, it inevitably raises computational complexity and tends to require more memory. In this paper, we propose a new neural-network-based architecture that takes multiple microphone signals as inputs. It achieves better accuracy while incurring only a minimal increase in model size. Compared with a single-channel baseline that runs in parallel on each channel, the proposed architecture achieves relative reductions in the false reject (FR) rate of 27.2% and 31.8% on dual-microphone clean and noisy test sets, respectively, at a rate of 0.1 false accepts (FA) per hour.
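
To make the "relative" reduction concrete, the following is a minimal sketch of the arithmetic behind such a figure. The function name `relative_reduction` and both FR values are hypothetical and chosen only for illustration; the abstract does not report absolute FR rates, so none of these numbers should be read as results from the paper.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Return the relative reduction of new_rate versus baseline_rate as a fraction."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical FR rates (NOT taken from the paper). Both are assumed to be
# measured at the same operating point of 0.1 false accepts per hour.
baseline_fr = 0.0500  # hypothetical single-channel baseline: 5.00% false rejects
proposed_fr = 0.0364  # hypothetical multi-channel model: 3.64% false rejects

reduction = relative_reduction(baseline_fr, proposed_fr)
print(f"relative FR reduction: {reduction:.1%}")
```

A relative reduction compares the two FR rates as a ratio, so it is only meaningful when both systems are evaluated at the same false-accept rate, as the abstract specifies.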