Bone Conducted Signal Guided Speech Enhancement For Voice Assistant on Earbuds
Abstract
In this work we present a multi-modal, streaming enhancement network to improve speech recognition for voice assistants. The proposed model is guided by the bone-conducted signal (BCS) to separate interfering sources from the target speaker's signal. We trained the model on a simulated speech enhancement training set with a simulated BCS and fine-tuned it on a small, earbuds-specific training set consisting of less than 7 hours of speech. To account for distorted BCS, the enhancement module is complemented by a voice-activity-based decision that discards the enhanced output when the BCS contains no speech information. We also evaluate preprocessing the BCS to account for the low-pass characteristic of bone conduction, which lowers the required transmission bandwidth from the earbuds to the recognition device. The results show that the BCS bandwidth can be reduced to 500 Hz with only a small loss in word error rate (WER). The systems with and without bandwidth reduction are compared to a state-of-the-art multi-channel enhancement method on a realistic test set and outperform the multi-channel model on most of the considered sets.
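The two signal-level ideas in the abstract, band-limiting the BCS to exploit the low-pass characteristic of bone conduction, and gating the enhanced output by a voice-activity decision on the BCS, can be illustrated with a minimal sketch. This is not the paper's method; the filter design, the 500 Hz cutoff taken from the text, and the energy-threshold VAD (`bcs_has_speech`, `lowpass_bcs`) are illustrative assumptions.

```python
import numpy as np

def lowpass_bcs(bcs, fs=16000, cutoff=500.0, taps=101):
    """Band-limit the bone-conducted signal with a windowed-sinc FIR
    low-pass filter (hypothetical helper; 500 Hz matches the reduced
    bandwidth evaluated in the paper)."""
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(2.0 * cutoff / fs * n) * np.hamming(taps)
    h /= h.sum()  # unity DC gain
    return np.convolve(bcs, h, mode="same")

def bcs_has_speech(bcs_frame, energy_thresh=1e-4):
    """Toy energy-based stand-in for the voice-activity decision:
    keep the enhanced output only if the BCS frame carries energy."""
    return float(np.mean(np.square(bcs_frame))) > energy_thresh

fs = 16000
t = np.arange(fs) / fs
speech_like = 0.1 * np.sin(2 * np.pi * 200 * t)   # tone inside the BCS band
noise_like = 0.1 * np.sin(2 * np.pi * 4000 * t)   # tone above the cutoff

# In-band content survives the band limitation; out-of-band content is
# attenuated, so the VAD decision on the filtered BCS rejects it.
print(bcs_has_speech(lowpass_bcs(speech_like, fs=fs)))  # keep enhanced output
print(bcs_has_speech(lowpass_bcs(noise_like, fs=fs)))   # fall back
```

The gate implements the abstract's fallback: when the (band-limited) BCS shows no speech activity, the enhanced output is discarded rather than trusted.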