Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home
Abstract
We describe the structure and application of an acoustic room
simulator to generate large-scale simulated data for training
deep neural networks for far-field speech recognition. The system
simulates millions of different room dimensions, a wide
distribution of reverberation time and signal-to-noise ratios,
and a range of microphone and sound source locations. We
start with a relatively clean training set as the source and artificially
create simulated data by randomly sampling a noise
configuration for every new training example. As a result,
the acoustic model is trained using examples that are virtually
never repeated. We evaluate the performance of this room-simulation-based
approach using a factored complex Fast Fourier Transform (CFFT)
acoustic model introduced in our earlier work, which combines
CFFT layers with an LSTM acoustic model for joint multichannel
processing and acoustic modeling. Results show that
the simulator-driven approach is quite effective in obtaining
large improvements not only in simulated test conditions, but
also in real, rerecorded conditions. This room simulation system
has been employed in training acoustic models, including those
used for the recently released Google Home.
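As a rough illustration of the per-utterance sampling the abstract describes, the sketch below draws one random room configuration (dimensions, reverberation time, SNR, and microphone/source positions) for each training example. This is not the paper's implementation; the parameter ranges and the helper name sample_room_configuration are hypothetical placeholders.

```python
import numpy as np

# Hypothetical sketch of per-utterance room-configuration sampling; the
# parameter ranges below are illustrative placeholders, not the values
# used in the paper.
def sample_room_configuration(rng=None):
    rng = rng or np.random.default_rng()
    # Room dimensions in meters (width, depth, height).
    room = rng.uniform(low=[3.0, 3.0, 2.2], high=[10.0, 10.0, 5.0])
    # Reverberation time (RT60, seconds) and signal-to-noise ratio (dB).
    rt60 = rng.uniform(0.0, 0.9)
    snr_db = rng.uniform(0.0, 30.0)
    # Microphone and sound-source positions, kept at least 0.3 m from walls.
    mic_pos = rng.uniform(low=[0.3, 0.3, 0.3], high=room - 0.3)
    src_pos = rng.uniform(low=[0.3, 0.3, 0.3], high=room - 0.3)
    return {"room_dims": room, "rt60": rt60, "snr_db": snr_db,
            "mic_pos": mic_pos, "src_pos": src_pos}

# A fresh configuration is drawn for every training utterance, so the
# augmented examples are virtually never repeated.
config = sample_room_configuration()
```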