Felicia S. C. Lim
Authored Publications
Abstract
We present a multi-channel audio signal generation scheme based on machine learning and probabilistic modeling. We start from modeling a multi-channel single-source signal. Such signals are naturally modeled as a single-channel reference signal and a spatial-arrangement (SA) model specified by an SA parameter sequence. We focus on the SA model and assume that the reference signal is described by some parameter sequence. The SA model parameters are described with a learned probability distribution that is conditioned on the reference-signal parameter sequence and, optionally, an SA conditioning sequence. If present, the SA conditioning sequence specifies a signal class or a specific signal. The single-source method can be used for multi-source signals by applying source separation or by using an SA model that operates on non-overlapping frequency bands. Our GAN-based stereo coding implementation of the latter approach shows that our paradigm facilitates plausible high-quality rendering at a low bit rate for the SA conditioning.
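Below is a minimal, self-contained sketch (not the paper's implementation) of the SA idea: a toy conditional generator maps reference-signal features, plus an optional SA conditioning vector, to per-band spatial parameters that render a mono reference into stereo. All shapes, the two-layer generator, and the choice of inter-channel level and phase differences as SA parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BANDS, REF_DIM, COND_DIM, HID = 8, 16, 4, 32

# Toy generator weights; in the paper this mapping is a learned probability
# model (GAN-based in the stereo implementation), not a fixed random net.
W1 = rng.normal(scale=0.1, size=(REF_DIM + COND_DIM, HID))
W2 = rng.normal(scale=0.1, size=(HID, 2 * N_BANDS))  # level + phase per band

def sa_params(ref_feats, sa_cond):
    """Map reference features (+ optional SA conditioning) to per-band
    inter-channel level differences (dB) and phase differences (rad)."""
    h = np.tanh(np.concatenate([ref_feats, sa_cond]) @ W1)
    out = h @ W2
    return out[:N_BANDS], np.pi * np.tanh(out[N_BANDS:])

def render_stereo(ref_bands, ild, ipd):
    """Apply per-band SA parameters to mono reference bands."""
    left = ref_bands
    right = 10.0 ** (ild / 20.0) * ref_bands * np.exp(1j * ipd)
    return left, right

ref_feats = rng.normal(size=REF_DIM)   # stand-in for reference-signal parameters
sa_cond = rng.normal(size=COND_DIM)    # stand-in for the SA conditioning sequence
ref_bands = rng.normal(size=N_BANDS) + 1j * rng.normal(size=N_BANDS)
left, right = render_stereo(ref_bands, *sa_params(ref_feats, sa_cond))
print(left.shape, right.shape)
```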
Handling Background Noise in Neural Speech Generation
Tom Denton
Alejandro Luebs
Andrew Storus
Hengchin Ye
W. Bastiaan Kleijn
2020 Asilomar Conference on Signals, Systems, and Computers (2021)
Abstract
Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason for this and discuss methods to overcome the issue. Placing a denoising preprocessing stage before feature extraction while targeting clean speech during training is shown to be the best-performing strategy.
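As a rough illustration of that strategy, the toy training step below extracts features from a denoised version of the noisy input while keeping the clean speech as the training target. The denoiser, feature extractor, and one-template "vocoder" are hypothetical stand-ins, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(noisy):
    # Placeholder for the denoising preprocessing stage (e.g. a pretrained
    # enhancement network); identity here so the sketch runs end to end.
    return noisy

def extract_features(x, hop=160):
    # Placeholder feature extractor (stand-in for e.g. log-mel analysis).
    return x[: len(x) // hop * hop].reshape(-1, hop).mean(axis=1)

class ToyVocoder:
    """Stand-in for a neural speech generator: each frame's (scalar) feature
    scales a learned length-`hop` waveform template."""

    def __init__(self, hop=160):
        self.hop = hop
        self.w = np.random.default_rng(1).normal(scale=0.01, size=hop)

    def train_step(self, noisy, clean, lr=0.1):
        feats = extract_features(denoise(noisy), self.hop)  # denoised features,
        n = feats.size * self.hop
        pred = np.kron(feats, self.w)                       # toy synthesis,
        err = (pred - clean[:n]).reshape(-1, self.hop)      # clean targets
        self.w -= lr * (feats[:, None] * err).mean(axis=0)  # gradient step
        return float((err ** 2).mean())

clean = rng.normal(size=16000)
noisy = clean + 0.3 * rng.normal(size=16000)
print(ToyVocoder().train_step(noisy, clean))
```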
Generative Speech Coding with Predictive Variance Regularization
Alejandro Luebs
Andrew Storus
Bastiaan Kleijn
Michael Chinen
Tom Denton
Yero Yeh
ICASSP 2021 (2021)
Abstract
The recent emergence of machine-learning based generative models for speech suggests that a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum-likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can further increase performance. We provide extensive subjective performance evaluations showing that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.
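A hedged sketch of the regularization idea, assuming a Gaussian predictive distribution (the paper's exact formulation may differ): the negative log-likelihood is augmented with a penalty that discourages very small predictive variances, which otherwise make teacher-forced maximum-likelihood training overly sensitive to outliers. The penalty form and weight `lam` are assumptions.

```python
import numpy as np

def regularized_nll(x, mu, log_var, lam=0.1):
    """Gaussian NLL plus a penalty that discourages tiny predictive variances.
    The penalty form and weight `lam` are assumptions for illustration."""
    nll = 0.5 * (log_var + (x - mu) ** 2 / np.exp(log_var))
    penalty = -log_var                   # grows as predicted variance shrinks
    return float(np.mean(nll + lam * penalty))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                    # next-sample targets
mu = x + 0.1 * rng.normal(size=1000)         # toy predictive means
log_var = np.full(1000, -2.0)                # toy predictive log-variances
print(regularized_nll(x, mu, log_var))
```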
Abstract
Rapid advances in machine-learning based generative modeling of speech make its use in speech coding attractive. However, the current performance of such models drops rapidly with noise contamination of the input, preventing their use in practical applications. We present a new speech-coding scheme based on features that are robust to the distortions occurring in speech-coder input signals. To this end, we encourage the feature encoder to produce the same independent features for each of a set of linguistically equivalent signals, obtained by adding various noises to a common clean signal. The independent features, subjected to scalar quantization, are used as a conditioning vector sequence for WaveNet. Our experiments show that a 1.8 kb/s implementation of the resulting coder provides state-of-the-art performance for clean signals and is additionally robust to noisy input.
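As a back-of-the-envelope illustration of the 1.8 kb/s budget, the arithmetic below divides the bit rate over an assumed conditioning frame rate and feature dimension; the frame rate and dimension are illustrative guesses, not the paper's values.

```python
# All numbers besides the 1.8 kb/s rate are illustrative assumptions.
BITRATE = 1800        # bits per second
FRAME_RATE = 50       # conditioning vectors per second (assumed)
FEATURE_DIM = 12      # independent features per vector (assumed)

bits_per_vector = BITRATE / FRAME_RATE            # 36 bits per conditioning vector
bits_per_feature = bits_per_vector / FEATURE_DIM  # 3 bits per scalar feature
levels = 2 ** bits_per_feature                    # 8 scalar-quantizer levels
print(bits_per_vector, bits_per_feature, levels)
```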
Abstract
Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements over previous versions in terms of both design and usage. As an open-source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production usage. Feedback from internal production teams at Google has helped to improve this new release, and serves to show the cases where it is most applicable as well as to highlight its limitations. The new model is benchmarked against real-world data for evaluation purposes. Trends and directions for future work are discussed.
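A hypothetical usage sketch for the open-source ViSQOL v3 binary is shown below; the flag names follow the public README as best recalled and should be treated as assumptions (check the binary's --help output for the authoritative interface).

```python
# Hypothetical invocation of the ViSQOL v3 command-line binary. The flag
# names are assumptions recalled from the public README, not guaranteed.
import subprocess

result = subprocess.run(
    ["./visqol",
     "--reference_file", "ref.wav",
     "--degraded_file", "deg.wav",
     "--use_speech_mode"],      # speech model; omit for the audio model
    capture_output=True, text=True,
)
print(result.stdout)            # prints a MOS-LQO style quality score
```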
Generative Speech Enhancement Based on Cloned Networks
Bastiaan Kleijn
Michael Chinen
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2019)
Abstract
We propose to implement speech enhancement by the regeneration of clean speech from a 'salient' representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural-sounding errors that stay on the speech manifold, in place of the traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clone-based system matches or outperforms the other systems at each input signal-to-noise ratio (SNR) range with statistical significance.
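The sketch below illustrates the overall enhancement pipeline at inference time with hypothetical stand-ins: a weight-shared salient-feature extractor feeds a generative decoder that re-synthesizes speech rather than filtering the noisy waveform. Every component here is a toy placeholder for the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
HOP, DIM = 160, 8

def mel_spectrum(x):
    # Stand-in for mel-frequency analysis: one toy coefficient per frame.
    return np.abs(x[: len(x) // HOP * HOP].reshape(-1, HOP)).mean(axis=1, keepdims=True)

W_extract = rng.normal(scale=0.1, size=(1, DIM))   # weights shared by all clones

def salient_features(x):
    # Noise-robust conditioning sequence (trained via the clone scheme).
    return np.tanh(mel_spectrum(x) @ W_extract)

def generate(features):
    # Stand-in for the WaveNet that regenerates clean speech; a real system
    # samples autoregressively, conditioned on `features`.
    return np.repeat(features @ rng.normal(size=(DIM, 1)), HOP).ravel()

noisy = rng.normal(size=16000)
enhanced = generate(salient_features(noisy))
print(enhanced.shape)           # same length as the input signal
```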
Abstract
We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their input into a set of unit-variance features that is identical across the clones. The training procedure can be unsupervised, or supervised with a decoder that attempts to reconstruct a desired target signal. As an application, we train a system that extracts a time sequence of feature vectors from speech and uses it as conditioning for a WaveNet generative system, facilitating both coding and enhancement.
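A minimal sketch of the clone training objective under stated assumptions: one weight-shared toy encoder is applied to several equivalent (differently noised) signals, its outputs are normalized to unit variance per feature, and the loss penalizes disagreement across clones. The encoder shape, normalization, and loss form are assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(160, 8))     # encoder weights shared by clones

def encode(x, hop=160):
    frames = x[: len(x) // hop * hop].reshape(-1, hop)
    f = np.tanh(frames @ W)
    return f / (f.std(axis=0, keepdims=True) + 1e-8)   # unit-variance features

def clone_loss(equivalent_signals):
    # Penalize each clone's deviation from the across-clone mean features.
    feats = np.stack([encode(s) for s in equivalent_signals])  # (clones, T, D)
    return float(((feats - feats.mean(axis=0)) ** 2).mean())

clean = rng.normal(size=16000)
clones_in = [clean + 0.3 * rng.normal(size=16000) for _ in range(4)]
print(clone_loss(clones_in))     # drives the clones toward identical features
```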
Wavenet based low rate speech coding
W. Bastiaan Kleijn
Alejandro Luebs
Florian Stimberg
Thomas C. Walters
arXiv preprint arXiv:1712.01120 (2017)
Abstract
Traditional parametric coding of speech facilitates low rates but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high-quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet-based coder and show that the system additionally performs implicit bandwidth extension, and that the speech it produces does not significantly impair recognition of the original speaker for the human listener, even when that speaker was not used during the training of the generative model.
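The sketch below illustrates the decoder-side arrangement with toy components: parameter vectors recovered from the parametric bit stream are upsampled to the sample rate and used as the conditioning input of a generative model. The frame rate, parameter dimension, and sample-and-hold upsampling are illustrative assumptions.

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_RATE = 50                        # parametric frames per second (assumed)
HOP = SAMPLE_RATE // FRAME_RATE

def upsample_conditioning(params):
    # Sample-and-hold each decoded parameter vector over one frame of samples.
    return np.repeat(params, HOP, axis=0)

def generate_speech(cond, w):
    # Stand-in for autoregressive WaveNet sampling conditioned on `cond`.
    return np.tanh(cond @ w)

rng = np.random.default_rng(0)
decoded = rng.normal(size=(FRAME_RATE, 6))   # 1 s of decoded spectral/pitch params
speech = generate_speech(upsample_conditioning(decoded), rng.normal(size=6))
print(speech.shape)                          # (16000,)
```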
Robust and low-complexity blind source separation for meeting rooms
W. Bastiaan Kleijn
Proceedings Fifth Joint Workshop on Hands-free Speech Communication and Microphone Arrays (2017)
Abstract
The aim of this work is to provide robust, low-complexity demixing of sound sources from a set of microphone signals for a typical meeting scenario where the source mixture is relatively sparse in time.
We define a similarity matrix that characterizes the similarity of the spatial signature of the observations at different time instants within a frequency band. Each entry of the similarity matrix is the sum of a set of kernelized similarity measures, each operating on a single frequency bin. The kernelization leads to high robustness as it reduces the importance of outliers. Clustering by means of affinity propagation provides the separation of talkers without the need to specify the number of talkers in advance. The clusters can be used directly for separation, or as a global pre-processing method that identifies sources for an adaptive demixing procedure. Our experimental results confirm that the approach performs significantly better than two reference methods.
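The sketch below is a toy instance of the clustering step: it builds a time-by-time similarity matrix from kernelized per-bin comparisons of normalized spatial signatures and clusters it with affinity propagation, which needs no preset talker count. The Gaussian kernel, its width, and the synthetic signatures are assumptions.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
T, BINS, MICS = 60, 16, 4

# Toy spatial signatures: two "talkers" active in turn, plus observation noise.
steering = rng.normal(size=(2, BINS, MICS)) + 1j * rng.normal(size=(2, BINS, MICS))
src = np.repeat(np.arange(2), T // 2)
obs = steering[src] + 0.1 * (rng.normal(size=(T, BINS, MICS))
                             + 1j * rng.normal(size=(T, BINS, MICS)))
obs /= np.linalg.norm(obs, axis=-1, keepdims=True)  # unit-norm spatial signatures

def similarity_matrix(x, gamma=4.0):
    # Sum of per-bin kernelized similarities; the bounded kernel limits the
    # influence any single bin (and hence any outlier) can have.
    s = np.zeros((T, T))
    for b in range(BINS):
        g = np.abs(x[:, b, :] @ x[:, b, :].conj().T)  # |cosine| per bin
        s += np.exp(-gamma * (1.0 - g))
    return s

labels = AffinityPropagation(affinity="precomputed", random_state=0).fit(
    similarity_matrix(obs)).labels_
print(labels)   # cluster labels per time instant; talker count not preset
```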
Incoherent idempotent ambisonics rendering
W. Bastiaan Kleijn
Andrew Allen
2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (2017)
Abstract
We describe a family of ambisonics rendering methods that is based on optimizing the soundfield component that lies in the null space of the operator that maps the loudspeaker signals onto the given ambisonics representation. In contrast to traditional rendering approaches, the new method avoids the coherent addition of loudspeaker contributions to the sound field. As a result, it provides a space-invariant timbre and good spatial directionality outside the spatial region where the ambisonics soundfield description is accurate. The new method is idempotent at all frequencies and has relatively low computational complexity. Our experimental results confirm the effectiveness of the method.
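The core linear-algebra step can be sketched as follows: if A maps loudspeaker signals to the ambisonics representation, then any rendering g = pinv(A) @ b + n with n in the null space of A reproduces the same ambisonics signal b, and the method's freedom lies in choosing n. The loudspeaker layout, ambisonics order, and random choice of n below are illustrative assumptions; the paper optimizes n.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SPEAKERS, N_AMBI = 8, 4            # 8 loudspeakers, first-order ambisonics

A = rng.normal(size=(N_AMBI, N_SPEAKERS))   # loudspeakers -> ambisonics map
b = rng.normal(size=N_AMBI)                 # target ambisonics coefficients

g_min = np.linalg.pinv(A) @ b               # minimum-norm rendering

# Orthonormal basis for null(A) from the SVD; these directions change the
# loudspeaker signals without changing the ambisonics representation.
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[len(s):]

# The method's freedom lies in choosing the null-space component; here it is
# random, whereas the paper optimizes it to avoid coherent addition.
g = g_min + null_basis.T @ rng.normal(size=null_basis.shape[0])

print(np.allclose(A @ g, b))                # True: same ambisonics signal
```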