Felicia S. C. Lim

Authored Publications
    Multi-Channel Audio Signal Generation
    W. Bastiaan Kleijn
    Michael Chinen
    ICASSP 2023 (2023)
    We present a multi-channel audio signal generation scheme based on machine-learning and probabilistic modeling. We start from modeling a multi-channel single-source signal. Such signals are naturally modeled as a single-channel reference signal and a spatial-arrangement (SA) model specified by an SA parameter sequence. We focus on the SA model and assume that the reference signal is described by some parameter sequence. The SA model parameters are described with a learned probability distribution that is conditioned by the reference-signal parameter sequence and, optionally, an SA conditioning sequence. If present, the SA conditioning sequence specifies a signal class or a specific signal. The single-source method can be used for multi-source signals by applying source separation or by using an SA model that operates on non-overlapping frequency bands. Our GAN-based stereo coding implementation of the latter approach shows that our paradigm facilitates plausible high-quality rendering at a low bit rate for the SA conditioning.
    Handling Background Noise in Neural Speech Generation
    Tom Denton
    Alejandro Luebs
    Andrew Storus
    Hengchin Ye
    W. Bastiaan Kleijn
    2020 Asilomar Conference on Signals, Systems, and Computers (2021)
    Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Applying a denoising preprocessing stage when extracting features, and targeting clean speech during training, is shown to be the best-performing strategy.
    Generative Speech Coding with Predictive Variance Regularization
    Alejandro Luebs
    Andrew Storus
    Bastiaan Kleijn
    Michael Chinen
    Tom Denton
    Yero Yeh
    ICASSP 2021 (2021)
    The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.
    Rapid advances in machine-learning based generative modeling of speech make its use in speech coding attractive. However, the current performance of such models drops rapidly with noise contamination of the input, preventing use in practical applications. We present a new speech-coding scheme that is based on features that are robust to the distortions occurring in speech-coder input signals. To this end, we encourage the feature encoder to provide the same independent features for each of a set of linguistically equivalent signals, obtained by adding various noises to a common clean signal. The independent features, subjected to scalar quantization, are used as a conditioning vector sequence for WaveNet. Our experiments show that a 1.8 kb/s implementation of the resulting coder provides state-of-the-art performance for clean signals, and is additionally robust to noisy input.
    Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production usage. The feedback from internal production teams at Google has helped to improve this new release, and serves to show cases where it is most applicable, as well as to highlight limitations. The new model is benchmarked against real-world data for evaluation purposes. The trends and direction of future work are discussed.
    Generative Speech Enhancement Based on Cloned Networks
    Bastiaan Kleijn
    Michael Chinen
    IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2019)
    We propose to implement speech enhancement by the regeneration of clean speech from a 'salient' representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural-sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clone-based system matches or outperforms the other systems at each input signal-to-noise ratio (SNR) range with statistical significance.
    We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their input into a set of unit-variance features that is identical across the clones. The training procedure can be unsupervised, or supervised with a decoder that attempts to reconstruct a desired target signal. As an application, we train a system that extracts a time-sequence of feature vectors of speech and uses it as conditioning for a WaveNet generative system, facilitating both coding and enhancement.
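    The clone objective described in the abstract above can be illustrated with a rough sketch. The abstract does not give the actual objective function, so the two terms below (an agreement term across clones and a unit-variance penalty) and the name `clone_feature_loss` are assumptions chosen only to mirror the wording "unit-variance features that is identical across the clones".

```python
def clone_feature_loss(clone_outputs):
    """Illustrative (assumed) objective for cloned encoders.

    clone_outputs[c][n][d] is feature d of batch item n produced by clone c.
    Returns (agreement, variance_penalty) as plain floats.
    """
    n_clones = len(clone_outputs)
    n_items = len(clone_outputs[0])
    n_dims = len(clone_outputs[0][0])

    # Agreement: mean squared deviation of each clone from the clone average.
    agreement = 0.0
    for n in range(n_items):
        for d in range(n_dims):
            vals = [clone_outputs[c][n][d] for c in range(n_clones)]
            mean = sum(vals) / n_clones
            agreement += sum((v - mean) ** 2 for v in vals) / n_clones
    agreement /= n_items * n_dims

    # Unit variance: penalize feature dimensions whose variance, pooled over
    # clones and batch items, departs from 1.
    variance_penalty = 0.0
    for d in range(n_dims):
        vals = [clone_outputs[c][n][d]
                for c in range(n_clones) for n in range(n_items)]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        variance_penalty += (var - 1.0) ** 2
    variance_penalty /= n_dims

    return agreement, variance_penalty


# Two clones that agree on unit-variance features incur no loss.
matched = [[[-1.0], [1.0]], [[-1.0], [1.0]]]
# A clone that collapses to zero disagrees and shrinks the pooled variance.
mismatched = [[[-1.0], [1.0]], [[0.0], [0.0]]]
print(clone_feature_loss(matched))
print(clone_feature_loss(mismatched))
```

    In a real system the clone outputs would come from weight-sharing networks fed different equivalent signals, and the two terms would be combined and minimized by gradient descent; the sketch only shows what each term measures.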
    Incoherent idempotent ambisonics rendering
    W. Bastiaan Kleijn
    Andrew Allen
    2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (2017)
    We describe a family of ambisonics rendering methods that is based on optimizing the soundfield component that lies in the null space of the operator that maps the loudspeaker signals onto the given ambisonics representation. In contrast to traditional rendering approaches, the new method avoids the coherent addition of loudspeaker contributions to the sound field. As a result, it provides a space-invariant timbre and good spatial directionality outside the spatial region where the ambisonics soundfield description is accurate. The new method is idempotent at all frequencies and has relatively low computational complexity. Our experimental results confirm the effectiveness of the method.
    Robust and low-complexity blind source separation for meeting rooms
    W. Bastiaan Kleijn
    Proceedings Fifth Joint Workshop on Hands-free Speech Communication and Microphone Arrays (2017)
    The aim of this work is to provide robust, low-complexity demixing of sound sources from a set of microphone signals for a typical meeting scenario where the source mixture is relatively sparse in time. We define a similarity matrix that characterizes the similarity of the spatial signature of the observations at different time instants within a frequency band. Each entry of the similarity matrix is the sum of a set of kernelized similarity measures, each operating on a single frequency bin. The kernelization leads to high robustness as it reduces the importance of outliers. Clustering by means of affinity propagation provides the separation of talkers without the need to specify the number of talkers in advance. The clusters can be used directly for separation, or they can be used as a global pre-processing method that identifies sources for an adaptive demixing procedure. Our experimental results confirm that the approach performs significantly better than two reference methods.
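    As a rough illustration of the similarity matrix described in the abstract above, the sketch below sums a per-bin kernel over the bins of one frequency band. The abstract does not specify the kernel or the distance, so the Gaussian kernel, the phase-invariant distance, the `sigma` value, and all function names are assumptions. The affinity-propagation clustering step is not shown.

```python
import math


def normalize(vec):
    """Scale a microphone observation vector to unit norm (its spatial signature)."""
    norm = math.sqrt(sum(abs(v) ** 2 for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def bin_similarity(x, y, sigma=0.5):
    """Assumed Gaussian kernel on a gain- and phase-invariant distance."""
    inner = sum(a * b.conjugate() for a, b in zip(x, y))
    dist_sq = max(0.0, 1.0 - abs(inner) ** 2)  # 0 when the signatures align
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


def similarity_matrix(band):
    """band[t][k] is the microphone vector at time t and frequency bin k.

    Entry (i, j) is the sum over bins of the per-bin kernel values, so a
    single outlier bin can change an entry by at most 1.
    """
    sigs = [[normalize(v) for v in frame] for frame in band]
    n = len(sigs)
    return [[sum(bin_similarity(a, b) for a, b in zip(sigs[i], sigs[j]))
             for j in range(n)] for i in range(n)]


# Two microphones, two bins, three time frames. Frames 0 and 1 share a
# spatial signature (different gain only); frame 2 comes from elsewhere.
band = [
    [[1, 1j], [1, -1]],
    [[2, 2j], [0.5, -0.5]],
    [[1, -1j], [1, 1]],
]
S = similarity_matrix(band)
print(S[0][1], S[0][2])
```

    Because each kernel value is bounded, frames from the same talker keep a high summed similarity even if a few bins are corrupted, which is the robustness property the abstract attributes to kernelization.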
    Wavenet based low rate speech coding
    W. Bastiaan Kleijn
    Alejandro Luebs
    Florian Stimberg
    Thomas C. Walters
    arXiv preprint arXiv:1712.01120 (2017)
    Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model.