Jan Skoglund

Jan Skoglund received his Ph.D. degree from Chalmers University of Technology, Sweden. From 1999 to 2000, he worked on low bit rate speech coding at AT&T Labs-Research, Florham Park, NJ. He was with Global IP Solutions (GIPS), San Francisco, CA, from 2000 to 2011, working on speech and audio processing tailored for packet-switched networks. GIPS' audio and video technology was found in many deployments by, e.g., IBM, Google, Yahoo, WebEx, Skype, and Samsung. Since the 2011 acquisition of GIPS, he has been part of Chrome at Google, Inc. He leads a team in San Francisco, CA, developing speech and audio signal processing components for capture, real-time communication, storage, and rendering.
Authored Publications
    Preview abstract This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed through the Euclidean distance of their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis (e.g., quality assessment) and for generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics human capabilities in assessing audio quality with non-matching references to learn perceptual embeddings without the need for human-generated labels. View details
    Twenty-Five Years of Evolution in Speech and Language Processing
    Michael Picheny
    Bhuvana Ramabhadran
    Dilek Hakkani-Tur
    IEEE Signal Processing Magazine, 40(2023), pp. 27-39
    Convolutional Transformer for Neural Speech Coding
    Hong-Goo Kang
    Bastiaan Kleijn
    Michael Chinen
    Audio Engineering Society Convention 155(2023)
    Preview abstract In this paper, we propose a Convolutional-Transformer speech codec (ConvT-SC) which utilizes stacks of convolutions and self-attention layers to remove redundant information at the downsampling and upsampling blocks of a U-Net-style encoder-decoder neural codec architecture. We design the Transformers to use channel and temporal attention with any number of attention stages and heads while maintaining causality. This allows us to take into consideration the characteristics of the input vectors and flexibly utilize temporal and channel-wise relationships at different scales when encoding the salient information that is present in speech. This enables our model to reduce the dimensionality of its latent embeddings and improve its quantization efficiency while maintaining quality. Experimental results demonstrate that our approach achieves significantly better performance than convolution-only baselines. View details
    A High-rate Extension to SoundStream
    Andrew Storus
    Hong-Goo Kang
    Yero Yeh
    2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)(2023)
    Preview abstract In this paper, we propose a high-rate extension of the SoundStream codec, which is able to generate almost transparent quality audio at 16 kbps for wideband speech signals. SoundStream shows reasonably good performance at low bit-rates (e.g. around 9 kbps), but its performance does not improve much when using more bits for encoding the latent embeddings. Motivated by experimental results showing that neural audio codec performance is highly related to the characteristics of latent embeddings such as dimensionality, dependency, and probability density function shape, we propose a convolutional transformer architecture and an attention-based multi-scale latent decomposition method that significantly enhances codec performance when quantizing high-dimensional embeddings. Experimental results show the superiority of our proposed model over conventional approaches. View details
    LMCODEC: A LOW BITRATE SPEECH CODEC WITH CAUSAL TRANSFORMER MODELS
    Bastiaan Kleijn
    Michael Chinen
    Neil Zeghidour
    Teerapat Jenrungrot
    ICASSP 2023(2023)
    Preview abstract We introduce LMCodec, a fully-causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec. View details
    Context-Based Evaluation of the Opus Audio Codec for Spatial Audio Content in Virtual Reality
    Ben Lee
    Tomasz Rudzki
    Gavin Kearney
    Journal of the Audio Engineering Society, 2023 April - Volume 71 Number 4(2023)
    MULTI-CHANNEL AUDIO SIGNAL GENERATION
    W. Bastiaan Kleijn
    Michael Chinen
    ICASSP 2023(2023)
    Preview abstract We present a multi-channel audio signal generation scheme based on machine-learning and probabilistic modeling. We start from modeling a multi-channel single-source signal. Such signals are naturally modeled as a single-channel reference signal and a spatial-arrangement (SA) model specified by an SA parameter sequence. We focus on the SA model and assume that the reference signal is described by some parameter sequence. The SA model parameters are described with a learned probability distribution that is conditioned by the reference-signal parameter sequence and, optionally, an SA conditioning sequence. If present, the SA conditioning sequence specifies a signal class or a specific signal. The single-source method can be used for multi-source signals by applying source separation or by using an SA model that operates on non-overlapping frequency bands. Our GAN-based stereo coding implementation of the latter approach shows that our paradigm facilitates plausible high-quality rendering at a low bit rate for the SA conditioning. View details
    Using Rater and System Metadata to Explain Variance in the VoiceMOS Dataset
    Alessandro Ragano
    Andrew Hines
    Chandan K. Reddy
    Michael Chinen
    Interspeech 2022
    Preview abstract Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable. View details
    Ultra Low-Bitrate Speech Coding with Pretrained Transformers
    Ali Siakoohi
    Bastiaan Kleijn
    Michael Chinen
    Tom Denton
    Interspeech 2022
    Preview abstract Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in performance over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. Our numerical experiments show that supplementing the encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of 600 bps that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. The subjective human evaluations also suggest that the perceived quality of the resulting codec is comparable to or better than that of conventional codecs operating at 3-4 times the rate. View details
    SoundStream: An End-to-End Neural Audio Codec
    Neil Zeghidour
    Alejandro Luebs
    Transactions on Audio, Speech and Language Processing(2021)
    Preview abstract We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech. View details
    Handling Background Noise in Neural Speech Generation
    Tom Denton
    Alejandro Luebs
    Andrew Storus
    Hengchin Ye
    W. Bastiaan Kleijn
    2020 Asilomar Conference on Signals, Systems, and Computers(2021)
    Preview abstract Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing stage when extracting features and targeting clean speech during training is shown to be the best performing strategy. View details
    GENERATIVE SPEECH CODING WITH PREDICTIVE VARIANCE REGULARIZATION
    Alejandro Luebs
    Andrew Storus
    Bastiaan Kleijn
    Michael Chinen
    Tom Denton
    Yero Yeh
    ICASSP 2021(2021)
    Preview abstract The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity. View details
    Partial Monotonic Speech Quality Estimation in ViSQOL with Deep Lattice Networks
    Andrew Hines
    Michael Chinen
    Journal of the Acoustical Society of America, 149(2021), pp. 3851-3861
    Preview abstract When predicting subjective quality as mean opinion score (MOS) for speech, a raw similarity score is often mapped onto the score dimension with a mapping function. Virtual Speech Quality Objective Listener (ViSQOL) uses monotonic one-dimensional mappings to evaluate speech. More recent models such as support vector regression (SVR) or deep neural networks (DNNs) use multidimensional input, which allows for a more accurate prediction, but do not provide the monotonic property that is expected. We propose to integrate a multi-dimensional mapping function using deep lattice networks (DLNs) into ViSQOL. DLNs also provide some insight into model interpretation and are robust to overfitting, leading to better out-of-sample performance. With the DLN, ViSQOL improved the speech mapping from the previous exponential mapping's .58 MSE to .24 MSE on a mixture of datasets, outperforming the 1-D fitted functions, SVR, as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well calibrated and a useful measure of uncertainty. With this quantile function, the model is able to provide useful quantile intervals for predictions instead of point intervals. View details
    WARP-Q: Quality Prediction For Generative Neural Speech Codecs
    Andrew Hines
    Michael Chinen
    Wissam Jassim
    ICASSP 2021(2021)
    Preview abstract Speech coding has been shown to achieve good speech quality using either waveform matching or parametric reconstruction. For very low bitrate streams, recently developed generative speech models can reconstruct high quality wide band speech from the bit streams of standard parametric encoders at less than 3 kb/s. Generative codecs create high quality codec speech based on synthesising speech from a DNN and the parametric input. Existing objective speech quality models (e.g. ViSQOL, POLQA) cannot be used to accurately evaluate the quality of generatively coded speech as they penalise them based on signal differences not apparent in subjective listening test results. This paper presents WARP-Q, a full-reference objective speech quality metric that uses dynamic time warping cost for MFCC representations of the signals. It is robust to the codec changes introduced by low-bitrate neural vocoders. Evaluation using waveform matching, parametric and generative neural vocoder based codecs as well as channel and environmental noise shows that WARP-Q has better correlation and codec quality ranking for novel codecs compared to traditional metrics as well as versatility and potential for additive noise and channel degradations. View details
    Preview abstract Rapid advances in machine-learning based generative modeling of speech make its use in speech coding attractive. However, the current performance of such models drops rapidly with noise contamination of the input, preventing use in practical applications. We present a new speech-coding scheme that is based on features that are robust to the distortions occurring in speech-coder input signals. To this purpose, we encourage the feature encoder to provide the same independent features for each of a set of linguistically equivalent signals, obtained by adding various noises to a common clean signal. The independent features, subjected to scalar quantization, are used as a conditioning vector sequence for WaveNet. Our experiments show that a 1.8 kb/s implementation of the resulting coder provides state-of-the-art performance for clean signals, and is additionally robust to noisy input. View details
    Preview abstract Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production usage. The feedback from internal production teams at Google has helped to improve this new release, and serves to show cases where it is most applicable, as well as to highlight limitations. The new model is benchmarked against real-world data for evaluation purposes. The trends and direction of future work are discussed. View details
    Preview abstract This study compares the performances of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates are contrasted. Performance analysis of the coded speech is evaluated for different quality aspects: accuracy of pitch periods estimation, the word error rates for automatic speech recognition, and the influence of speaker gender and coding delays. A number of performance metrics of speech samples taken from a publicly available database were compared with subjective scores. Results from subjective quality assessment do not correlate well with existing full reference speech quality metrics. The results provide valuable insights into aspects of the speech signal that will be used to develop a novel metric to accurately predict speech quality from generative-model-based coders. View details
    GENERATIVE SPEECH ENHANCEMENT BASED ON CLONED NETWORKS
    Bastiaan Kleijn
    Michael Chinen
    IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)(2019)
    Preview abstract We propose to implement speech enhancement by the regeneration of clean speech from a 'salient' representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clone-based system matches or outperforms the other systems at each input signal-to-noise ratio (SNR) range with statistical significance. View details
    Preview abstract Neural speech synthesis algorithms are a promising new approach for coding speech at very low bitrate. They have so far demonstrated quality that far exceeds traditional vocoders, at the cost of very high complexity. In this work, we present a low bit rate neural vocoder based on the LPCNet model. The use of linear prediction and sparse recurrent networks makes it possible to achieve real-time operation on general-purpose hardware. We demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate. This opens the way for new codec designs based on neural synthesis models. View details
    Auditory Localization in Low Bitrate Compressed Ambisonic Scenes
    Tomasz Rudzki
    Ignacio Gomez-Lanzaco
    Jessica Stubbs
    Gavin Kearney
    Damian Murphy
    Applied Sciences, 9 (13)(2019)
    Preview abstract The increasing popularity of Ambisonics as a spatial audio format for streaming services poses new challenges to existing audio coding techniques. Immersive audio delivered to mobile devices requires an efficient bitrate compression that does not affect the spatial quality of the content. Good localizability of virtual sound sources is one of the key elements that has to be preserved. This study was conducted in order to investigate the localization precision of virtual sound source presentations within Ambisonic scenes encoded with Opus low bitrate compression at different bitrates and Ambisonic orders (1st, 3rd, and 5th). The test stimuli were reproduced over a 50-channel spherical loudspeaker configuration and binaurally using individually measured and generic HRTFs. Participants were asked to adjust the position of a virtual acoustic pointer to match the position of virtual sound source within the bitrate compressed Ambisonic scene. Results show that auditory localization in low bitrate compressed Ambisonic scenes is not significantly affected by codec parameters. The key factors influencing localization are the rendering method and Ambisonic order truncation. This suggests that efficient perceptual coding might be successfully used for mobile spatial audio delivery. View details
    Preview abstract Neural speech synthesis models have recently demonstrated the ability to synthesize high quality speech for text-to-speech and compression applications. These new models often require powerful GPUs to achieve real-time operation, so being able to reduce their complexity would open the way for many new applications. We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis. We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS. This makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones. View details
    Preview abstract We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their input into a set of unit-variance features that is identical across the clones. The training procedure can be performed in an unsupervised manner, or in a supervised manner with a decoder that attempts to reconstruct a desired target signal. As an application, we train a system that extracts a time-sequence of feature vectors of speech and uses it as a conditioning of a WaveNet generative system, facilitating both coding and enhancement. View details
    EXPLORING TRADEOFFS IN MODELS FOR LOW-LATENCY SPEECH ENHANCEMENT
    Jeremy Thorpe
    Michael Chinen
    Proceedings of the 16th International Workshop on Acoustic Signal Enhancement(2018)
    Preview abstract We explore a variety of configurations of neural networks for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on state-of-the-art performance on the CHiME2 speech enhancement task. We examine trade-offs among non-causal lookahead, compute work, and parameter count versus enhancement performance and find that zero-lookahead models can achieve, on average, only 0.5 dB worse performance than our best bidirectional model. Further, we find that 200 milliseconds of lookahead is sufficient to achieve performance within about 0.2 dB from our best bidirectional model. View details
    AMBIQUAL – a full reference objective quality metric for ambisonic spatial audio
    Andrew Hines
    Drew Allen
    Michael Chinen
    Miroslaw Narbutt
    QoMEX(2018)
    Preview abstract Streaming spatial audio over networks requires efficient encoding techniques that compress the raw audio content without compromising quality of experience. Streaming service providers such as YouTube need a perceptually relevant objective audio quality metric to monitor users' perceived quality and spatial localization accuracy. In this paper we introduce a full reference objective spatial audio quality metric, AMBIQUAL, which assesses both Listening Quality and Localization Accuracy. In our solution both metrics are derived directly from the B-format Ambisonic audio. The metric extends and adapts the algorithm used in ViSQOLAudio, a full reference objective metric designed for assessing speech and audio quality. In particular, Listening Quality is derived from the omnidirectional channel and Localization Accuracy is derived from a weighted sum of similarity from B-format directional channels. This paper evaluates whether the proposed AMBIQUAL objective spatial audio quality metric can predict two factors, Listening Quality and Localization Accuracy, by comparing its predictions with results from MUSHRA subjective listening tests. In particular, we evaluated the Listening Quality and Localization Accuracy of First and Third-Order Ambisonic audio compressed with the OPUS 1.2 codec at various bitrates (i.e., 32 and 128 kbps, and 256 and 512 kbps, respectively). The sample set for the tests comprised both recorded and synthetic audio clips with a wide range of time-frequency characteristics. To evaluate Localization Accuracy of compressed audio, a number of fixed and dynamic (moving vertically and horizontally) source positions were selected for the test samples. Results showed a strong correlation (PCC=0.919; Spearman=0.882 regarding Listening Quality and PCC=0.854; Spearman=0.842 regarding Localization Accuracy) between objective quality scores derived from the B-format Ambisonic audio using AMBIQUAL and subjective scores obtained during MUSHRA listening tests. AMBIQUAL displays very promising quality assessment predictions for spatial audio. Future work will optimise the algorithm to generalise and validate it for any Higher Order Ambisonic formats. View details
    Phase-sensitive Joint Learning Algorithms for Deep Learning-based Speech Enhancement
    Hong-Goo Kang
    Jinkyu Lee
    Turaj Zakizadeh Shabestary
    IEEE Signal Processing Letters, 25 (8)(2018), pp. 1276-1280
    Preview abstract This letter presents a phase-sensitive joint learning algorithm for single-channel speech enhancement. Although a deep learning framework that estimates time-frequency (T-F) domain ideal ratio masks demonstrates a strong performance, it is limited in that the enhancement process is performed only in the magnitude domain, while the phase spectra remain unchanged. Thus, recent studies have been conducted to involve phase spectra in speech enhancement systems. A phase-sensitive mask (PSM) is a T-F mask that implicitly represents phase-related information. However, since the PSM has an unbounded value, the networks are trained to target its truncated values rather than directly estimating it. To effectively train the PSM, we first approximate it to have a bounded dynamic range under the assumption that speech and noise are uncorrelated. We then propose a joint learning algorithm that trains the approximated value through its parameterized variables in order to minimize the inevitable error caused by the truncation process. Specifically, we design a network that explicitly targets three parameterized variables: speech magnitude spectra, noise magnitude spectra, and phase difference of clean to noisy spectra. To further improve the performance, we also investigate how the dynamic range of magnitude spectra controlled by a warping function affects the final performance in joint learning algorithms. Finally, we examined how the proposed additional constraint that preserves the sum of the estimated speech and noise power spectra affects the overall system performance. The experimental results show that the proposed learning algorithm outperforms the conventional learning algorithm with the truncated phase-sensitive approximation. View details
    Incoherent idempotent ambisonics rendering
    W. Bastiaan Kleijn
    Andrew Allen
    2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics(2017)
    Preview abstract We describe a family of ambisonics rendering methods that is based on optimizing the soundfield component that lies in the null space of the operator that maps the loudspeaker signals onto the given ambisonics representation. In contrast to traditional rendering approaches, the new method avoids the coherent addition of loudspeaker contributions to the sound field. As a result, it provides a space-invariant timbre and good spatial directionality outside the spatial region where the ambisonics soundfield description is accurate. The new method is idempotent at all frequencies and has relatively low computational complexity. Our experimental results confirm the effectiveness of the method. View details
    Wavenet based low rate speech coding
    W. Bastiaan Kleijn
    Alejandro Luebs
    Florian Stimberg
    Thomas C. Walters
    arXiv preprint arXiv:1712.01120(2017)
    Preview abstract Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model. View details
    Joint Wideband Source Localization and Acquisition Based on a Grid-Shift Approach
    Christos Tzagkarakis
    Bastiaan Kleijn
    2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics(2017)
    Preview abstract This paper addresses the problem of joint wideband localization and acquisition of acoustic sources. The source locations as well as acquisition of the original source signals are obtained in a joint fashion by solving a sparse recovery problem. Spatial sparsity is enforced by discretizing the acoustic scene into a grid of predefined dimensions. In practice, energy leakage from the source location to the neighboring grid points is expected to produce spurious location estimates, since the source location will not coincide with one of the grid points. To alleviate this problem we introduce the concept of grid-shift. A particular source is then near a point on the grid in at least one of a set of shifted grids. For the selected grid, other sources will generally not be on a grid point, but their energy is distributed over many points. A large number of experiments on real speech signals show the localization and acquisition effectiveness of the proposed approach under clean, noisy and reverberant conditions. View details
    Preview abstract This paper presents a practically efficient implementation for nonlinear acoustic echo cancellation (NAEC). The echo path is modeled by a novel hybrid Taylor-Volterra pre-processor followed by a linear FIR filter. A cascaded block RLS and unconstrained FLMS adaptive algorithm is developed to jointly identify the pre-processor and the FIR filter. This implementation is validated via simulations. View details
    Preview abstract This paper presents a new paradigm for acoustic echo control on mobile Android devices. The echo path on these devices has nonlinearities including not only the results of overdriven power amplifiers and miniaturized loudspeakers, but also those caused by a hardware audio dynamic range compressor (ADRC). While the former form of nonlinearities has been widely investigated in past research, the latter has not yet been taken into account. The ADRC adds extra gains to the echo path and turns it into a fast time-varying system. This presents a great challenge to traditional (both linear and nonlinear) echo cancellation systems. Here we propose a novel bi-magnitude processing framework, which is based on a two-state model for the echo path. The algorithm can deal with the ADRC problem and offers robust control for identification of input nonlinearities. The performance of the proposed approach is evaluated on recordings made in an anechoic chamber using real Android devices. View details
    GLOBALLY OPTIMIZED LEAST-SQUARES POST-FILTERING FOR MICROPHONE ARRAY SPEECH ENHANCEMENT
    Yiteng (Arden) Huang
    Alejandro Luebs
    W. Bastiaan Kleijn
    Proc. ICASSP(2016), pp. 380-384
    Preview abstract Existing post-filtering techniques for microphone array speech enhancement have two common deficiencies. First, they assume that the noise is either white or diffuse and cannot deal with point interferers. Second, they estimate the post-filter coefficients using only two microphones at a time and then perform averaging over all microphone pairs, yielding a suboptimal solution at best. In this paper, we present a novel post-filtering algorithm that alleviates the first limitation by using a more generalized signal model including not only white and diffuse but also point interferers, and overcomes the second deficiency by offering a globally optimized least-squares solution over all microphones. It is shown by simulations that the proposed method outperforms the existing algorithms in many different acoustic scenarios. View details
    Robust Estimation of Reverberation Time Using Polynomial Roots
    Ian Kelly
    Francis Boland
    AES 60th Conference on Dereverberation and Reverberation of Audio, Music, and Speech, Google Ireland Ltd.(2016)
    Preview abstract This paper further investigates previous findings that coefficients of acoustic responses can be modelled as random polynomials with certain constraints applied. In the case of room impulse responses, the median value of their clustered roots has been shown to be directly related to the reverberation time of the room. In this paper we examine the frequency dependency of reverberation time and we also demonstrate the method’s robustness to truncation of impulse responses. View details
    ON PRE-FILTERING STRATEGIES FOR THE GCC-PHAT ALGORITHM
    Hong-Goo Kang
    Michael Graczyk
    International Workshop on Acoustic Signal Enhancement 2016 (IWAENC 2016)
    Preview
    ViSQOL: an objective speech quality model
    Andrew Hines
    Anil Kokaram
    Naomi Harte
    EURASIP Journal on Audio, Speech, and Music Processing, 2015 (13)(2015), pp. 1-18
    Preview abstract This paper presents an objective speech quality model, ViSQOL, the Virtual Speech Quality Objective Listener. It is a signal-based, full-reference, intrusive metric that models human speech quality perception using a spectro-temporal measure of similarity between a reference and a test speech signal. The metric has been particularly designed to be robust for quality issues associated with Voice over IP (VoIP) transmission. This paper describes the algorithm and compares the quality predictions with the ITU-T standard metrics PESQ and POLQA for common problems in VoIP: clock drift, associated time warping, and playout delays. The results indicate that ViSQOL and POLQA significantly outperform PESQ, with ViSQOL competing well with POLQA. An extensive benchmarking against PESQ, POLQA, and simpler distance metrics using three speech corpora (NOIZEUS, E4, and the ITU-T P.Sup. 23 database) is also presented. These experiments benchmark the performance for a wide range of quality impairments, including VoIP degradations, a variety of background noise types, speech enhancement methods, and SNR levels. The results and subsequent analysis show that both ViSQOL and POLQA have some performance weaknesses and under-predict perceived quality in certain VoIP conditions. Both have a wider application and robustness to conditions than PESQ or more trivial distance metrics. ViSQOL is shown to offer a useful alternative to POLQA in predicting speech quality in VoIP scenarios. View details
    ViSQOLAudio: An objective audio quality metric for low bitrate codecs
    Andrew Hines
    Eoin Gillen
    Anil Kokaram
    Naomi Harte
    The Journal of the Acoustical Society of America, 137 (6)(2015), EL449-EL455
    Preview abstract Streaming services seek to optimise their use of bandwidth across audio and visual channels to maximise the quality of experience for users. This letter evaluates whether objective quality metrics can predict the audio quality for music encoded at low bitrates by comparing objective predictions with results from listener tests. Three objective metrics were benchmarked: PEAQ, POLQA, and ViSQOLAudio. The results demonstrate objective metrics designed for speech quality assessment have a strong potential for quality assessment of low bitrate audio codecs. View details
    An Analysis of the Effect of Larynx-Synchronous Averaging on Dereverberation of Voiced Speech
    Alastair H Moore
    Patrick A Naylor
    Proceedings of European Signal Processing Conference (EUSIPCO) 2014
    Preview
    Sinusoidal Interpolation Across Missing Data
    W. Bastiaan Kleijn
    Turaj Zakizadeh Shabestary
    International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), pp. 71-75
    Preview
    Robustness of Speech Quality Metrics to Background Noise and Network Degradations: Comparing VISQOL, PESQ and POLQA
    Andrew Hines
    Anil Kokaram
    Naomi Harte
    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE(2013), pp. 3697-3701
    Preview
    Monitoring the Effects of Temporal Clipping on VoIP Speech Quality
    Andrew Hines
    Anil Kokaram
    Naomi Harte
    Interspeech 2013, pp. 1188-1192
    Preview
    Rate-Distortion Optimization for Multichannel Audio Compression
    Minyue Li
    W. Bastiaan Kleijn
    2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
    Preview abstract Multichannel audio coding is studied from a rate-distortion theoretical viewpoint. Two practical coding techniques, both of which are based on rate-distortion optimization, are also proposed. The first technique decorrelates a multichannel signal hierarchically using elementary unitary transforms. The second method rearranges a multichannel signal into sub-signals and compresses them at optimized bit rates using a conventional codec. Both objective and subjective tests were conducted to illustrate the efficiency of the methods. View details
    IMPROVED PREDICTION OF NEARLY-PERIODIC SIGNALS
    Bastiaan Kleijn
    International Workshop on Acoustic Signal Enhancement 2012 (IWAENC2012)
    Preview
    VISQOL: THE VIRTUAL SPEECH QUALITY OBJECTIVE LISTENER
    Andrew Hines
    Anil Kokaram
    Naomi Harte
    International Workshop on Acoustic Signal Enhancement 2012 (IWAENC2012)
    Preview
    Summary of Opus listening test results
    Christian Hoene
    Jean-Marc Valin
    Koen Vos
    IETF, IETF(2011)
    Preview abstract This document describes and examines listening test results obtained for the Opus codec and how they relate to the requirements. View details
    Voice over IP: Speech Transmission over Packet Networks
    Ermin Kozica
    Jan Linden
    Roar Hagen
    W. Bastiaan Kleijn
    Handbook of Speech Processing, Springer, Heidelberg(2008), 307–330
    iLBC - A Linear Predictive Coder with Robustness to Packet Losses
    Søren Vang Andersen
    W Bastiaan Kleijn
    Roar Hagen
    Jan Linden
    Manohar N. Murthi
    2002 IEEE Speech Coding Workshop, IEEE