Jan Skoglund
Jan Skoglund received his Ph.D. degree from Chalmers University of Technology, Sweden. From 1999
to 2000, he worked on low bit rate speech coding at AT&T Labs-Research, Florham Park, NJ. He was
with Global IP Solutions (GIPS), San Francisco, CA, from 2000 to 2011, working on speech and audio processing tailored for packet-switched networks. GIPS' audio and video technology was deployed by, e.g., IBM, Google, Yahoo, WebEx, Skype, and Samsung.
Since Google's acquisition of GIPS in 2011, he has been part of Chrome at Google, Inc. He leads a team in San Francisco, CA, developing speech and audio signal processing components for capture, real-time communication, storage, and rendering.
Authored Publications
This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio analysis tasks, e.g., quality assessment, and in generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, and is competitive with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics the human ability to assess audio quality against non-matching references, learning perceptual embeddings without the need for human-generated labels.
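The mechanics described above can be made concrete with a short sketch: an embedding network trained with a triplet loss, and a non-matching-reference distance computed as the Euclidean distance between embeddings. This is a minimal illustration with a toy encoder and assumed shapes, not the NOMAD implementation; the NSIM-guided selection of anchor, positive, and negative examples is only hinted at in the comments.

    import torch
    import torch.nn as nn

    # Toy embedding network standing in for NOMAD's learned feature extractor.
    # Input: batches of log-mel-like frames, shape (batch, frames, mel_bins).
    class ToyEmbedder(nn.Module):
        def __init__(self, mel_bins=64, emb_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(mel_bins, 256), nn.ReLU(), nn.Linear(256, emb_dim))

        def forward(self, x):
            # Average over frames to obtain one embedding per clip.
            return self.net(x).mean(dim=1)

    embedder = ToyEmbedder()
    triplet = nn.TripletMarginLoss(margin=1.0)

    # Training step: anchor and positive share a similar degradation intensity,
    # the negative is more (or less) degraded -- in NOMAD this ordering comes
    # from NSIM rather than from human labels.
    anchor, positive, negative = (torch.randn(8, 100, 64) for _ in range(3))
    loss = triplet(embedder(anchor), embedder(positive), embedder(negative))
    loss.backward()

    # Inference: the distance between any two clips (matching or not) is the
    # Euclidean distance between their embeddings.
    with torch.no_grad():
        d = torch.norm(embedder(torch.randn(1, 100, 64)) -
                       embedder(torch.randn(1, 100, 64)), dim=-1)
    print(float(d))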
We present a multi-channel audio signal generation scheme based on machine learning and probabilistic modeling. We start from modeling a multi-channel single-source signal. Such signals are naturally modeled as a single-channel reference signal and a spatial-arrangement (SA) model specified by an SA parameter sequence. We focus on the SA model and assume that the reference signal is described by some parameter sequence. The SA model parameters are described with a learned probability distribution that is conditioned on the reference-signal parameter sequence and, optionally, an SA conditioning sequence. If present, the SA conditioning sequence specifies a signal class or a specific signal. The single-source method can be used for multi-source signals by applying source separation or by using an SA model that operates on non-overlapping frequency bands. Our GAN-based stereo coding implementation of the latter approach shows that our paradigm facilitates plausible high-quality rendering at a low bit rate for the SA conditioning.
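As an illustration of the conditional SA model described above, the sketch below has a small network output the parameters of a per-frame distribution over SA parameters, conditioned on the reference-signal parameter sequence and an optional SA conditioning sequence. The paper's stereo-coding implementation is GAN-based; a diagonal-Gaussian likelihood is used here only because it is the simplest conditional density to write down, and all dimensions are assumptions.

    import torch
    import torch.nn as nn

    # Illustrative conditional density p(SA_params | reference_params, cond):
    # the network predicts a diagonal-Gaussian mean and log-variance per frame.
    class CondSAModel(nn.Module):
        def __init__(self, ref_dim=32, cond_dim=8, sa_dim=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(ref_dim + cond_dim, 128), nn.ReLU(),
                nn.Linear(128, 2 * sa_dim))   # mean and log-variance

        def forward(self, ref_params, cond):
            out = self.net(torch.cat([ref_params, cond], dim=-1))
            mu, log_var = out.chunk(2, dim=-1)
            return mu, log_var

    model = CondSAModel()
    ref = torch.randn(16, 50, 32)    # reference-signal parameter sequence
    cond = torch.randn(16, 50, 8)    # optional SA conditioning sequence
    sa = torch.randn(16, 50, 4)      # target spatial-arrangement parameters

    mu, log_var = model(ref, cond)
    # Negative log-likelihood (up to a constant) of the SA parameters under
    # the predicted Gaussian; this is the training objective of the sketch.
    nll = 0.5 * (log_var + (sa - mu) ** 2 / log_var.exp()).mean()
    nll.backward()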
Convolutional Transformer for Neural Speech Coding
Hong-Goo Kang
Bastiaan Kleijn
Michael Chinen
Audio Engineering Society Convention 155 (2023)
In this paper, we propose a Convolutional-Transformer speech codec (ConvT-SC) which utilizes stacks of convolutions and self-attention layers to remove redundant information at the downsampling and upsampling blocks of a U-Net-style encoder-decoder neural codec architecture. We design the Transformers to use channel and temporal attention with any number of attention stages and heads while maintaining causality. This allows us to take into consideration the characteristics of the input vectors and flexibly utilize temporal and channel-wise relationships at different scales when encoding the salient information that is present in speech. This enables our model to reduce the dimensionality of its latent embeddings and improve its quantization efficiency while maintaining quality. Experimental results demonstrate that our approach achieves significantly better performance than convolution-only baselines.
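The kind of attention block the abstract refers to can be sketched as follows: a causal temporal self-attention layer over a (batch, time, channels) feature sequence, with channel-wise attention obtained by applying attention across the channel axis. This is a generic sketch with assumed shapes and head counts, not the ConvT-SC architecture.

    import torch
    import torch.nn as nn

    # Minimal causal temporal self-attention over (batch, time, channels)
    # features; the causal mask keeps each frame from attending to the future,
    # as required for a streaming codec.
    class CausalTemporalAttention(nn.Module):
        def __init__(self, channels=64, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

        def forward(self, x):
            t = x.size(1)
            causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
            out, _ = self.attn(x, x, x, attn_mask=causal)
            return x + out   # residual connection

    x = torch.randn(2, 100, 64)              # (batch, frames, channels)
    y = CausalTemporalAttention()(x)

    # Channel-wise attention: let the same mechanism attend across the channel
    # axis by transposing in and out; no causal mask is needed because channels
    # have no temporal ordering.  (Here the embed dim equals the 100 frames of
    # this toy example.)
    channel = nn.MultiheadAttention(100, 4, batch_first=True)
    z, _ = channel(y.transpose(1, 2), y.transpose(1, 2), y.transpose(1, 2))
    z = y + z.transpose(1, 2)
    print(z.shape)   # torch.Size([2, 100, 64])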
A High-rate Extension to SoundStream
Andrew Storus
Hong-Goo Kang
Yero Yeh
2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2023)
In this paper, we propose a high-rate extension of the SoundStream codec, which is able to generate almost transparent quality audio at 16 kbps for wideband speech signals. SoundStream shows reasonably good performance at low bit-rates (e.g. around 9 kbps), but its performance does not improve much when using more bits for encoding the latent embeddings. Motivated by experimental results showing that neural audio codec performance is highly related to the characteristics of latent embeddings such as dimensionality, dependency, and probability density function shape, we propose a convolutional transformer architecture and an attention-based multi-scale latent decomposition method that significantly enhances codec performance when quantizing high-dimensional embeddings. Experimental results show the superiority of our proposed model over conventional approaches.
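The general idea of decomposing a high-dimensional latent before quantization can be illustrated with a simple sketch: the embedding is split into lower-dimensional sub-vectors, each quantized against its own small codebook, so the bit budget is spent on several easier low-dimensional quantizers. This is a generic nearest-neighbour illustration with made-up codebooks, not the paper's attention-based multi-scale decomposition.

    import numpy as np

    rng = np.random.default_rng(0)

    def nearest_code(vecs, codebook):
        """Index of the nearest codeword for each row of `vecs`."""
        d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    # A high-dimensional latent embedding per frame (dimension 64), split into
    # 4 sub-vectors of dimension 16, each with its own 256-entry codebook
    # (8 bits per sub-vector, 32 bits per frame in total).
    frames = rng.normal(size=(100, 64))
    sub_dim, n_sub, cb_size = 16, 4, 256
    codebooks = [rng.normal(size=(cb_size, sub_dim)) for _ in range(n_sub)]

    recon = np.zeros_like(frames)
    for i, cb in enumerate(codebooks):
        sub = frames[:, i * sub_dim:(i + 1) * sub_dim]
        idx = nearest_code(sub, cb)              # indices that would be transmitted
        recon[:, i * sub_dim:(i + 1) * sub_dim] = cb[idx]

    mse = np.mean((frames - recon) ** 2)
    print(f"quantization MSE: {mse:.3f}")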
Context-Based Evaluation of the Opus Audio Codec for Spatial Audio Content in Virtual Reality
Ben Lee
Tomasz Rudzki
Gavin Kearney
Journal of the Audio Engineering Society, 2023 April - Volume 71 Number 4 (2023)
LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
Bastiaan Kleijn
Michael Chinen
Neil Zeghidour
Teerapat Jenrungrot
ICASSP 2023 (2023)
We introduce LMCodec, a fully-causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec.
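The entropy-coding idea behind LMCodec can be sketched independently of the codec itself: a causal Transformer assigns a probability to each next token given the already-transmitted ones, and the negative log2-probability of the actual token is the ideal number of bits an entropy coder would spend on it. The model below is a toy stand-in with a placeholder vocabulary and width, not LMCodec's second Transformer.

    import math
    import torch
    import torch.nn as nn

    # Toy causal Transformer over a sequence of codec tokens: it predicts a
    # distribution for the next token; -log2 p(next token) is the ideal bit
    # cost under conditional entropy coding.
    vocab, width, seq_len = 1024, 128, 64
    embed = nn.Embedding(vocab, width)
    layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
    lm = nn.TransformerEncoder(layer, num_layers=2)
    head = nn.Linear(width, vocab)

    tokens = torch.randint(0, vocab, (1, seq_len))          # transmitted codes
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    hidden = lm(embed(tokens), mask=causal)
    logits = head(hidden)[:, :-1]                            # predict token t+1 from tokens <= t
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, tokens[:, 1:, None]).squeeze(-1)
    bits = nll.sum().item() / math.log(2.0)
    print(f"ideal entropy-coded size: {bits:.1f} bits for {seq_len - 1} tokens")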
Twenty-Five Years of Evolution in Speech and Language Processing
Michael Picheny
Dilek Hakkani-Tur
IEEE Signal Processing Magazine, 40 (2023), pp. 27-39
Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance in subjective ratings of speech quality that can be explained by metadata and by the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features, including rater groups and system identifiers, and obtained competitive metrics, including a Spearman rank correlation coefficient (SRCC) of 0.934 and an MSE of 0.088 at the system level, and 0.877 and 0.198 at the utterance level. Using data and metadata that the challenge restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction ability, as a result of the wide variation in the number of utterances used for each system in the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample-mean error and be relatively balanced in utterance count between systems; otherwise the utterance-level metrics may be more reliable and interpretable.
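The distinction between utterance-level and system-level metrics that this abstract turns on can be made concrete with a small example: system-level SRCC correlates per-system mean predictions with per-system mean subjective scores, so a system represented by only a handful of utterances carries the same weight as a heavily sampled one. The numbers below are synthetic and only illustrate the aggregation.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)

    # Made-up example: 3 systems with very different utterance counts.
    systems = ["A"] * 50 + ["B"] * 50 + ["C"] * 3
    mos = rng.uniform(1, 5, size=len(systems))          # subjective scores
    pred = mos + rng.normal(0, 0.5, size=len(systems))  # model predictions

    # Utterance-level SRCC: every rated utterance counts once.
    utt_srcc, _ = spearmanr(mos, pred)

    # System-level SRCC: correlate per-system means; system C's mean is
    # estimated from only 3 utterances yet carries the same weight as A or B.
    names = sorted(set(systems))
    sys_mos = [np.mean([m for s, m in zip(systems, mos) if s == n]) for n in names]
    sys_pred = [np.mean([p for s, p in zip(systems, pred) if s == n]) for n in names]
    sys_srcc, _ = spearmanr(sys_mos, sys_pred)

    print(f"utterance-level SRCC: {utt_srcc:.3f}, system-level SRCC: {sys_srcc:.3f}")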
Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in performance over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. Our numerical experiments show that supplementing the encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of 600 bps that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. The subjective human evaluations also suggest that the perceived quality of the resulting codec is comparable to or better than that of conventional codecs operating at 3 to 4 times the rate.
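The encoder supplementation described above amounts to concatenating embeddings from a frozen pretrained Transformer with the codec encoder's own features before quantization. The sketch below uses small stand-in modules with assumed frame rates and dimensions; it is not the actual codec or the actual pretrained model.

    import torch
    import torch.nn as nn

    # Illustrative encoder supplementation: features from the codec's own
    # convolutional encoder are concatenated with embeddings from a (frozen)
    # pretrained Transformer before quantization.
    class ConvEncoder(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.conv = nn.Conv1d(1, dim, kernel_size=320, stride=320)  # ~20 ms hops

        def forward(self, wav):                              # wav: (batch, samples)
            return self.conv(wav[:, None, :]).transpose(1, 2)  # (batch, frames, dim)

    class PretrainedSpeechTransformer(nn.Module):            # stand-in, used frozen
        def __init__(self, dim=256):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.proj = nn.Conv1d(1, dim, kernel_size=320, stride=320)
            self.body = nn.TransformerEncoder(layer, num_layers=2)

        @torch.no_grad()
        def forward(self, wav):
            return self.body(self.proj(wav[:, None, :]).transpose(1, 2))

    wav = torch.randn(2, 16000)                              # 1 s of 16 kHz speech
    codec_feats = ConvEncoder()(wav)                         # (2, 50, 64)
    transformer_emb = PretrainedSpeechTransformer()(wav)     # (2, 50, 256)
    fused = torch.cat([codec_feats, transformer_emb], dim=-1)  # fed to the quantizer
    print(fused.shape)                                       # torch.Size([2, 50, 320])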
Handling Background Noise in Neural Speech Generation
Tom Denton
Alejandro Luebs
Andrew Storus
Hengchin Ye
W. Bastiaan Kleijn
2020 Asilomar Conference on Signals, Systems, and Computers (2021)
Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason for this and discuss methods to overcome the issue. Placing a denoising preprocessing stage before feature extraction and using clean speech as the training target is shown to be the best-performing strategy.
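The best-performing strategy identified above can be sketched as a training step in which features are extracted from a denoised version of the noisy input while the loss targets the clean signal. The denoiser, feature extractor, and generator below are tiny placeholders chosen only to make the data flow explicit.

    import torch
    import torch.nn as nn

    # Sketch of the training setup: denoise, extract conditioning features,
    # generate speech, and compare against the *clean* target.
    denoiser = nn.Conv1d(1, 1, kernel_size=9, padding=4)        # stand-in denoiser
    featurizer = nn.Conv1d(1, 32, kernel_size=320, stride=160)  # stand-in features
    generator = nn.Sequential(                                  # stand-in vocoder
        nn.ConvTranspose1d(32, 1, kernel_size=320, stride=160))
    loss_fn = nn.L1Loss()

    clean = torch.randn(4, 1, 16000)
    noisy = clean + 0.3 * torch.randn_like(clean)

    denoised = denoiser(noisy)            # denoising preprocessing stage
    features = featurizer(denoised)       # conditioning features for generation
    generated = generator(features)       # synthesized speech

    # The target is the clean speech, not the noisy input.
    n = min(generated.size(-1), clean.size(-1))
    loss = loss_fn(generated[..., :n], clean[..., :n])
    loss.backward()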