Jan Skoglund
Jan Skoglund received his Ph.D. degree from Chalmers University of Technology, Sweden. From 1999 to 2000, he worked on low bit rate speech coding at AT&T Labs-Research, Florham Park, NJ. From 2000 to 2011 he was with Global IP Solutions (GIPS), San Francisco, CA, working on speech and audio processing tailored for packet-switched networks. GIPS' audio and video technology was deployed by, e.g., IBM, Google, Yahoo, WebEx, Skype, and Samsung. Since the 2011 acquisition of GIPS he has been part of Chrome at Google, Inc., where he leads a team in San Francisco, CA, developing speech and audio signal processing components for capture, real-time communication, storage, and rendering.
Authored Publications
This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks such as quality assessment, as well as in generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching-reference approaches in both ranking degradation intensity and quality assessment, and exhibits competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics the human ability to assess audio quality against non-matching references, learning perceptual embeddings without the need for human-generated labels.
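As a rough sketch of the two ingredients named above, a triplet loss over learned embeddings and a plain Euclidean distance at inference, the following PyTorch fragment shows the shape of the computation. The embedding network, the margin value, and the NSIM-guided selection of positives and negatives are assumptions for illustration and are not taken from the paper.

import torch
import torch.nn.functional as F

def nsim_guided_triplet_loss(anchor, positive, negative, margin=0.5):
    # Triplet loss over embeddings; which sample plays the positive role is
    # decided beforehand from NSIM-rated degradation intensity (not shown).
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def nomad_distance(emb_a, emb_b):
    # At inference, the non-matching similarity score is simply the
    # Euclidean distance between the two embeddings.
    return torch.linalg.norm(emb_a - emb_b, dim=-1)

# Illustrative usage with random stand-in embeddings of dimension 128;
# a real system would produce these with the learned encoder.
a, p, n = (torch.randn(8, 128) for _ in range(3))
loss = nsim_guided_triplet_loss(a, p, n)
score = nomad_distance(a[0], p[0])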
This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs and discusses the limitations of purely data-driven approaches, which often require inefficiently large architectures to match the performance of model-based methods. The study presents hybrid systems as a viable solution, offering significant improvements to the performance of conventional codecs through carefully chosen design enhancements. Specifically, it introduces a neural network-based signal enhancer designed to post-process the output of existing codecs, along with autoencoder-based end-to-end models and LPCNet, hybrid systems that combine linear predictive coding (LPC) with neural networks. Furthermore, the paper delves into predictive models operating within custom feature spaces (TF-Codec) or predefined transform domains (MDCTNet) and examines the use of psychoacoustically calibrated loss functions to train end-to-end neural audio codecs. Through these investigations, the paper demonstrates the potential of hybrid systems to advance the field of speech and audio coding by bridging the gap between traditional model-based approaches and modern data-driven techniques.
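As one concrete reading of the post-processing idea mentioned above, the sketch below applies a small convolutional enhancer to a conventional codec's decoded output and predicts a residual correction. The architecture, its size, and the residual formulation are illustrative assumptions rather than the enhancer described in the paper.

import torch
import torch.nn as nn

class PostFilter(nn.Module):
    """Toy neural post-filter applied to the decoded output of an existing codec."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=9, padding=4),
        )

    def forward(self, decoded):
        # Predict a residual correction and add it back to the codec output.
        return decoded + self.net(decoded)

# decoded = torch.randn(1, 1, 16000)   # one second of decoded wideband speech
# enhanced = PostFilter()(decoded)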
In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain-generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. In the paper we: (i) illustrate the problem of L2 loss training failing to capture the continuous nature of the mean opinion score (MOS) labels; (ii) demonstrate the lack of generalisation through a benchmarking evaluation across several speech domains; (iii) outline our approach and explore the impact of the architectural design decisions through incremental evaluation; (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ. We conclude that using a triplet loss function for contrastive regression improves generalisation across speech domains.
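A minimal sketch of the contrastive-regression idea follows: of two comparison samples, the one whose MOS label is closer to the anchor's is treated as the positive. The embedding model, margin, and sampling scheme are assumptions for illustration and do not reproduce the SCOREQ training setup.

import torch
import torch.nn.functional as F

def mos_triplet_loss(emb, mos, margin=0.2):
    """emb: (3, D) embeddings of an anchor and two comparison samples.
    mos: (3,) MOS labels in the same order."""
    anchor, cand_a, cand_b = emb[0], emb[1], emb[2]
    # The candidate with the smaller MOS gap to the anchor acts as the positive.
    if torch.abs(mos[1] - mos[0]) <= torch.abs(mos[2] - mos[0]):
        pos, neg = cand_a, cand_b
    else:
        pos, neg = cand_b, cand_a
    d_pos = torch.linalg.norm(anchor - pos)
    d_neg = torch.linalg.norm(anchor - neg)
    return F.relu(d_pos - d_neg + margin)

# emb = torch.randn(3, 128)
# mos = torch.tensor([3.2, 3.0, 1.5])
# loss = mos_triplet_loss(emb, mos)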
We present a multi-channel audio signal generation scheme based on machine learning and probabilistic modeling. We start from modeling a multi-channel single-source signal. Such signals are naturally modeled as a single-channel reference signal and a spatial-arrangement (SA) model specified by an SA parameter sequence. We focus on the SA model and assume that the reference signal is described by some parameter sequence. The SA model parameters are described with a learned probability distribution that is conditioned on the reference-signal parameter sequence and, optionally, an SA conditioning sequence. If present, the SA conditioning sequence specifies a signal class or a specific signal. The single-source method can be used for multi-source signals by applying source separation or by using an SA model that operates on non-overlapping frequency bands. Our GAN-based stereo coding implementation of the latter approach shows that our paradigm facilitates plausible high-quality rendering at a low bit rate for the SA conditioning.
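To make the reference-plus-SA-parameters split concrete, the NumPy sketch below derives a mono reference and per-band inter-channel level and phase differences from a stereo pair. This fixed ILD/IPD parameterization and the STFT settings are illustrative assumptions; they stand in for, and are not, the learned SA model of the paper.

import numpy as np

def stereo_to_reference_and_sa(left, right, n_fft=512, hop=256):
    """Split a stereo pair into a mono reference and simple SA parameters."""
    def stft(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft, hop)]
        return np.fft.rfft(np.array(frames), axis=-1)

    L, R = stft(left), stft(right)
    reference = 0.5 * (left + right)                              # mid signal as reference
    ild = 20 * np.log10((np.abs(L) + 1e-9) / (np.abs(R) + 1e-9))  # level difference (dB)
    ipd = np.angle(L * np.conj(R))                                # phase difference (rad)
    return reference, ild, ipd

# left, right = np.random.randn(16000), np.random.randn(16000)
# ref, ild, ipd = stereo_to_reference_and_sa(left, right)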
Convolutional Transformer for Neural Speech Coding
Hong-Goo Kang
Bastiaan Kleijn
Michael Chinen
Audio Engineering Society Convention 155 (2023)
In this paper, we propose a Convolutional-Transformer speech codec (ConvT-SC) that uses stacks of convolutions and self-attention layers to remove redundant information in the downsampling and upsampling blocks of a U-Net-style encoder-decoder neural codec architecture. We design the Transformers to use channel and temporal attention with any number of attention stages and heads while maintaining causality. This allows the model to take the characteristics of the input vectors into account and to flexibly exploit temporal and channel-wise relationships at different scales when encoding the salient information present in speech, which in turn lets it reduce the dimensionality of its latent embeddings and improve its quantization efficiency while maintaining quality. Experimental results demonstrate that our approach achieves significantly better performance than convolution-only baselines.
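A minimal sketch of one causal temporal self-attention stage of the kind such a codec stacks with convolutions is shown below. The dimensions, the single attention stage, and the residual/LayerNorm arrangement are illustrative assumptions, not the ConvT-SC design.

import torch
import torch.nn as nn

class CausalTemporalAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):            # x: (batch, time, dim)
        t = x.shape[1]
        # Upper-triangular mask so each frame attends only to current and past frames.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return self.norm(x + out)    # residual connection preserves the input path

# x = torch.randn(2, 100, 256)      # (batch, frames, features)
# y = CausalTemporalAttention()(x)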
Context-Based Evaluation of the Opus Audio Codec for Spatial Audio Content in Virtual Reality
Ben Lee
Tomasz Rudzki
Gavin Kearney
Journal of the Audio Engineering Society, 2023 April - Volume 71 Number 4 (2023)
Twenty-Five Years of Evolution in Speech and Language Processing
Michael Picheny
Dilek Hakkani-Tur
IEEE Signal Processing Magazine, 40 (2023), pp. 27-39
A High-rate Extension to SoundStream
Andrew Storus
Hong-Goo Kang
Yero Yeh
2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2023)
In this paper, we propose a high-rate extension of the SoundStream codec that generates nearly transparent quality audio at 16 kbps for wideband speech signals. SoundStream shows reasonably good performance at low bit rates (e.g., around 9 kbps), but its performance does not improve much when more bits are used to encode the latent embeddings. Motivated by experimental results showing that neural audio codec performance is highly related to characteristics of the latent embeddings such as dimensionality, dependency, and the shape of the probability density function, we propose a convolutional transformer architecture and an attention-based multi-scale latent decomposition method that significantly enhances codec performance when quantizing high-dimensional embeddings. Experimental results show the superiority of our proposed model over conventional approaches.
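The sketch below illustrates the general idea of decomposing a high-dimensional latent into lower-dimensional sub-vectors that are quantized separately. The fixed, equal-size split and the codebook sizes are assumptions for illustration; they stand in for the paper's attention-based multi-scale decomposition.

import torch

def quantize_decomposed(latent, codebooks):
    """latent: (D,) embedding; codebooks: list of (K, d) tensors with the d's summing to D."""
    codes, quantized = [], []
    start = 0
    for cb in codebooks:
        sub = latent[start:start + cb.shape[1]]
        idx = torch.argmin(torch.cdist(sub[None, :], cb), dim=-1)  # nearest codebook entry
        codes.append(idx.item())
        quantized.append(cb[idx.item()])
        start += cb.shape[1]
    return codes, torch.cat(quantized)

# Example: a 64-dim latent split into four 16-dim sub-vectors with 256-entry codebooks.
# codebooks = [torch.randn(256, 16) for _ in range(4)]
# codes, q = quantize_decomposed(torch.randn(64), codebooks)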
LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
Bastiaan Kleijn
Michael Chinen
Neil Zeghidour
Teerapat Jenrungrot
ICASSP 2023 (2023)
We introduce LMCodec, a fully causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec.
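To show how a predictive model translates into a bit budget, the sketch below computes the ideal entropy-coding cost of a token sequence from predicted next-token probabilities. The uniform probability model and the 1024-entry codebook are stand-ins for illustration, not the second Transformer described above.

import math

def bits_for_sequence(tokens, prob_model):
    """prob_model(past) returns a dict mapping each candidate token to its
    predicted probability given the already-transmitted tokens."""
    total_bits = 0.0
    for i, tok in enumerate(tokens):
        p = prob_model(tokens[:i]).get(tok, 1e-9)
        total_bits += -math.log2(p)          # ideal arithmetic-coding cost
    return total_bits

# Example with a trivial model that always predicts a uniform distribution
# over a 1024-entry codebook, i.e., 10 bits per token.
uniform = lambda past: {t: 1.0 / 1024 for t in range(1024)}
print(bits_for_sequence([3, 17, 512], uniform))   # -> 30.0 bits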
Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in performance over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. Our numerical experiments show that supplementing the encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of 600 bps that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. The subjective human evaluations also suggest that the perceived quality of the resulting codec is comparable or better than that of conventional codecs operating at 3 to 4 times the rate.
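A minimal sketch of supplementing a codec encoder with embeddings from a pretrained Transformer speech model before quantization is given below. Both the encoder and the "pretrained" feature extractor here are untrained placeholder layers, and all dimensions and layer choices are assumptions for illustration, not the models used in the paper.

import torch
import torch.nn as nn

class EncoderWithTransformerEmbeddings(nn.Module):
    def __init__(self, enc_dim=64, emb_dim=768, out_dim=64):
        super().__init__()
        self.codec_encoder = nn.Conv1d(1, enc_dim, kernel_size=320, stride=320)
        # Placeholder for a frozen, pretrained Transformer feature extractor.
        self.pretrained = nn.Conv1d(1, emb_dim, kernel_size=320, stride=320)
        self.fuse = nn.Linear(enc_dim + emb_dim, out_dim)

    def forward(self, wav):                       # wav: (batch, 1, samples)
        enc = self.codec_encoder(wav)             # (batch, enc_dim, frames)
        emb = self.pretrained(wav).detach()       # frozen embeddings, no gradient
        fused = torch.cat([enc, emb], dim=1)      # concatenate along channels
        return self.fuse(fused.transpose(1, 2))   # (batch, frames, out_dim)

# wav = torch.randn(1, 1, 16000)
# latent = EncoderWithTransformerEmbeddings()(wav)  # this latent would then be quantized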