Jan Skoglund

Jan Skoglund received his Ph.D. degree from Chalmers University of Technology, Sweden. From 1999 to 2000, he worked on low-bit-rate speech coding at AT&T Labs-Research, Florham Park, NJ. From 2000 to 2011 he was with Global IP Solutions (GIPS), San Francisco, CA, working on speech and audio processing tailored for packet-switched networks. GIPS' audio and video technology was deployed by companies such as IBM, Google, Yahoo, WebEx, Skype, and Samsung. Since Google's acquisition of GIPS in 2011, he has been part of Chrome at Google, Inc. He leads a team in San Francisco, CA, that develops speech and audio signal processing components for capture, real-time communication, storage, and rendering.
Authored Publications
    Neural Speech and Audio Coding
    Minje Kim
    IEEE Signal Processing Magazine (2024) (to appear)
    Abstract: This paper explores the integration of model-based and data-driven approaches within the realm of neural speech and audio coding systems. It highlights the challenges posed by the subjective evaluation processes of speech and audio codecs and discusses the limitations of purely data-driven approaches, which often require inefficiently large architectures to match the performance of model-based methods. The study presents hybrid systems as a viable solution, offering significant improvements to the performance of conventional codecs through meticulously chosen design enhancements. Specifically, it introduces a neural network-based signal enhancer designed to post-process existing codecs' output, along with autoencoder-based end-to-end models and LPCNet, hybrid systems that combine linear predictive coding (LPC) with neural networks. Furthermore, the paper delves into predictive models operating within custom feature spaces (TF-Codec) or predefined transform domains (MDCTNet) and examines the use of psychoacoustically calibrated loss functions to train end-to-end neural audio codecs. Through these investigations, the paper demonstrates the potential of hybrid systems to advance the field of speech and audio coding by bridging the gap between traditional model-based approaches and modern data-driven techniques.
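    The mention of psychoacoustically calibrated loss functions can be made concrete with a toy sketch. The snippet below is an illustration only, not the objective used in any of the cited codecs: it weights the spectral reconstruction error by a crude masking proxy obtained by smoothing the reference power spectrum across frequency. The function name, STFT settings, and smoothing kernel are all assumptions.

```python
import torch
import torch.nn.functional as F

def psychoacoustic_proxy_loss(reference, decoded, n_fft=1024, hop=256, eps=1e-6):
    """Toy frequency-weighted spectral loss: reconstruction error power divided
    by a crude masking proxy (a frequency-smoothed reference power spectrum).
    Illustrative only; real psychoacoustic models are far more elaborate."""
    window = torch.hann_window(n_fft)
    ref = torch.stft(reference, n_fft, hop_length=hop, window=window,
                     return_complex=True).abs() ** 2      # (freq, frames)
    dec = torch.stft(decoded, n_fft, hop_length=hop, window=window,
                     return_complex=True).abs() ** 2
    # Smooth the reference power spectrum across frequency as a stand-in for a
    # masking-threshold computation.
    mask = F.avg_pool1d(ref.transpose(0, 1).unsqueeze(1), kernel_size=9,
                        stride=1, padding=4).squeeze(1).transpose(0, 1)
    return torch.mean((dec - ref) ** 2 / (mask + eps))
```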
    SCOREQ: Speech Quality Assessment with Contrastive Regression
    Alessandro Ragano
    Andrew Hines
    NeurIPS (2024) (to appear)
    Abstract: In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain-generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. In the paper we: (i) illustrate the problem of L2 loss training failing to capture the continuous nature of the mean opinion score (MOS) labels; (ii) demonstrate the lack of generalisation through a benchmarking evaluation across several speech domains; (iii) outline our approach and explore the impact of the architectural design decisions through incremental evaluation; (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ. We conclude that using a triplet loss function for contrastive regression improves the generalisation of no-reference speech quality prediction.
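    The core idea of a triplet loss for contrastive regression can be sketched as follows; this is not the SCOREQ formulation itself, and the random triplet sampling, margin, and function name are assumptions. Within each sampled triplet, the candidate whose MOS label is closer to the anchor's is treated as the positive, so embedding distances are pushed to respect the ordering of label distances.

```python
import torch
import torch.nn.functional as F

def mos_triplet_loss(embeddings, mos, num_triplets=256, margin=0.3):
    """Illustrative triplet loss for contrastive regression over MOS labels:
    the sample whose MOS is closer to the anchor's must also be closer in
    embedding space, by at least `margin`."""
    n = embeddings.shape[0]
    a, p, q = torch.randint(0, n, (3, num_triplets))    # random candidate triplets
    closer = (mos[p] - mos[a]).abs() <= (mos[q] - mos[a]).abs()
    pos = torch.where(closer, p, q)                     # candidate nearer in MOS
    neg = torch.where(closer, q, p)                     # candidate farther in MOS
    d_pos = F.pairwise_distance(embeddings[a], embeddings[pos])
    d_neg = F.pairwise_distance(embeddings[a], embeddings[neg])
    return F.relu(d_pos - d_neg + margin).mean()
```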
    NOMAD (Non-Matching Audio Distance)
    Abstract: This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed through the Euclidean distance of their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis (e.g., quality assessment) and for generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD is a promising technique that mimics the human ability to assess audio quality with non-matching references, learning perceptual embeddings without the need for human-generated labels.
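    At inference, the abstract describes the NOMAD score as the Euclidean distance between learned embeddings of the two signals. A minimal sketch of that step is shown below; `embedder` is a hypothetical stand-in for the pretrained NOMAD embedding network.

```python
import torch

def nomad_style_distance(embedder, test_signal, nonmatching_reference):
    """Illustrative non-matching-reference score: embed both signals with a
    (hypothetical) pretrained embedder and return their Euclidean distance."""
    with torch.no_grad():
        e_test = embedder(test_signal)
        e_ref = embedder(nonmatching_reference)
    return torch.linalg.norm(e_test - e_ref).item()
```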
    Multi-Channel Audio Signal Generation
    W. Bastiaan Kleijn
    Michael Chinen
    ICASSP 2023 (2023)
    Abstract: We present a multi-channel audio signal generation scheme based on machine learning and probabilistic modeling. We start from modeling a multi-channel single-source signal. Such signals are naturally modeled as a single-channel reference signal and a spatial-arrangement (SA) model specified by an SA parameter sequence. We focus on the SA model and assume that the reference signal is described by some parameter sequence. The SA model parameters are described with a learned probability distribution that is conditioned on the reference-signal parameter sequence and, optionally, an SA conditioning sequence. If present, the SA conditioning sequence specifies a signal class or a specific signal. The single-source method can be used for multi-source signals by applying source separation or by using an SA model that operates on non-overlapping frequency bands. Our GAN-based stereo coding implementation of the latter approach shows that our paradigm facilitates plausible high-quality rendering at a low bit rate for the SA conditioning.
    LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
    Bastiaan Kleijn
    Michael Chinen
    Neil Zeghidour
    Teerapat Jenrungrot
    ICASSP 2023 (2023)
    Abstract: We introduce LMCodec, a fully causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec.
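    The coarse-to-fine token hierarchy comes from residual vector quantization. The sketch below illustrates the general residual-VQ idea rather than LMCodec's exact quantizer; the tensor shapes and helper name are assumptions.

```python
import torch

def residual_vector_quantize(latent, codebooks):
    """Illustrative residual VQ: each codebook quantizes the residual left by
    the previous stage, producing a coarse-to-fine list of token indices."""
    residual = latent                                   # (frames, dim)
    quantized = torch.zeros_like(latent)
    indices = []
    for codebook in codebooks:                          # each: (codebook_size, dim)
        dists = torch.cdist(residual.unsqueeze(0), codebook.unsqueeze(0)).squeeze(0)
        idx = dists.argmin(dim=-1)                      # nearest code per frame
        chosen = codebook[idx]
        indices.append(idx)
        quantized = quantized + chosen
        residual = residual - chosen
    return indices, quantized
```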
    Context-Based Evaluation of the Opus Audio Codec for Spatial Audio Content in Virtual Reality
    Ben Lee
    Tomasz Rudzki
    Gavin Kearney
    Journal of the Audio Engineering Society, 2023 April - Volume 71 Number 4 (2023)
    A High-rate Extension to SoundStream
    Andrew Storus
    Hong-Goo Kang
    Yero Yeh
    2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2023)
    Abstract: In this paper, we propose a high-rate extension of the SoundStream codec, which is able to generate almost transparent quality audio at 16 kbps for wideband speech signals. SoundStream shows reasonably good performance at low bit-rates (e.g. around 9 kbps), but its performance does not improve much when using more bits for encoding the latent embeddings. Motivated by experimental results showing that neural audio codec performance is highly related to the characteristics of latent embeddings such as dimensionality, dependency, and probability density function shape, we propose a convolutional transformer architecture and an attention-based multi-scale latent decomposition method that significantly enhances codec performance when quantizing high-dimensional embeddings. Experimental results show the superiority of our proposed model over conventional approaches.
    Twenty-Five Years of Evolution in Speech and Language Processing
    Michael Picheny
    Dilek Hakkani-Tur
    IEEE Signal Processing Magazine, 40 (2023), pp. 27-39
    Convolutional Transformer for Neural Speech Coding
    Hong-Goo Kang
    Bastiaan Kleijn
    Michael Chinen
    Audio Engineering Society Convention 155 (2023)
    Abstract: In this paper, we propose a Convolutional-Transformer speech codec (ConvT-SC) which utilizes stacks of convolutions and self-attention layers to remove redundant information at the downsampling and upsampling blocks of a U-Net-style encoder-decoder neural codec architecture. We design the Transformers to use channel and temporal attention with any number of attention stages and heads while maintaining causality. This allows us to take into consideration the characteristics of the input vectors and flexibly utilize temporal and channel-wise relationships at different scales when encoding the salient information that is present in speech. This enables our model to reduce the dimensionality of its latent embeddings and improve its quantization efficiency while maintaining quality. Experimental results demonstrate that our approach achieves significantly better performance than convolution-only baselines.
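    The causality constraint on the temporal-attention path can be illustrated with a single-head sketch; this is not the ConvT-SC block itself, and the projection matrices are placeholders. Each frame is allowed to attend only to itself and to earlier frames.

```python
import torch
import torch.nn.functional as F

def causal_temporal_attention(x, w_q, w_k, w_v):
    """Illustrative single-head causal attention over a (time, channels) tensor:
    an upper-triangular mask prevents attention to future frames."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    t = x.shape[0]
    future = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))
    return F.softmax(scores, dim=-1) @ v
```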
    Using Rater and System Metadata to Explain Variance in the VoiceMOS Dataset
    Alessandro Ragano
    Andrew Hines
    Chandan K. Reddy
    Michael Chinen
    Interspeech 2022
    Abstract: Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance in subjective ratings of speech quality that can be explained from metadata and from the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers, and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and an MSE of 0.088 at the system level, and 0.877 and 0.198 at the utterance level. Using data and metadata that the test had restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction, owing to the wide variation in the number of utterances used for each system in the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample-mean error, and should be relatively balanced in utterance count between systems; otherwise, the utterance-level metrics may be more reliable and interpretable.
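    The system-level figures quoted above are typically obtained by averaging utterance-level predictions per system before scoring; a sketch under that assumed protocol is shown below. The abstract's caveat about widely varying utterance counts is exactly why such per-system means can be unreliable.

```python
import numpy as np
from scipy.stats import spearmanr

def system_level_metrics(system_ids, mos_true, mos_pred):
    """Illustrative system-level SRCC and MSE: average utterance-level scores
    per system, then compare the per-system means (assumed protocol)."""
    systems = sorted(set(system_ids))
    true_means = np.array([np.mean([t for s, t in zip(system_ids, mos_true) if s == sys])
                           for sys in systems])
    pred_means = np.array([np.mean([p for s, p in zip(system_ids, mos_pred) if s == sys])
                           for sys in systems])
    srcc, _ = spearmanr(true_means, pred_means)
    mse = float(np.mean((true_means - pred_means) ** 2))
    return srcc, mse
```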