Jan Skoglund

Jan Skoglund received his Ph.D. degree from Chalmers University of Technology, Sweden. From 1999 to 2000, he worked on low bit rate speech coding at AT&T Labs-Research, Florham Park, NJ. From 2000 to 2011, he was with Global IP Solutions (GIPS), San Francisco, CA, working on speech and audio processing tailored for packet-switched networks. GIPS' audio and video technology was used in many deployments by companies such as IBM, Google, Yahoo, WebEx, Skype, and Samsung. Since the 2011 acquisition of GIPS, he has been part of Chrome at Google, Inc. He leads a team in San Francisco, CA, developing speech and audio signal processing components for capture, real-time communication, storage, and rendering.
Authored Publications
    This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed through Euclidean distance of their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis (e.g., quality assessment) and for generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics human capabilities in assessing audio quality with non-matching references to learn perceptual embeddings without the need for human-generated labels.
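    The two generic building blocks named in the abstract above, a Euclidean distance between embeddings and a triplet loss, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the margin value, and the toy 2-D embeddings are hypothetical, and the NSIM-guided triplet selection and the embedding network from the paper are not reproduced here.

```python
import math

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (the margin value is an assumption).
    """
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings with made-up values, for illustration only.
anchor = [0.0, 0.0]
positive = [3.0, 4.0]   # distance 5 from the anchor
negative = [6.0, 8.0]   # distance 10 from the anchor

satisfied = triplet_loss(anchor, positive, negative)  # ordering already holds
violated = triplet_loss(anchor, negative, positive)   # roles swapped
```

    With the roles swapped, the loss becomes positive, which is the signal that would push the embeddings apart during training.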
    Twenty-Five Years of Evolution in Speech and Language Processing
    Michael Picheny
    Dilek Hakkani-Tur
    IEEE Signal Processing Magazine, 40 (2023), pp. 27-39
    Multi-Channel Audio Signal Generation
    W. Bastiaan Kleijn
    Michael Chinen
    ICASSP 2023 (2023)
    We present a multi-channel audio signal generation scheme based on machine learning and probabilistic modeling. We start from modeling a multi-channel single-source signal. Such signals are naturally modeled as a single-channel reference signal and a spatial-arrangement (SA) model specified by an SA parameter sequence. We focus on the SA model and assume that the reference signal is described by some parameter sequence. The SA model parameters are described with a learned probability distribution that is conditioned by the reference-signal parameter sequence and, optionally, an SA conditioning sequence. If present, the SA conditioning sequence specifies a signal class or a specific signal. The single-source method can be used for multi-source signals by applying source separation or by using an SA model that operates on non-overlapping frequency bands. Our GAN-based stereo coding implementation of the latter approach shows that our paradigm facilitates plausible high-quality rendering at a low bit rate for the SA conditioning.
    Context-Based Evaluation of the Opus Audio Codec for Spatial Audio Content in Virtual Reality
    Ben Lee
    Tomasz Rudzki
    Gavin Kearney
    Journal of the Audio Engineering Society, Volume 71, Number 4, April 2023
    Convolutional Transformer for Neural Speech Coding
    Hong-Goo Kang
    Bastiaan Kleijn
    Michael Chinen
    Audio Engineering Society Convention 155 (2023)
    In this paper, we propose a Convolutional-Transformer speech codec (ConvT-SC) which utilizes stacks of convolutions and self-attention layers to remove redundant information at the downsampling and upsampling blocks of a U-Net-style encoder-decoder neural codec architecture. We design the Transformers to use channel and temporal attention with any number of attention stages and heads while maintaining causality. This allows us to take into consideration the characteristics of the input vectors and flexibly utilize temporal and channel-wise relationships at different scales when encoding the salient information that is present in speech. This enables our model to reduce the dimensionality of its latent embeddings and improve its quantization efficiency while maintaining quality. Experimental results demonstrate that our approach achieves significantly better performance than convolution-only baselines.
    LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
    Bastiaan Kleijn
    Michael Chinen
    Neil Zeghidour
    Teerapat Jenrungrot
    ICASSP 2023 (2023)
    We introduce LMCodec, a fully-causal neural speech codec that provides high quality at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec first trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://google.github.io/chrome-media-audio-papers/publications/lmcodec.
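    The coarse-to-fine token hierarchy produced by residual vector quantization, as mentioned in the abstract above, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the codebooks are tiny hand-written lists rather than learned ones, the helper names are hypothetical, and the codec's encoder, decoder, and Transformer models are not represented.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared L2 distance."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage codes the previous residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage refines what is left.
        residual = [r - e for r, e in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected entries; fewer tokens give a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + e for o, e in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage codebooks: a coarse stage and a fine stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]]
codebooks = [coarse, fine]

tokens = rvq_encode([1.25, 1.0], codebooks)
full = rvq_decode(tokens, codebooks)             # uses coarse and fine tokens
coarse_only = rvq_decode(tokens[:1], codebooks)  # drops the fine token
```

    Decoding from the coarse token alone still gives a usable approximation, which is the property that makes it possible to transmit only the coarse tokens and predict the fine ones at the receiver.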
    A High-rate Extension to SoundStream
    Andrew Storus
    Hong-Goo Kang
    Yero Yeh
    2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2023)
    In this paper, we propose a high-rate extension of the SoundStream codec, which is able to generate almost transparent quality audio at 16 kbps for wideband speech signals. SoundStream shows reasonably good performance at low bit-rates (e.g. around 9 kbps), but its performance does not improve much when using more bits for encoding the latent embeddings. Motivated by experimental results showing that neural audio codec performance is highly related to the characteristics of latent embeddings such as dimensionality, dependency, and probability density function shape, we propose a convolutional transformer architecture and an attention-based multi-scale latent decomposition method that significantly enhances codec performance when quantizing high-dimensional embeddings. Experimental results show the superiority of our proposed model over conventional approaches.
    Ultra Low-Bitrate Speech Coding with Pretrained Transformers
    Ali Siakoohi
    Bastiaan Kleijn
    Michael Chinen
    Tom Denton
    Interspeech 2022
    Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in performance over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. Our numerical experiments show that supplementing the encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of 600 bps that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. The subjective human evaluations also suggest that the perceived quality of the resulting codec is comparable to or better than that of conventional codecs operating at 3 to 4 times the rate.
    Using Rater and System Metadata to Explain Variance in the VoiceMOS Dataset
    Alessandro Ragano
    Andrew Hines
    Chandan K. Reddy
    Michael Chinen
    Interspeech 2022
    Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable.
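    The difference between utterance-level and system-level metrics discussed in the abstract above can be sketched with a small pure-Python example. This is a hedged illustration: the helper names, scores, and system labels are made up, and the study's own evaluation code is not reproduced. System-level scores are formed here by averaging all utterances that share a system identifier, so a system with a single utterance contributes a mean based on one sample.

```python
import math
from collections import defaultdict

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mse(predicted, target):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def system_means(scores, systems):
    """Average the per-utterance scores of each system."""
    grouped = defaultdict(list)
    for score, system in zip(scores, systems):
        grouped[system].append(score)
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}

# Made-up utterance scores; system "C" has only one utterance.
predicted = [3.1, 3.3, 4.0, 4.4, 2.0]
rated = [3.0, 3.5, 4.0, 4.5, 2.5]
systems = ["A", "A", "B", "B", "C"]

utterance_srcc = srcc(predicted, rated)
utterance_mse = mse(predicted, rated)

pred_sys = system_means(predicted, systems)
rated_sys = system_means(rated, systems)
names = sorted(pred_sys)
system_srcc = srcc([pred_sys[n] for n in names], [rated_sys[n] for n in names])
```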
    Partial Monotonic Speech Quality Estimation in ViSQOL with Deep Lattice Networks
    Andrew Hines
    Michael Chinen
    Journal of the Acoustical Society of America, 149 (2021), pp. 3851-3861
    When predicting subjective quality as mean opinion score (MOS) for speech, a raw similarity score is often mapped onto the score dimension with a mapping function. Virtual Speech Quality Objective Listener (ViSQOL) uses monotonic one-dimensional mappings to evaluate speech. More recent models such as support vector regression (SVR) or deep neural networks (DNNs) use multidimensional input, which allows for a more accurate prediction, but do not provide the monotonic property that is expected. We propose to integrate a multi-dimensional mapping function using deep lattice networks (DLNs) into ViSQOL. DLNs also provide some insight into model interpretation and are robust to overfitting, leading to better out-of-sample performance. With the DLN, ViSQOL improved the speech mapping from the previous exponential mapping's 0.58 MSE to 0.24 MSE on a mixture of datasets, outperforming the 1-D fitted functions, SVR, as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well calibrated and a useful measure of uncertainty. With this quantile function, the model is able to provide useful quantile intervals for predictions instead of point estimates.