Jump to Content
Matthias Grundmann

Matthias Grundmann

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information. View details
    Preview abstract We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate the model is not only generalizable to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work which uses one additional microphone on the same device. The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing. View details
    Semi-Implicit Denoising Diffusion Models (SIDDMs)
    Yanwu Xu
    Mingming Gong
    Shaoan Xie
    Wei Wei
    Kayhan Batmanghelich
    Tingbo Hou
    NeurIPS (2023) (to appear)
    Preview abstract Despite the proliferation of generative models, achieving fast sampling during inference without compromising sample diversity and quality remains challenging. Existing models such as Denoising Diffusion Probabilistic Models (DDPM) deliver high-quality, diverse samples but are slowed by an inherently high number of iterative steps. The Denoising Diffusion Generative Adversarial Networks (DDGAN) attempted to circumvent this limitation by integrating a GAN model for larger jumps in the diffusion process. However, DDGAN encountered scalability limitations when applied to large datasets. To address these limitations, we introduce a novel approach that tackles the problem by matching implicit and explicit factors. More specifically, our approach involves utilizing an implicit model to match the marginal distributions of noisy data and the explicit conditional distribution of the forward diffusion. This combination allows us to effectively match the joint denoising distributions. Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution for the reverse step, enabling us to take large steps during inference. Similar to the DDPM but unlike DDGAN, we take advantage of the exact form of the diffusion process. We demonstrate that our proposed method obtains comparable generative performance to diffusion-based models and vastly superior results to models with a small number of sampling steps. View details
    Preview abstract StyleGAN models have been widely adopted for generating and editing face images. Yet, few work investigated running StyleGAN models on mobile devices. In this work, we introduce BlazeStyleGAN --- to the best of our knowledge, the first StyleGAN model that can run in real-time on smartphones. We design an efficient synthesis network with the auxiliary head to convert features to RGB at each level of the generator, and only keep the last one at inference. We also improve the distillation strategy with a multi-scale perceptual loss using the auxiliary heads, and an adversarial loss for the student generator and discriminator. With these optimizations, BlazeStyleGAN can achieve real-time performance on high-end mobile GPUs. Experimental results demonstrate that BlazeStyleGAN generates high-quality face images and even mitigates some artifacts from the teacher model. View details
    Guided Speech Enhancement Network
    Yang Yang
    Hakan Erdogan
    Jamie Lin
    ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
    Preview abstract High quality speech capture has been widely studied for both voice communication and human computer interface reasons. To improve the capture performance, we can often find multi-microphone speech enhancement techniques deployed on various devices. Multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering and a single-channel speech enhancement model that cleans up the beamformer output. In this work, we propose a speech enhancement solution that takes both the raw microphone and beamformer outputs as the input for an ML model. We devise a simple yet effective training scheme that allows the model to learn from the cues of the beamformer by contrasting the two inputs and greatly boost its capability in spatial rejection, while conducting the general tasks of denoising and dereverberation. The proposed solution takes advantage of classical spatial filtering algorithms instead of competing with them. By design, the beamformer module then could be selected separately and does not require a large amount of data to be optimized for a given form factor, and the network model can be considered as a standalone module which is highly transferable independently from the microphone array. We name the ML module in our solution as GSENet, short for Guided Speech Enhancement Network. We demonstrate its effectiveness on real world data collected on multi-microphone devices in terms of the suppression of noise and interfering speech. View details
    Preview abstract An authentic face restoration system is becoming increasingly demanding in many computer vision applications, e.g., image enhancement, video communication, and taking portrait. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic restoration, we propose IDM, an Iteratively learned face restoration system based on denoising Diffusion Models (DDMs). We define the criterion of an authentic face restoration system, and argue that denoising diffusion models are naturally endowed with this property from two aspects: intrinsic iterative refinement and extrinsic iterative enhancement. Intrinsic learning can preserve the content well and gradually refine the high-quality details, while extrinsic enhancement helps clean the data and improve the restoration task one step further. We demonstrate superior performance on blind face restoration tasks. Beyond restoration, we find the authentically cleaned data by the proposed restoration system is also helpful to image generation tasks in terms of training stabilization and sample quality. Without modifying the models, we achieve better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models. View details
    Efficient Heterogeneous Video Segmentation at the Edge
    Jamie Lin
    Siargey Pisarchyk
    David Cong Tian
    Tingbo Hou
    Sixth Workshop on Computer Vision for AR/VR (CV4ARVR) (2022)
    Preview abstract We introduce an efficient video segmentation system for resource-limited edge devices leveraging heterogeneous compute. Specifically, we design network models by searching across multiple dimensions of specifications for the neural architectures and operations on top of already light-weight backbones, targeting commercially available edge inference engines. We further analyze and optimize the heterogeneous data flows in our systems across the CPU, the GPU and the NPU. Our approach has empirically factored well into our real-time AR system, enabling remarkably higher accuracy with quadrupled effective resolutions, yet at much shorter end-to-end latency, much higher frame rate, and even lower power consumption on edge platforms. View details
    On-device Real-time Hand Gesture Recognition
    Chuo-Ling Chang
    Esha Uboweja
    Kanstantsin Sokal
    Valentin Bazarevsky
    ICCV Workshop on Computer Vision for Augmented and Virtual Reality, Montreal, Canada, 2021 (2021)
    Preview abstract We present an on-device real-time hand gesture recogni-tion (HGR) system, which detects a set of predefined staticgestures from a single RGB camera. The system consists oftwo parts: a hand skeleton tracker and a gesture classifier.We improve and extend MediaPipe Hands [12] for the handtracker. We experiment with two different gesture classifiers,one heuristics based and one neural network (NN) based. View details
    Instant 3D Object Tracking with Application in Augmented Reality
    Adel Ahmadyan
    Artsiom Ablavatski
    Tingbo Hou
    CVPR Fourth Workshop on Computer Vision for AR/VR (2020)
    Preview abstract Tracking object poses in 3D is an important technology in augmented reality applications. We propose an instant motion tracking system that tracks the object's pose (3D bounding box) in real-time on mobile devices. Our system does not require any prior sensory calibration or initialization sequence to perform. Objects are detected and their initial 3D pose is estimated using a deep neural network. Then the estimated pose is tracked using a robust planar tracker. Our tracker is capable of performing relative-scale 6-DoF tracking in real-time on mobile devices. By combining CPU and GPU usage efficiently, we get 25-FPS+ performance on mobile devices. View details
    Preview abstract We present Attention Mesh, a lightweight architecture for 3D face mesh prediction that uses attention to semantically meaningful regions. Our neural network is designed for real-time on-device inference and runs at over 50 FPS on a Pixel 2 phone. This solution enables applications like AR makeup, eye tracking and puppeteering that rely on highly accurate landmarks for eye and lips regions. Our main contribution is a unified network architecture that achieves the same accuracy on facial landmarks as a multi-stage cascaded approach, while being 30 percent faster. View details
    MediaPipe Hands: On-device Real-time Hand Tracking
    Andrey Vakunov
    Chuo-Ling Chang
    Fan Zhang
    Valentin Bazarevsky
    CV4ARVR 2020 (2020)
    Preview abstract We present a real-time on-device hand tracking pipeline that predicts hand skeleton from only single camera input for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark prediction. It's implemented via MediaPipe which is a cross-platform ML pipeline. The proposed architecture demonstrates realtime inference speed on mobile GPUs and a high prediction quality. MediaPipe Hands is open-sourced at https://github.com/google/mediapipe. View details
    Instant Motion Tracking and Its Applications to Augmented Reality
    Tyler Randall Mullen
    Adel Ahmadyan
    Tingbo Hou
    CVPR Workshop on Computer Vision for Augmented and Virtual Reality 2019, IEEE, Long Beach, CA
    Preview abstract Augmented Reality (AR) brings immersive experiences to users. With recent advances in computer vision and mobile computing, AR has scaled across platforms, and increased adoption in major products. A critical component of AR is to understand and track real environments. In this paper, we present a system for motion tracking, which is capable of robustly tracking planar targets and performing relative-scale 6DoF tracking without calibration. Our system runs in real-time on mobile and has been deployed in multiple major products on hundreds of millions of devices. View details
    On-Device Neural Net Inference with Mobile GPUs
    Nikolay Chirkov
    Yury Pisarchyk
    Mogan Shieh
    Fabio Riccardi
    Efficient Deep Learning for Computer Vision CVPR 2019 (ECV2019) (to appear)
    Preview abstract On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing power, thermal constraints, and energy consumption. App developers and researchers have begun exploiting hardware accelerators to overcome these challenges. Recently, device manufacturers are adding neural processing units into high-end phones for on-device inference, but these account for only a small fraction of hand-held devices. In this paper, we present how we leverage the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, to run inference of deep neural networks in real-time for both Android and iOS devices. By describing our architecture, we also discuss how to design networks that are mobile GPU-friendly. Our state-of-the-art mobile GPU inference engine is integrated into the open-source project TensorFlow Lite and publicly available at https://tensorflow.org/lite. View details
    MediaPipe: A Framework for Perceiving and Processing Reality
    Camillo Lugaresi
    Jiuqiang Tang
    Hadon Nash
    Chris McClanahan
    Esha Uboweja
    Michael Hays
    Fan Zhang
    Chuo-Ling Chang
    Ming Yong
    Wan-Teh Chang
    Wei Hua
    Manfred Georg
    Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR) 2019
    Preview abstract Building an application that processes perceptual inputs involves more than running an ML model. Developers have to harness the capabilities of a wide range of devices; balance resource usage and quality of results; run multiple operations in parallel and with pipelining; and ensure that time-series data is properly synchronized. The MediaPipe framework addresses these challenges. A developer can use MediaPipe to easily and rapidly combine existing and new perception components into prototypes and advance them to polished cross-platform applications. The developer can configure an application built with MediaPipe to manage resources efficiently (both CPU and GPU) for low latency performance, to handle synchronization of time-series data such as audio and video frames and to measure performance and resource consumption. We show that these features enable a developer to focus on the algorithm or model development, and use MediaPipe as an environment for iteratively improving their application, with results reproducible across different devices and platforms. MediaPipe will be open-sourced at https://github.com/google/mediapipe. View details
    BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
    Valentin Bazarevsky
    Andrey Vakunov
    CVPR Workshop on Computer Vision for Augmented and Virtual Reality 2019, IEEE, Long Beach, CA (2019)
    Preview abstract We present BlazeFace, a lightweight and well-performing face detector tailored for mobile GPU inference. It runs at a speed of 200--1000+ FPS on flagship devices. This super-realtime performance enables it to be applied to any augmented reality pipeline that requires an accurate facial region of interest as an input for task-specific models, such as 2D/3D facial keypoint or geometry estimation, facial features or expression classification, and face region segmentation. Our contributions include a lightweight feature extraction network inspired by, but distinct from MobileNetV1/V2, a GPU-friendly anchor scheme modified from Single Shot MultiBox Detector (SSD), and an improved tie resolution strategy alternative to non-maximum suppression. View details
    Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs
    Artsiom Ablavatski
    CVPR Workshop on Computer Vision for Augmented and Virtual Reality 2019, IEEE, Long Beach, CA
    Preview abstract We present an end-to-end neural network-based model for inferring the approximate 3D mesh representation of a human face from single camera input for AR applications. The relatively dense mesh model of 468 vertices is well-suited for face-based AR effects. The proposed model demonstrates super-realtime inference speed on mobile GPUs (100--1000+ FPS, depending on the device and model variant) and a high prediction quality that is comparable to the variance in manual annotations of the same image. View details
    Weakly Supervised Learning of Object Segmentations from Web-Scale Video
    Glenn Hartmann
    Judy Hoffman
    David Tsai
    Omid Madani
    James Rehg
    Rahul Sukthankar
    ECCV'12 Proceedings of the 12th international conference on Computer Vision - Volume Part I, Springer-Verlag, Berlin, Heidelberg (2012), pp. 198-208
    Preview abstract We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. The object seeds obtained using segment-level classifiers are further refined using graphcuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube. View details
    Calibration-Free Rolling Shutter Removal
    Daniel Castro
    International Conference on Computational Photography [Best Paper], IEEE (2012)
    Preview abstract We present a novel algorithm for efficient removal of rolling shutter distortions in uncalibrated streaming videos. Our proposed method is calibration free as it does not need any knowledge of the camera used, nor does it require calibration using specially recorded calibration sequences. Our algorithm can perform rolling shutter removal under varying focal lengths, as in videos from CMOS cameras equipped with an optical zoom. We evaluate our approach across a broad range of cameras and video sequences demonstrating robustness, scaleability, and repeatability. We also conducted a user study, which demonstrates preference for the output of our algorithm over other state-of-the art methods. Our algorithm is computationally efficient, easy to parallelize, and robust to challenging artifacts introduced by various cameras with differing technologies. View details
    Preview abstract We present a novel algorithm for automatically applying constrainable, L1-optimal camera paths to generate stabilized videos by removing undesired motions. Our goal is to compute camera paths that are composed of constant, linear and parabolic segments mimicking the camera motions employed by professional cinematographers. To this end, our algorithm is based on a linear programming framework to minimize the first, second, and third derivatives of the resulting camera path. Our method allows for video stabilization beyond the conventional filtering of camera paths that only suppresses high frequency jitter. We incorporate additional constraints on the path of the camera directly in our algorithm, allowing for stabilized and retargeted videos. Our approach accomplishes this without the need of user interaction or costly 3D reconstruction of the scene, and works as a post-process for videos from any camera or from an online source. View details
    Preview abstract We introduce a new algorithm for video retargeting that uses discontinuous seam-carving in both space and time for resizing videos. We propose a novel appearance-based temporal coherence formulation that allows for frame-by-frame processing and results in temporally discontinuous seams, as opposed to geometrically smooth and continuous seams. This formulation optimizes the difference in appearance of the resultant retargeted frame to the optimal temporally coherent one, and allows for carving around fast moving salient regions. Additionally, we generalize the idea of appearance-based coherence to the spatial domain by introducing piece-wise spatial seams. Our spatial coherence measure minimizes the change in gradients during retargeting, which preserves spatial detail better than minimization of color difference alone. We also show that retargeting based on per-frame saliency (gradient-based or feature-based) does not always lead to desirable results and propose a novel automatically computed measure of spatio-temporal saliency. As needed, the user can also augment the saliency by interactive region-brushing. Our retargeting algorithm processes the video sequentially, which allows us to deal with streaming videos. We demonstrate results over a wide range of video examples and evaluate the effectiveness of each component of our algorithm. View details
    Preview abstract We present an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. We begin by over-segmenting a volumetric video graph into space-time regions grouped by appearance. We then construct a ``region graph" over the obtained segmentation and iteratively repeat this process over multiple levels to create a tree of spatio-temporal segmentations. This hierarchical approach generates high quality segmentations which are temporally coherent with stable region boundaries. Additionally, the resulting segmentation hierarchy allows subsequent applications to choose from varying levels of granularity. We further improve segmentation quality by using dense optical flow when constructing the initial graph. We also propose two novel approaches to improve the scalability of our technique: (a) a parallel out-of-core algorithm that can process volumes much larger than an in-core algorithm, and (b) a clip-based processing algorithm that divides the video into overlapping clips in time, and segments them successively while enforcing consistency. We can segment video shots as long as 40 seconds without compromising quality, and even support a streaming mode for arbitrarily long videos, albeit without the ability to process them hierarchically. View details
    Post-processing Approach for Radiometric Self-Calibration of Video
    Chris McClanahan
    Sing Bing Kang
    International Conference on Computational Photography, IEEE (2013)
    Motion Fields to Predict Play Evolution in Dynamic Sport Scenes
    Kihwan Kim
    Ariel Shamir
    Iain Matthews
    Jessica Hodgins
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)
    3D Shape Context and Distance Transform for Action Recognition
    Franziska Meier
    International Conference on Pattern Recognition (ICPR) (2008)