Jump to Content
Feng Yang

Feng Yang

Feng Yang is a Senior Staff Software Engineer at Google Research, where he leads the AIM Engine team. He was a postdoctoral researcher at Illumination&Imaging Lab, Robotics Institute, CMU, supervised by Prof. Srinivasa Narasimhan. He interned at Intel Research Center, Beijing advised by Eric Li and Nokia Research, Palo Alto, CA advised by Kari Pulli. He was a Research Assistant with the Broadband Network and Digital Multimedia Laboratory, Tsinghua University, advised by Prof. Qionghai Dai, member of Chinese Academy of Engineering and Audiovisual Communications Laboratory, EPFL, advised by Prof. Martin Vetterli, President of EPFL and member the US National Academy of Engineering. His research interests include deep learning, image quality assessment, digital rights management, image and video synthesis, image and video processing, computational photography, video streaming, distributed video coding, sampling theories, and mobile sensing. Most recently, he is working on building the next generation image/video processing, serving and storage system with AI. A lot of his algorithms are landed in Google's products and significantly increased the company's revenue and reduced cost. Feng Yang received the B.Eng. and M.Eng. degrees in automatic control from Tsinghua University, Beijing, China, in 2004 and 2007, respectively. He won Outstanding master thesis of Tsinghua University. He got Ph.D. degree in communication systems at the École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland in 2012. He won Fritz Kutter Award for best PhD thesis. He holds several patents and some of them sold to Rambus Inc.

Research Areas

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
    Ligong Han
    Han Zhang
    Dimitris Metaxas
    IEEE/CVF International Conference on Computer Vision (ICCV) (2023)
    Preview abstract Diffusion models have achieved remarkable success in text-to-image generation, enabling the creation of high-quality images from text prompts or other modalities. However, existing methods for customizing these models are limited by handling multiple personalized subjects and the risk of overfitting. Moreover, their large number of parameters is inefficient for model storage. In this paper, we propose a novel approach to address these limitations in existing text-to-image diffusion models for personalization. Our method involves fine-tuning the singular values of the weight matrices, leading to a compact and efficient parameter space that reduces the risk of overfitting and language-drifting. We also propose a Cut-Mix-Unmix data-augmentation technique to enhance the quality of multi-subject image generation and a simple text-based image editing framework. Our proposed SVDiff method has a significantly smaller model size (1.7MB for StableDiffusion) compared to existing methods (vanilla DreamBooth 3.66GB, Custom Diffusion 73MB), making it more practical for real-world applications. View details
    Preview abstract Video watermarking embeds a message into a cover video in an imperceptible manner, which can be retrieved even if the video undergoes certain modifications or distortions. Traditional watermarking methods are often manually designed for particular types of distortions and thus cannot simultaneously handle a broad spectrum of distortions. To this end, we propose a robust deep learning-based solution for video watermarking that is end-to-end trainable. Our model consists of a novel multiscale design where the watermarks are distributed across multiple spatial-temporal scales. Extensive evaluations on a wide variety of distortions show that our method outperforms traditional video watermarking methods as well as deep image watermarking models by a large margin. We further demonstrate the practicality of our method on a realistic video-editing application. View details
    Preview abstract Digital watermarking is widely used for copyright protection. Traditional 3D watermarking approaches or commercial software are typically designed to embed messages into 3D meshes, and later retrieve the messages directly from distorted/undistorted watermarked 3D meshes. However, in many cases, users only have access to rendered 2D images instead of 3D meshes. Unfortunately, retrieving messages from 2D renderings of 3D meshes is still challenging and underexplored. We introduce a novel end-toend learning framework to solve this problem through: 1) an encoder to covertly embed messages in both mesh geometry and textures; 2) a differentiable renderer to render watermarked 3D objects from different camera angles and under varied lighting conditions; 3) a decoder to recover the messages from 2D rendered images. From our experiments, we show that our model can learn to embed information visually imperceptible to humans, and to retrieve the embedded information from 2D renderings that undergo 3D distortions. In addition, we demonstrate that our method can also work with other renderers, such as ray tracers and real-time renderers with and without fine-tuning. View details
    Preview abstract Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to “see” globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit. View details
    MAXIM: Multi-Axis MLP for Image Processing
    Han Zhang
    Alan Bovik
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
    Preview abstract Recent progress on Transformers and MLP-like models has shown new architecture design paradigms on many computer vision tasks. However, efficacy and efficiency of these models for low-level vision tasks have not been studied extensively. In this paper, we present MAXIM, a general image processing architecture with multi-axis gated MLPs, to advance the possibility of global operators for low-level vision. Our single-stage MAXIM backbone shares a UNet-shaped hierarchy structure and enjoys a long-range interaction brought by spatial-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks. First, we devise a multi-axis gated MLP that allows efficient and scalable spatial mixing of local and global information. Second, we propose a cross-gating block, an alternative to cross-attention, which accounts for cross-example mutual conditioning. Both modules are exclusively based on MLPs, but benefit from being both global and `fully-convolutional,' two desired properties for low-level vision tasks. Our extensive experimental results show that our proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks including denoising, deblurring, deraining, dehazing, and enhancement with less or comparable parameters and FLOPs. View details
    Preview abstract We study the effect of normalization on single domain generalization, the goal of which is to learn a model that performs well on many unseen domains with only single do-main data for training. We propose a new type of normalization, LSLR , that has an adaptive form that generalizes other normalizations. The key idea is to learn both the standardization and rescaling statistics for normalization with neural networks. This new normalization has better adaptivity and is capable of helping model generalize better for single domain generalization with a robust objective. Combined with adversarial domain augmentation methods, we can optimize the robust objective approximately. We show that our method consistently outperforms the baselines and achieves state-of-the-art results on three standard bench-marks for single domain generalization. View details
    Multi-path Neural Networks for On-device Multi-domain Visual Classification
    Andrew Howard
    Gabriel M. Bender
    Grace Chu
    Jeff Gilbert
    Joshua Greaves
    Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2021), pp. 3019-3028
    Preview abstract Learning multiple domains/tasks with a single model is important for improving data efficiency and lowering inference cost for numerous vision tasks, especially on resource-constrained mobile devices. However, hand-crafting a multi-domain/task model can be both tedious and challenging. This paper proposes a novel approach to automatically learn a multi-path network for multi-domain visual classification on mobile devices. The proposed multi-path network is learned from neural architecture search by applying one reinforcement learning controller for each domain to select the best path in the super-network created from a MobileNetV3-like search space. An adaptive balanced domain prioritization algorithm is proposed to balance optimizing the joint model on multiple domains simultaneously. The determined multi-path model selectively shares parameters across domains in shared nodes while keeping domain-specific parameters within non-shared nodes in individual domain paths. This approach effectively reduces the total number of parameters and FLOPS, encouraging positive knowledge transfer while mitigating negative interference across domains. Extensive evaluations on the Visual Decathlon dataset demonstrate that the proposed multi-path model achieves state-of-the-art performance in terms of accuracy, model size, and FLOPS against other approaches using MobileNetV3-like architectures. Furthermore, the proposed method improves average accuracy over learning single-domain models individually, and reduces the total number of parameters and FLOPS by 78% and 32% respectively, compared to the approach that simply bundles single-domain models for multi-domain learning. View details
    Preview abstract Most video super-resolution methods focus on restoring high-resolution video frames from low-resolution videos without taking into account compression. However, most videos on the web or mobile devices are compressed, and the compression can be severe when the bandwidth is limited. In this paper, we propose a new compression-informed video super-resolution model to restore high-resolution content without introducing artifacts caused by compression. The proposed model consists of three modules for video super-resolution: bi-directional recurrent warping, detail-preserving flow estimation, and Laplacian enhancement. All these three modules are used to deal with compression properties such as the location of the intra-frames in the input and smoothness in the output frames. For thorough performance evaluation, we conducted extensive experiments on standard datasets with a wide range of compression rates, covering many real video use cases. We showed that our method not only recovers high-resolution content on uncompressed frames from the widely-used benchmark datasets, but also achieves state-of-the-art performance in super-resolving compressed videos based on numerous quantitative metrics. We also evaluated the proposed method by simulating streaming from YouTube to demonstrate its effectiveness and robustness. The source codes and trained models are available at https://github.com/google-research/googleresearch/tree/master/comisr. View details
    Preview abstract Lossy Image compression is necessary for efficient storage and transfer of data. Typically the trade-off between bit-rate and quality determines the optimal compression level. This makes the image quality metric an integral part of any imaging system. While the existing full-reference metrics such as PSNR and SSIM may be less sensitive to perceptual quality, the recently introduced learning methods may fail to generalize to unseen data. In this paper we propose the largest image compression quality dataset to date with human perceptual preferences, enabling the use of deep learning, and we develop a full reference perceptual quality assessment metric for lossy image compression that outperforms the existing state-of-the-art methods. We show that the proposed model can effectively learn from thousands of examples available in the new dataset, and consequently it generalizes better to other unseen datasets of human perceptual preference. The CIQA dataset can be found at https://github.com/googleresearch/google-research/tree/master/CIQA View details
    MUSIQ: Multi-scale Image Quality Transformer
    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
    Preview abstract Image quality assessment (IQA) is an important research topic for understanding and improving visual experience. The current state-of-the-art IQA methods are based on convolutional neural networks (CNNs). The performance of CNN-based models is often compromised by the fixed shape constraint in batch training. To accommodate this, the input images are usually resized and cropped to a fixed shape, causing image quality degradation. To address this, we design a multi-scale image quality Transformer (MUSIQ) to process native resolution images with varying sizes and aspect ratios. With a multi-scale image representation, our proposed method can capture image quality at different granularities. Furthermore, a novel hash-based 2D spatial embedding and a scale embedding is proposed to support the positional embedding in the multi-scale representation. Experimental results verify that our method can achieve state-of-the-art performance on multiple large scale IQA datasets such as PaQ-2-PiQ, SPAQ, and KonIQ-10k. View details
    Preview abstract Could we compress images via standard codecs while avoiding visible artifacts? The answer is obvious -- this is doable as long as the bit budget is generous enough. What if the allocated bit-rate for compression is insufficient? Then unfortunately, artifacts are a fact of life. Many attempts were made over the years to fight this phenomenon, with various degrees of success. In this work we aim to break the unholy connection between bit-rate and image quality, and propose a way to circumvent compression artifacts by pre-editing the incoming image and modifying its content to fit the given bits. We design this editing operation as a learned convolutional neural network, and formulate an optimization problem for its training. Our loss takes into account a proximity between the original image and the edited one, a bit-budget penalty over the proposed image, and a no-reference image quality measure for forcing the outcome to be visually pleasing. The proposed approach is demonstrated on the popular JPEG compression, showing savings in bits and/or improvements in visual quality, obtained with intricate editing effects. View details
    Preview abstract JPEG is an old yet popular image compression format, sup-ported by all imaging devices and software packages. A key ingredientgoverning its performance are the two quantization tables (for Luma andChroma) that dictate the loss induced on each DCT coefficient. Pastwork has offered various ideas for better tuning these tables, mainly fo-cusing on rate-distortion performance and using derivative-free optimiza-tion techniques. This work offers a novel optimal tuning of these tablesvia continuous optimization, leveraging a differential implementation ofboth the JPEG encoder-decoder and an entropy estimator. This enablesus to offer a unified framework that considers the interplay between fourperformance measures: rate, distortion, perceptual quality, and classi-fication accuracy. We also propose a deep-neural network design thatautomatically assigns optimized quantization tables to each incomingimage. In all these fronts, we report a substantial boost in performanceby a simple and easily implemented modification of these tables. View details
    Preview abstract Video quality assessment for User Generated Content (UGC) is an important topic in both industry and academia. Most existing methods only focus on one aspect of the perceptual quality assessment, such as technical quality or compression artifacts. In this paper, we create a large scale dataset to comprehensively investigate characteristics of generic UGC video quality. Besides the subjective ratings and content labels of the dataset, we also propose a DNN-based framework to thoroughly analyze importance of content, technical quality, and compression level in perceptual quality. Our model is able to provide quality scores as well as human-friendly quality indicators, to bridge the gap between low level video signals to human perceptual quality. Experimental results show that our model achieves state-of-the-art correlation with Mean Opinion Scores (MOS). View details
    Distortion Blind Deep Watermarking
    Ruohan Zhan
    Huiwen Chang
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
    Preview abstract Watermarking is the process of embedding information into an image that can survive under distortions, while requiring the encoded image to have little or no perceptual difference with the original image. Recently, deep learning-based methods achieved impressive results in both visual quality and message payload under a wide variety of image distortions. However, these methods all require differentiable models for the image distortions at training time, and may generalize poorly to unknown distortions. This is undesirable since the types of distortions applied to watermarked images are usually unknown and non-differentiable. In this paper, we propose a new framework for distortion-agnostic watermarking, where the image distortion is not explicitly modeled during training. Instead, the robustness of our system comes from two sources: adversarial training and channel coding. Compared to training on a fixed set of distortions and noise levels, our method achieves comparable or better results on distortions available during training, and better performance overall on unknown distortions. View details
    GIFnets: An end-to-end neural network based GIF encoding framework
    Innfarn Yoo
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle (2020)
    Preview abstract Graphics Interchange Format (GIF) is a widely used image file format. Due to the limited number of palette colors, GIF encoding often introduces color banding artifacts. Traditionally, dithering is applied to reduce color banding, but introducing dotted-pattern artifacts. To reduce artifacts and provide a better and more efficient GIF encoding, we introduce a differentiable GIF encoding pipeline, which includes three novel neural networks: PaletteNet, DitherNet, and BandingNet. Each of these three networks provides an important functionality within the GIF encoding pipeline. PaletteNet predicts a near-optimal color palette given an input image. DitherNet manipulates the input image to reduce color banding artifacts and provides an alternative to traditional dithering. Finally, BandingNet is designed to detect color banding, and provides a new perceptual loss specifically for GIF images. As far as we know, this is the first fully differentiable GIF encoding pipeline based on deep neural networks and compatible with existing GIF decoders. User study shows that our algorithm is better than Floyd-Steinberg based GIF encoding. View details
    Preview abstract Todays video transcoding pipelines choose transcoding parameters based on Rate-Distortion curves, which mainlyfocuses on the relative quality difference between original and transcoded videos. By investigating recentlyreleased YouTube UGC dataset, we found that people were more tolerant to the quality changes in low qualityinputs than in high quality inputs, which suggests that current transcoding framework could be further optimizedby considering input perceptual quality. An efficient machine learning based metric was proposed to detect lowquality inputs, whose bitrate can be further reduced without hurting perceptual quality. To evaluate the impacton perceptual quality, we conducted a crowd-sourcing subjective experiment, and provided a methodology toevaluate statistical significance among different treatments. The results showed that the proposed quality guidedtranscoding framework is able to reduce the average bitrate upto 5% with insignificant quality degradation. View details
    Super-resolving Commercial Satellite Imagery Using Realistic Training Data
    Xiang Zhu
    Xinwei Shi
    IEEE International Conference on Image Processing 2020 (2020)
    Preview abstract In machine learning based single image super-resolution, the degradation model is embedded in training data generation. However, most existing satellite image super-resolution methods use a simple down-sampling model with a fixed kernel to create training images. These methods work fine on synthetic data, but do not perform well on real satellite images. We proposed a realistic training data generation model for commercial satellite imagery products, which includes not only the imaging process on satellites but also the post-process on the ground. We also proposed a convolutional neural network optimized for satellite images. Experiments show that the proposed training data generation model is able to improve super-resolution performance on real satellite images. View details
    The Gigavision Camera
    Edoardo Charbon
    Sabine Susstrunk
    Martin Vetterli
    Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE (2009)
    Preview abstract We propose a new image device called gigavision camera. The main differences between a conventional and a gigavision camera are that the pixels of the gigavision camera are binary and orders of magnitude smaller. A gigavision camera can be built using standard memory chip technology, where each memory bit is designed to be light sensitive. A conventional gray level image can be obtained from the binary gigavision image by low-pass filtering and sampling. The main advantage of the gigavision camera is that its response is non-linear and similar to a logarithmic function, which makes it suitable for acquiring high dynamic range scenes. The larger the number of binary pixels considered, the higher the dynamic range of the gigavision camera will be. In addition, the binary sensor of the gigavision camera can be combined with a lens array in order to realize an extremely thin camera. Due to the small size of the pixels, this design does not require deconvolution techniques typical of similar systems based on conventional sensors. View details
    Image Reconstruction in the Gigavision Camera
    Edoardo Charbon
    Sabine Susstrunk
    Martin Vetterli
    ICCV workshop OMNIVIS 2009
    Preview abstract The main feature of this camera is that the pixels have a binary response. The response function of a gigavision sensor is non-linear and similar to a logarithmic function, which makes the camera suitable for high dynamic range imaging. Since the sensor can detect a single photon, the camera is very sensitive and can be used for night vision and astronomical imaging. One important aspect of the gigavision camera is how to estimate the light intensity through binary observations. We model the light intensity field as 2D piecewise constant and use Maximum Penalized Likelihood Estimation (MPLE) to recover it. Dynamic programming is used to solve the optimization problem. Due to the complex computation of dynamic programming, greedy algorithm and pruning quadtrees are proposed. They show acceptable reconstruction performance with low computational complexity. Experimental results with synthesized images and real images taken by a single-photon avalanche diode (SPAD) camera are given. View details
    No Results Found