Feng Yang

Feng Yang is a Senior Staff Software Engineer at Google Research, where he leads the AIM Engine team. He was previously a postdoctoral researcher at the Illumination & Imaging Lab, Robotics Institute, CMU, supervised by Prof. Srinivasa Narasimhan. He interned at Intel Research Center, Beijing, advised by Eric Li, and at Nokia Research, Palo Alto, CA, advised by Kari Pulli. He was a Research Assistant with the Broadband Network and Digital Multimedia Laboratory, Tsinghua University, advised by Prof. Qionghai Dai, member of the Chinese Academy of Engineering, and with the Audiovisual Communications Laboratory, EPFL, advised by Prof. Martin Vetterli, President of EPFL and member of the US National Academy of Engineering. His research interests include deep learning, image quality assessment, digital rights management, image and video synthesis, image and video processing, computational photography, video streaming, distributed video coding, sampling theories, and mobile sensing. Most recently, he has been working on building the next generation of image and video processing, serving, and storage systems with AI. Many of his algorithms have launched in Google's products, where they have significantly increased the company's revenue and reduced costs. Feng Yang received the B.Eng. and M.Eng. degrees in automatic control from Tsinghua University, Beijing, China, in 2004 and 2007, respectively, and won Tsinghua University's Outstanding Master's Thesis award. He received the Ph.D. degree in communication systems from the École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, in 2012, and won the Fritz Kutter Award for best PhD thesis. He holds several patents, some of which were sold to Rambus Inc.

Authored Publications
    Rich Human Feedback for Text to Image Generation
    Katherine Collins
    Nicholas Carolan
    Yang Li
    Youwei Liang
    Peizhao Li
    Dj Dvijotham
    Junfeng He
    Sarah Young
    Jiao Sun
    Arseniy Klimovskiy
    Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior work collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which keywords in the text prompt are not represented in the image. We collect such rich human feedback on 18K generated images and train a multimodal transformer to predict this rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants).
    SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
    Ligong Han
    Han Zhang
    Dimitris Metaxas
    IEEE/CVF International Conference on Computer Vision (ICCV) (2023)
    Diffusion models have achieved remarkable success in text-to-image generation, enabling the creation of high-quality images from text prompts or other modalities. However, existing methods for customizing these models are limited in handling multiple personalized subjects and by the risk of overfitting. Moreover, their large number of parameters is inefficient for model storage. In this paper, we propose a novel approach to address these limitations in existing text-to-image diffusion models for personalization. Our method involves fine-tuning the singular values of the weight matrices, leading to a compact and efficient parameter space that reduces the risk of overfitting and language-drifting. We also propose a Cut-Mix-Unmix data-augmentation technique to enhance the quality of multi-subject image generation and a simple text-based image editing framework. Our proposed SVDiff method has a significantly smaller model size (1.7MB for StableDiffusion) compared to existing methods (vanilla DreamBooth 3.66GB, Custom Diffusion 73MB), making it more practical for real-world applications.
    DVMark: A Deep Multiscale Network for Video Watermarking
    Xiyang Luo
    Huiwen Chang
    Ce Liu
    IEEE Transactions on Image Processing (2023)
    Video watermarking embeds a message into a cover video in an imperceptible manner, which can be retrieved even if the video undergoes certain modifications or distortions. Traditional watermarking methods are often manually designed for particular types of distortions and thus cannot simultaneously handle a broad spectrum of distortions. To this end, we propose a robust deep learning-based solution for video watermarking that is end-to-end trainable. Our model consists of a novel multiscale design where the watermarks are distributed across multiple spatial-temporal scales. Extensive evaluations on a wide variety of distortions show that our method outperforms traditional video watermarking methods as well as deep image watermarking models by a large margin. We further demonstrate the practicality of our method on a realistic video-editing application.
    Deep 3D-to-2D Watermarking: Embedding Messages in 3D Meshes and Extracting Them from 2D Renderings
    Ce Liu
    Huiwen Chang
    Innfarn Yoo
    Ondrej Stava
    Xiyang Luo
    Computer Vision and Pattern Recognition (2022)
    Digital watermarking is widely used for copyright protection. Traditional 3D watermarking approaches or commercial software are typically designed to embed messages into 3D meshes, and later retrieve the messages directly from distorted/undistorted watermarked 3D meshes. However, in many cases, users only have access to rendered 2D images instead of 3D meshes. Unfortunately, retrieving messages from 2D renderings of 3D meshes is still challenging and underexplored. We introduce a novel end-to-end learning framework to solve this problem through: 1) an encoder to covertly embed messages in both mesh geometry and textures; 2) a differentiable renderer to render watermarked 3D objects from different camera angles and under varied lighting conditions; 3) a decoder to recover the messages from 2D rendered images. From our experiments, we show that our model can learn to embed information visually imperceptible to humans, and to retrieve the embedded information from 2D renderings that undergo 3D distortions. In addition, we demonstrate that our method can also work with other renderers, such as ray tracers and real-time renderers with and without fine-tuning.
    MAXIM: Multi-Axis MLP for Image Processing
    Zhengzhong Tu
    Han Zhang
    Alan Bovik
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
    Recent progress on Transformers and MLP-like models has shown new architecture design paradigms on many computer vision tasks. However, the efficacy and efficiency of these models for low-level vision tasks have not been studied extensively. In this paper, we present MAXIM, a general image processing architecture with multi-axis gated MLPs, to advance the possibility of global operators for low-level vision. Our single-stage MAXIM backbone shares a UNet-shaped hierarchy structure and enjoys a long-range interaction brought by spatial-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks. First, we devise a multi-axis gated MLP that allows efficient and scalable spatial mixing of local and global information. Second, we propose a cross-gating block, an alternative to cross-attention, which accounts for cross-example mutual conditioning. Both modules are exclusively based on MLPs, but benefit from being both global and "fully convolutional," two desired properties for low-level vision tasks. Our extensive experimental results show that our proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks including denoising, deblurring, deraining, dehazing, and enhancement with fewer or comparable parameters and FLOPs.
    MaxViT: Multi-Axis Vision Transformer
    Zhengzhong Tu
    Han Zhang
    Alan Bovik
    European Conference on Computer Vision (ECCV) (2022)
    Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to “see” globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
    Most video super-resolution methods focus on restoring high-resolution video frames from low-resolution videos without taking into account compression. However, most videos on the web or mobile devices are compressed, and the compression can be severe when the bandwidth is limited. In this paper, we propose a new compression-informed video super-resolution model to restore high-resolution content without introducing artifacts caused by compression. The proposed model consists of three modules for video super-resolution: bi-directional recurrent warping, detail-preserving flow estimation, and Laplacian enhancement. All three modules are used to deal with compression properties such as the location of the intra-frames in the input and smoothness in the output frames. For thorough performance evaluation, we conducted extensive experiments on standard datasets with a wide range of compression rates, covering many real video use cases. We showed that our method not only recovers high-resolution content on uncompressed frames from the widely-used benchmark datasets, but also achieves state-of-the-art performance in super-resolving compressed videos based on numerous quantitative metrics. We also evaluated the proposed method by simulating streaming from YouTube to demonstrate its effectiveness and robustness. The source code and trained models are available at https://github.com/google-research/googleresearch/tree/master/comisr.
    MUSIQ: Multi-scale Image Quality Transformer
    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
    Image quality assessment (IQA) is an important research topic for understanding and improving visual experience. The current state-of-the-art IQA methods are based on convolutional neural networks (CNNs). The performance of CNN-based models is often compromised by the fixed shape constraint in batch training. To accommodate this, the input images are usually resized and cropped to a fixed shape, causing image quality degradation. To address this, we design a multi-scale image quality Transformer (MUSIQ) to process native resolution images with varying sizes and aspect ratios. With a multi-scale image representation, our proposed method can capture image quality at different granularities. Furthermore, a novel hash-based 2D spatial embedding and a scale embedding are proposed to support the positional embedding in the multi-scale representation. Experimental results verify that our method can achieve state-of-the-art performance on multiple large scale IQA datasets such as PaQ-2-PiQ, SPAQ, and KonIQ-10k.
    JPEG is an old yet popular image compression format, supported by all imaging devices and software packages. A key ingredient governing its performance are the two quantization tables (for Luma and Chroma) that dictate the loss induced on each DCT coefficient. Past work has offered various ideas for better tuning these tables, mainly focusing on rate-distortion performance and using derivative-free optimization techniques. This work offers a novel optimal tuning of these tables via continuous optimization, leveraging a differential implementation of both the JPEG encoder-decoder and an entropy estimator. This enables us to offer a unified framework that considers the interplay between four performance measures: rate, distortion, perceptual quality, and classification accuracy. We also propose a deep-neural network design that automatically assigns optimized quantization tables to each incoming image. In all these fronts, we report a substantial boost in performance by a simple and easily implemented modification of these tables.
    We study the effect of normalization on single domain generalization, the goal of which is to learn a model that performs well on many unseen domains with only single domain data for training. We propose a new type of normalization, LSLR, that has an adaptive form that generalizes other normalizations. The key idea is to learn both the standardization and rescaling statistics for normalization with neural networks. This new normalization has better adaptivity and is capable of helping the model generalize better for single domain generalization with a robust objective. Combined with adversarial domain augmentation methods, we can optimize the robust objective approximately. We show that our method consistently outperforms the baselines and achieves state-of-the-art results on three standard benchmarks for single domain generalization.