Li Zhang

I am a software engineer at Google, working on computer vision and machine learning research and product development. My work has contributed to Google products such as Lens Blur in Google Camera and pose estimation in Cardboard Camera. I led a computer vision team for Google Clips, developing image understanding, synthesis, and enhancement techniques, some of which have been published at research conferences. I am now working on vision technologies for Google Pixel phones and the Google Photos service. Prior to joining Google, I was a computer science faculty member at the University of Wisconsin-Madison, researching computer vision and graphics. My academic work and awards are summarized on this page.
Authored Publications
    FrameQuant: Flexible Low-Bit Quantization for Transformers
    Harshavardhan Adepu
    Zhanpeng Zeng
    Vikas Singh
    International Conference on Machine Learning (2024)
    Transformers are the backbone of powerful foundation models for many vision and natural language processing tasks. Their compute and memory/storage footprint is large, however, so serving such models is expensive and often requires high-end hardware. To mitigate this difficulty, post-training quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but in the Fusion Frame representation. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent-recovery and noise-robustness guarantees. Further, if desired, denoising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization of Transformer models promises sizable efficiency gains.
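    As a rough illustration of quantizing in a frame representation rather than the original weight space, the sketch below builds a simple Parseval (tight) frame from a random orthogonal matrix, quantizes the frame coefficients to two bits, and reconstructs the weights. This is only a toy sketch under those assumptions; the paper's actual Fusion Frame construction and recovery guarantees are more involved, and all names here are illustrative.

```python
# Toy sketch: quantize in a redundant frame representation, not in weight space.
import numpy as np

def random_tight_frame(d, n, seed=0):
    """Columns of a random orthogonal matrix give an n x d Parseval frame (F.T @ F = I)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q[:, :d]

def quantize_uniform(x, bits=2):
    """Uniform quantizer over the value range of x."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def frame_quantize(weights, redundancy=2.0, bits=2):
    """Quantize a weight matrix in a redundant frame representation."""
    d = weights.shape[0]
    n = int(np.ceil(redundancy * d))
    frame = random_tight_frame(d, n)          # analysis operator, n x d
    coeffs = frame @ weights                  # frame coefficients, n x cols
    coeffs_q = quantize_uniform(coeffs, bits=bits)
    return frame.T @ coeffs_q                 # Parseval frames reconstruct linearly

weights = np.random.randn(64, 64).astype(np.float32)
approx = frame_quantize(weights, redundancy=2.0, bits=2)
print("relative error:", np.linalg.norm(weights - approx) / np.linalg.norm(weights))
```

    With a redundant frame, the linear reconstruction spreads the per-coefficient quantization noise across the redundancy, which is the kind of robustness the frame-theoretic guarantees mentioned in the abstract formalize.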
    BasisNet: Two-Stage Model Synthesis for Efficient Inference
    Chun-Te Chu
    Andrew Howard
    Yukun Zhu
    Rebecca Hwa
    Adriana Kovashka
    CVPR Workshop on Efficient Deep Learning for Computer Vision (ECV) (2021)
    In this work, we present BasisNet, which combines recent advances in efficient neural network architectures, conditional computation, and early termination in a simple new form. Our approach uses a lightweight model to preview the input and generate input-dependent combination coefficients, which then control the synthesis of a more accurate specialist model that makes the final prediction. The two-stage model synthesis strategy can be applied to any network architecture, and both stages are trained jointly. We also show that proper training recipes are critical for improving the generalizability of such high-capacity neural networks. On the ImageNet classification benchmark, our BasisNet with MobileNets as the backbone demonstrated a clear advantage in the accuracy-efficiency trade-off over several strong baselines. Specifically, BasisNet-MobileNetV3 obtained 80.3% top-1 accuracy with only 290M multiply-add operations, halving the computational cost of the previous state of the art without sacrificing accuracy. With early termination, the average cost can be further reduced to 198M MAdds while maintaining 80.0% accuracy on ImageNet.
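    As a rough, hypothetical sketch of the two-stage idea (not the paper's code), the snippet below has a lightweight model preview a downsampled input and emit combination coefficients, which then mix a bank of basis convolution kernels into a per-example specialist convolution. Module names and sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasisConv2d(nn.Module):
    """Convolution whose kernel is synthesized per example from basis kernels."""
    def __init__(self, in_ch, out_ch, num_bases=8, k=3):
        super().__init__()
        self.bases = nn.Parameter(0.02 * torch.randn(num_bases, out_ch, in_ch, k, k))

    def forward(self, x, coeffs):
        # coeffs: (batch, num_bases); mix the basis kernels for each example.
        outs = []
        for xi, ci in zip(x, coeffs):
            kernel = torch.einsum('n,noihw->oihw', ci, self.bases)
            outs.append(F.conv2d(xi.unsqueeze(0), kernel, padding=1))
        return torch.cat(outs, dim=0)

class LightweightPreview(nn.Module):
    """Small first-stage model that predicts the combination coefficients."""
    def __init__(self, num_bases=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_bases), nn.Softmax(dim=-1))

    def forward(self, x):
        return self.net(F.interpolate(x, size=64))  # preview a downsampled input

preview, specialist = LightweightPreview(), BasisConv2d(3, 32)
x = torch.randn(2, 3, 224, 224)
coeffs = preview(x)          # (2, 8) input-dependent combination coefficients
y = specialist(x, coeffs)    # (2, 32, 224, 224) from per-example synthesized kernels
```

    In the paper, both stages are trained jointly; here the per-example loop simply makes the input-dependent kernel synthesis explicit.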
    MoViNets: Mobile Video Networks for Efficient Video Recognition
    Dan Kondratyuk
    Liangzhe Yuan
    Yandong Li
    Matthew Brown
    Boqing Gong
    CVPR 2021 (2021)
    We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to deploy on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate efficient and diverse 3D CNN architectures. Second, we introduce the Stream Buffer technique, which decouples memory from video clip duration, allowing 3D CNNs to embed arbitrary-length streaming video sequences for both training and inference with a small constant memory footprint. Third, we propose a simple ensembling technique to further improve accuracy without sacrificing efficiency. These three progressive techniques allow MoViNets to achieve state-of-the-art accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For instance, MoViNet-A5-Stream achieves the same accuracy as X3D-XL on Kinetics 600 while requiring 80% fewer FLOPs and 65% less memory. Code is available at https://github.com/tensorflow/models/tree/master/official/projects/movinet.
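    The stream buffer idea can be illustrated with a small, hypothetical causal 3D convolution that caches the last few frames of activations between clips, so an arbitrarily long video is processed in fixed-size chunks with constant memory. This is a simplified sketch under those assumptions, not the released MoViNet code.

```python
import torch
import torch.nn as nn

class StreamConv3d(nn.Module):
    """Causal 3D conv with a stream buffer carrying context across clips."""
    def __init__(self, in_ch, out_ch, kt=3, ks=3):
        super().__init__()
        self.kt = kt
        # No temporal padding here; the buffer supplies the causal left context.
        self.conv = nn.Conv3d(in_ch, out_ch, (kt, ks, ks), padding=(0, ks // 2, ks // 2))
        self.buffer = None  # cached activations: (batch, in_ch, kt - 1, H, W)

    def forward(self, clip):
        if self.buffer is None:
            b, c, _, h, w = clip.shape
            self.buffer = clip.new_zeros(b, c, self.kt - 1, h, w)
        x = torch.cat([self.buffer, clip], dim=2)        # prepend cached frames
        self.buffer = x[:, :, -(self.kt - 1):].detach()  # carry over for next clip
        return self.conv(x)

layer = StreamConv3d(3, 8)
video = torch.randn(1, 3, 16, 56, 56)                            # 16-frame stream
out_chunks = [layer(chunk) for chunk in video.split(4, dim=2)]   # 4-frame clips
print(torch.cat(out_chunks, dim=2).shape)  # (1, 8, 16, 56, 56): same length as input
```

    Peak memory now depends only on the clip size and the cached (kt - 1) frames, not on the total video length.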
    Spatially Adaptive Computation Time for Residual Networks
    Dmitry P. Vetrov
    Jonathan Huang
    Maxwell Collins
    Michael Figurnov
    Ruslan Salakhutdinov
    Yukun Zhu
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
    This paper proposes a deep learning architecture based on Residual Networks that dynamically adjusts the number of executed layers for different regions of the image. The architecture is end-to-end trainable, deterministic, and problem-agnostic, and is therefore applicable without modification to a wide range of computer vision problems such as image classification, object detection, and image segmentation. We present experimental results showing that this model improves the computational efficiency of ResNet on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the CAT2000 image saliency dataset and find that they correlate surprisingly well with human eye fixation positions.
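    A highly simplified, hypothetical sketch of per-position adaptive computation follows: each residual block predicts a halting score per spatial location, and locations whose cumulative score crosses a threshold stop receiving residual updates. A real implementation would also skip their computation; all names here are illustrative.

```python
import torch
import torch.nn as nn

class HaltingBlock(nn.Module):
    """Residual block that emits a per-position halting score."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.halt = nn.Conv2d(ch, 1, 1)  # per-position halting score

    def forward(self, x, cum_halt):
        active = (cum_halt < 1.0).float()      # (B, 1, H, W) mask of still-active positions
        # Zero out the update at halted positions (real code would gather only
        # the active positions to actually save compute).
        residual = self.body(x) * active
        cum_halt = cum_halt + torch.sigmoid(self.halt(x)) * active
        return x + residual, cum_halt

blocks = nn.ModuleList(HaltingBlock(16) for _ in range(6))
x = torch.randn(2, 16, 32, 32)
cum_halt = torch.zeros(2, 1, 32, 32)
for blk in blocks:
    x, cum_halt = blk(x, cum_halt)
# The effective number of blocks applied varies per spatial position.
print((cum_halt >= 1.0).float().mean())  # fraction of positions that halted early
```

    The map of how many blocks each position executed is the kind of "computation time map" the abstract compares against human eye fixations.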
    Soft 3D Reconstruction for View Synthesis
    Eric Penner
    ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36 (2017)
    We present a novel algorithm for view synthesis that utilizes a soft 3D reconstruction to improve quality, continuity, and robustness. Our main contribution is the formulation of a soft 3D representation that preserves depth uncertainty through each stage of 3D reconstruction and rendering. We show that this representation is beneficial throughout the view synthesis pipeline. During view synthesis, it provides a soft model of scene geometry that yields continuity across synthesized views and robustness to depth uncertainty. During 3D reconstruction, the same robust estimates of scene visibility can be applied iteratively to improve depth estimation around object edges. Our algorithm is based entirely on O(1) filters, making it conducive to acceleration, and it works with structured or unstructured sets of input views. We compare with recent classical and learning-based algorithms on plenoptic lightfields, wide baseline captures, and lightfield videos produced from camera arrays.
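    To convey the flavor of keeping depth "soft", the toy sketch below represents each pixel as a distribution over depth planes and composites colors with a soft visibility term instead of committing to a single depth. The paper's plane-sweep reconstruction and O(1) filtering are omitted, and every name here is illustrative.

```python
# Toy sketch: composite per-plane colors using a soft depth distribution.
import numpy as np

def soft_composite(colors, depth_prob):
    """colors: (D, H, W, 3) per-plane color; depth_prob: (D, H, W), sums to 1 over D (near to far)."""
    # Soft visibility of plane d = 1 minus the probability mass of nearer planes.
    occlusion = np.cumsum(depth_prob, axis=0) - depth_prob
    visibility = np.clip(1.0 - occlusion, 0.0, 1.0)
    weights = visibility * depth_prob                           # (D, H, W)
    weights /= np.maximum(weights.sum(axis=0, keepdims=True), 1e-8)
    return (weights[..., None] * colors).sum(axis=0)            # (H, W, 3)

D, H, W = 8, 4, 4
depth_logits = np.random.randn(D, H, W)
depth_prob = np.exp(depth_logits) / np.exp(depth_logits).sum(axis=0, keepdims=True)
colors = np.random.rand(D, H, W, 3)
print(soft_composite(colors, depth_prob).shape)                 # (4, 4, 3)
```

    Because no hard depth decision is made, small errors in the depth distribution degrade the rendered color gracefully rather than producing hard artifacts, which is the continuity and robustness the abstract describes.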