Zhengzhong Tu
Authored Publications
MaxViT: Multi-Axis Vision Transformer
Han Zhang
Alan Bovik
European Conference on Computer Vision (ECCV) (2022)
Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to “see” globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
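The two halves of multi-axis attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released MaxViT code: the function names, the window size of 4, and the projection-free single-head attention are assumptions made for illustration. Blocked local attention groups neighbouring pixels into windows, while dilated global attention groups pixels sampled at a fixed stride across the whole image. Each group holds a fixed number of tokens, so the total attention cost grows linearly with the number of pixels.

```python
import numpy as np

def block_partition(x, p):
    """Split (H, W, C) into non-overlapping p x p windows: (num_windows, p*p, C)."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p, c)

def grid_partition(x, g):
    """Split (H, W, C) into g x g strided (dilated) grids: (num_groups, g*g, C)."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/g, W/g, g, g, C)
    return x.reshape(-1, g * g, c)

def self_attention(tokens):
    """Single-head attention within each group, with no learned projections
    (a simplification assumed for this sketch)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# Toy 8x8 single-channel image whose pixel value equals its flat index.
image = np.arange(64, dtype=float).reshape(8, 8, 1)

local_tokens = block_partition(image, 4)   # neighbouring pixels share a group
global_tokens = grid_partition(image, 4)   # pixels two apart share a group

print(local_tokens[0, :5, 0])    # first window starts with pixels 0, 1, 2, 3, 8
print(global_tokens[0, :5, 0])   # first grid starts with pixels 0, 2, 4, 6, 16
print(self_attention(local_tokens).shape, self_attention(global_tokens).shape)
```

In the first window the tokens are adjacent pixels, whereas in the first grid they are spread over the full image, which is how the global branch reaches distant locations without attending over every pixel pair.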
MAXIM: Multi-Axis MLP for Image Processing
Han Zhang
Alan Bovik
IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
Recent progress on Transformers and MLP-like models has shown new architecture design paradigms on many computer vision tasks. However, the efficacy and efficiency of these models for low-level vision tasks have not been studied extensively. In this paper, we present MAXIM, a general image processing architecture with multi-axis gated MLPs, to advance the possibility of global operators for low-level vision. Our single-stage MAXIM backbone shares a UNet-shaped hierarchical structure and enjoys the long-range interaction brought by spatial-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks. First, we devise a multi-axis gated MLP that allows efficient and scalable spatial mixing of local and global information. Second, we propose a cross-gating block, an alternative to cross-attention, which accounts for cross-example mutual conditioning. Both modules are exclusively based on MLPs, but benefit from being both global and "fully convolutional", two desired properties for low-level vision tasks. Our extensive experimental results show that our proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks including denoising, deblurring, deraining, dehazing, and enhancement, with fewer or comparable parameters and FLOPs.
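The claim that an MLP-only module can be both global and "fully convolutional" can be made concrete with a rough NumPy sketch. It is a simplification under stated assumptions, not the MAXIM implementation: the function names are hypothetical, and the gating, normalization, and channel projections of the real block are omitted so that only the multi-axis spatial mixing remains. Half of the channels are mixed within local blocks and the other half across a strided grid. The mixing weights depend only on the block and grid sizes, so the same weights apply to any image whose height and width are divisible by those sizes.

```python
import numpy as np

def to_blocks(x, b):
    """(H, W, C) -> (num_blocks, b*b, C), grouping neighbouring pixels."""
    h, w, c = x.shape
    x = x.reshape(h // b, b, w // b, b, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, b * b, c)

def from_blocks(t, b, h, w):
    """Inverse of to_blocks."""
    c = t.shape[-1]
    x = t.reshape(h // b, w // b, b, b, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)

def to_grid(x, g):
    """(H, W, C) -> (num_groups, g*g, C), grouping strided pixels."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c).transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, g * g, c)

def from_grid(t, g, h, w):
    """Inverse of to_grid."""
    c = t.shape[-1]
    x = t.reshape(h // g, w // g, g, g, c).transpose(2, 0, 3, 1, 4)
    return x.reshape(h, w, c)

def multi_axis_mix(x, w_local, w_global, b, g):
    """Mix half the channels inside blocks and half across the grid.
    Simplified: a plain linear map over the token axis, no gating."""
    h, w, c = x.shape
    half = c // 2
    local = np.einsum("ts,nsc->ntc", w_local, to_blocks(x[..., :half], b))
    glob = np.einsum("ts,nsc->ntc", w_global, to_grid(x[..., half:], g))
    return np.concatenate(
        [from_blocks(local, b, h, w), from_grid(glob, g, h, w)], axis=-1
    )

rng = np.random.default_rng(0)
b = g = 4
w_local = rng.normal(size=(b * b, b * b))    # size fixed by the block size
w_global = rng.normal(size=(g * g, g * g))   # size fixed by the grid size

# The same weights are reused on two different resolutions.
for h, w in [(8, 8), (16, 12)]:
    x = rng.normal(size=(h, w, 6))
    print((h, w), "->", multi_axis_mix(x, w_local, w_global, b, g).shape)
```

Because the weight matrices never see the image size, the parameter count stays constant as resolution changes, which is the property the abstract refers to as fully convolutional.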