Zizhao Zhang

Authored Publications
    While remarkable progress has been made in imbalanced supervised learning, less attention has been given to imbalanced semi-supervised learning (SSL), where not only are few labeled data provided, but the underlying data distribution can be severely imbalanced. Recent work requires both complicated sampling strategies for pseudo-labeled unlabeled data and distribution alignment of the pseudo-label distribution to accommodate this imbalance. We present a novel approach that relies only on a form of distribution alignment, with no sampling strategy: rather than aligning the pseudo-labels during inference, we move the distribution alignment component into the cross-entropy loss computations for both the supervised and unsupervised losses. This alignment compensates both for imbalance in the data and for the distributional shift present during evaluation. Altogether, this provides a unified strategy that offers both significantly reduced training requirements and improved performance across low and richly labeled regimes and over varying degrees of imbalance. In experiments, we validate the efficacy of our method on SSL variants of CIFAR10-LT, CIFAR100-LT, and ImageNet-127. On ImageNet-127, our method shows a 1.6% accuracy improvement over CReST with an 80% training-time reduction and is competitive with other SOTA methods.
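The abstract does not spell out the exact form of the alignment; a minimal sketch, assuming a logit-adjustment-style correction with the empirical class prior (the names `aligned_cross_entropy` and `class_prior` are illustrative), could look like:

```python
import torch
import torch.nn.functional as F

def aligned_cross_entropy(logits, targets, class_prior):
    """Cross-entropy with a distribution-alignment term folded into the
    loss: adding the log class prior to the logits before the softmax
    biases the loss to compensate for class imbalance, instead of
    re-aligning pseudo-labels at inference time."""
    adjusted = logits + torch.log(class_prior + 1e-12)
    return F.cross_entropy(adjusted, targets)

# Hypothetical usage with a long-tailed 3-class prior.
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
prior = torch.tensor([0.7, 0.2, 0.1])
loss = aligned_cross_entropy(logits, targets, prior)
```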
    Zero-shot transfer learning for document understanding is a crucial yet under-investigated scenario for reducing the high cost of annotating document entities. We present a novel query-based framework, QueryForm, that extracts entity values from form-like documents in a zero-shot fashion. QueryForm contains a dual prompting mechanism that composes both the document schema and a specific entity type into a query, which is used to prompt a Transformer model to perform a single entity extraction task. Furthermore, we propose to leverage large-scale query-entity pairs generated from form-like webpages with weak HTML annotations to pre-train QueryForm. By unifying pre-training and fine-tuning into the same query-based framework, QueryForm enables models to learn from structured documents containing various entities and layouts, leading to better generalization to target document types without the need for target-specific training data. QueryForm sets a new state-of-the-art average F1 score on both the XFUND (+4.6%~10.1%) and the Payment (+3.2%~9.5%) zero-shot benchmarks, with a smaller model size and no additional image input.
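To make the dual prompting mechanism concrete, here is a minimal sketch of composing a schema-level and an entity-level prompt into one query; the actual QueryForm prompt format is not given in the abstract, so the template and names below are assumptions:

```python
def compose_query(schema_prompt: str, entity_type: str) -> str:
    """Illustrative dual prompt: one part describes the document schema,
    the other the specific entity type whose value should be extracted."""
    return f"schema: {schema_prompt} | entity: {entity_type}"

# Hypothetical usage: the same query format serves both webpage-based
# pre-training and zero-shot extraction on target documents.
query = compose_query("invoice", "total_amount")
model_input = f"{query} context: <serialized document text>"
```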
    Continual learning aims to enable a single model to learn a sequence of tasks without catastrophic forgetting. Top-performing methods usually require a rehearsal buffer to store past pristine examples for experience replay, which, however, limits their practical value due to privacy and memory constraints. In this work, we present a simple yet effective framework, DualPrompt, which learns a tiny set of parameters, called prompts, to properly instruct a pre-trained model to learn tasks arriving sequentially without buffering past examples. DualPrompt presents a novel approach to attach complementary prompts to the pre-trained backbone, and then formulates the objective as learning task-invariant and task-specific "instructions". With extensive experimental validation, DualPrompt consistently achieves state-of-the-art performance under the challenging class-incremental setting. In particular, DualPrompt outperforms recent advanced continual learning methods with relatively large buffer sizes. We also introduce a more challenging benchmark, Split ImageNet-R, to help generalize rehearsal-free continual learning research. Source code is available at https://github.com/google-research/l2p.
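A minimal sketch of the complementary-prompt idea, assuming a shared (task-invariant) prompt and per-task (task-specific) prompts are simply prepended to the token embeddings of a frozen backbone; the paper attaches prompts at specific layers, and all sizes and names here are illustrative:

```python
import torch
import torch.nn as nn

class DualPromptSketch(nn.Module):
    """Prepend a shared prompt and a task-specific prompt to frozen
    backbone embeddings; only the prompt parameters are trained."""
    def __init__(self, n_tasks, d_model, g_len=5, e_len=5):
        super().__init__()
        self.g_prompt = nn.Parameter(torch.randn(g_len, d_model))       # task-invariant
        self.e_prompts = nn.Parameter(torch.randn(n_tasks, e_len, d_model))  # task-specific

    def forward(self, x, task_id):
        # x: (batch, seq, d_model) token embeddings from a frozen encoder
        b = x.size(0)
        g = self.g_prompt.unsqueeze(0).expand(b, -1, -1)
        e = self.e_prompts[task_id].unsqueeze(0).expand(b, -1, -1)
        return torch.cat([g, e, x], dim=1)
```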
    The mainstream paradigm behind continual learning has been to adapt the model parameters to non-stationary data distributions, where catastrophic forgetting is the central challenge. Typical methods rely on a rehearsal buffer or known task identity at test time to retrieve learned knowledge and address forgetting, while this work presents a new paradigm for continual learning that aims to train a more succinct memory system without accessing task identity at test time. Our method learns to dynamically prompt (L2P) a pre-trained model to learn tasks sequentially under different task transitions. In our proposed framework, prompts are small learnable parameters, which are maintained in a memory space. The objective is to optimize prompts to instruct the model prediction and explicitly manage task-invariant and task-specific knowledge while maintaining model plasticity. We conduct comprehensive experiments under popular image classification benchmarks with different challenging continual learning settings, where L2P consistently outperforms prior state-of-the-art methods. Surprisingly, L2P achieves competitive results against rehearsal-based methods even without a rehearsal buffer and is directly applicable to challenging task-agnostic continual learning. Source code is available at https://github.com/google-research/l2p.
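A minimal sketch of the prompt memory space the abstract describes: a pool of learnable prompts with learnable keys, where an input-dependent query selects the closest prompts to prepend. Pool size, prompt length, and the use of a frozen backbone feature as the query are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPoolSketch(nn.Module):
    """Key-based lookup into a learnable prompt pool: the top-k prompts
    whose keys best match the query are prepended to the input tokens."""
    def __init__(self, pool_size=10, prompt_len=5, d_model=768, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, d_model))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, d_model))
        self.top_k = top_k

    def forward(self, x, query):
        # query: (batch, d_model), e.g. a frozen backbone's [CLS] feature
        sim = F.normalize(query, dim=-1) @ F.normalize(self.keys, dim=-1).T
        idx = sim.topk(self.top_k, dim=-1).indices        # (batch, k)
        picked = self.prompts[idx]                        # (batch, k, len, d)
        picked = picked.flatten(1, 2)                     # (batch, k*len, d)
        return torch.cat([picked, x], dim=1)
```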
    Hierarchical structures are popular in recent vision transformers; however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer. The benefits of the proposed judiciously-selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available at https://github.com/google-research/nested-transformer.
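A minimal sketch of the two pieces the abstract names: partitioning a feature map into non-overlapping blocks so a basic transformer can attend locally within each, and a convolution-plus-pooling block aggregation that provides the cross-block communication. Channel sizes are illustrative, and the aggregation shown is one plausible instantiation:

```python
import torch
import torch.nn as nn

def blockify(x, block):
    """Split a (B, H, W, C) feature map into non-overlapping blocks,
    returning (B * num_blocks, block*block, C) token sequences on which
    a standard local transformer can run independently per block."""
    B, H, W, C = x.shape
    x = x.view(B, H // block, block, W // block, block, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, block * block, C)
    return x

# Illustrative block aggregation between hierarchy levels: a 3x3 conv
# followed by 2x2 pooling over the (unblocked, channels-first) map,
# which mixes information across neighboring blocks before re-blocking.
aggregate = nn.Sequential(
    nn.Conv2d(96, 96, kernel_size=3, padding=1),
    nn.MaxPool2d(kernel_size=2),
)
```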
    Learning visual knowledge from massive weakly-labeled web videos has attracted growing research interest thanks to the large corpus of easily accessible video data on the Internet. However, for video action recognition, the action of interest might only exist in arbitrary clips of untrimmed web videos, resulting in high label noise in the temporal space. To address this issue, we introduce a new method for pre-training video action recognition models using queried web videos. Instead of trying to filter the noise out, we propose to convert the potential noise in these queried videos into useful supervision signals by defining the concept of Sub-Pseudo Label (SPL). Specifically, SPL spans a new set of meaningful "middle ground" labels, constructed by extrapolating the original weak labels obtained during video querying and the prior knowledge distilled from a teacher model. Consequently, SPL provides enriched supervision for video models to learn better representations. SPL is fairly simple and orthogonal to popular teacher-student self-training frameworks, with no extra training cost. We validate the effectiveness of our method on four video action recognition datasets and a weakly-labeled image dataset to study its generalization ability. Experiments show that SPL outperforms several existing pre-training strategies using pseudo-labels, and the learned representations lead to competitive results when fine-tuning on HMDB-51 and UCF-101 compared with recent pre-training methods.
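The abstract describes SPL as a "middle ground" label space built from the weak query label and a teacher's prediction; a minimal sketch, assuming the expanded space is simply the cross-product of the two (the paper's actual construction may differ):

```python
import torch

def sub_pseudo_label(weak_label, teacher_pred, num_classes):
    """Map a (weak query label, teacher prediction) pair to a single id
    in an expanded label space, so noisy clips whose content disagrees
    with the query get their own 'middle ground' classes instead of
    being filtered out."""
    return weak_label * num_classes + teacher_pred

# Hypothetical usage: queried label 2, teacher predicts class 5.
spl = sub_pseudo_label(torch.tensor(2), torch.tensor(5), num_classes=10)  # -> 25
```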
    Improved Consistency Regularization for GANs
    Zhengli Zhao
    Sameer Singh
    Honglak Lee
    Augustus Odena
    Han Zhang
    Proceedings of the AAAI Conference on Artificial Intelligence (2021)
    Recent work (Zhang et al. 2020) has increased the performance of Generative Adversarial Networks (GANs) by enforcing a consistency cost on the discriminator. We improve on this technique in several ways. We first show that consistency regularization can introduce artifacts into the GAN samples and explain how to fix this issue. We then propose several modifications to the consistency regularization procedure designed to improve its performance. We carry out extensive experiments quantifying the benefit of our improvements. For unconditional image synthesis on CIFAR-10 and CelebA, our modifications yield the best known FID scores on various GAN architectures. For conditional image synthesis on CIFAR-10, we improve the state-of-the-art FID score from 11.48 to 9.21. Finally, on ImageNet-2012, we apply our technique to the original BigGAN (Brock, Donahue, and Simonyan 2019) model and improve the FID from 6.66 to 5.38, which is the best score at that model size.
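A minimal sketch of a balanced variant of the consistency cost, penalizing the discriminator's feature-level sensitivity to augmentations of both real and generated images; the abstract does not give the exact modifications, so treat this as one plausible form with illustrative names:

```python
def balanced_consistency_loss(d_feat, real, fake, augment):
    """Consistency cost applied to BOTH real and generated images:
    `d_feat` maps images to the discriminator's penultimate features,
    and `augment` is a semantics-preserving image augmentation."""
    loss_real = (d_feat(real) - d_feat(augment(real))).pow(2).mean()
    loss_fake = (d_feat(fake) - d_feat(augment(fake))).pow(2).mean()
    return loss_real + loss_fake
```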
    PseudoSeg: Designing Pseudo Labels for Semantic Segmentation
    Yuliang Zou
    Han Zhang
    Chun-Liang Li
    Xiao Bian
    Jia-Bin Huang
    International Conference on Learning Representations (ICLR) (2021)
    Recent advances in semi-supervised learning (SSL) demonstrate that a combination of consistency regularization and pseudo-labeling can effectively improve image classification accuracy in the low-data regime. Compared to classification, semantic segmentation tasks require much more intensive labeling costs, so these tasks greatly benefit from data-efficient training methods. However, the structured outputs in segmentation pose particular difficulties (e.g., in designing pseudo-labeling and augmentation) for applying existing SSL strategies. To address this problem, we present a simple and novel re-design of pseudo-labeling to generate well-calibrated structured pseudo labels for training with unlabeled or weakly-labeled data. Our proposed pseudo-labeling strategy is agnostic to network structure and applies in a one-stage consistency training framework. We demonstrate the effectiveness of the proposed pseudo-labeling strategy in both low-data and high-data regimes. Extensive experiments validate that pseudo labels generated by wisely fusing diverse sources, together with strong data augmentation, are crucial to consistency training for segmentation. The source code is available at https://github.com/googleinterns/wss.
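The abstract emphasizes fusing diverse prediction sources into calibrated structured pseudo labels; a minimal sketch, assuming a simple average of two per-pixel class distributions followed by sharpening (the paper's actual fusion and calibration are more involved, and the names are illustrative):

```python
import torch
import torch.nn.functional as F

def fused_pseudo_label(decoder_logits, weak_logits, temperature=0.5):
    """Fuse two prediction sources (e.g., the segmentation decoder and
    a weaker localization source) into one soft pseudo label per pixel,
    then sharpen the result with a temperature."""
    p = 0.5 * F.softmax(decoder_logits, dim=1) + 0.5 * F.softmax(weak_logits, dim=1)
    p = p.pow(1.0 / temperature)
    return p / p.sum(dim=1, keepdim=True)  # (B, C, H, W) soft pseudo label
```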
    Re-weighting training samples has been shown to be an effective and practical approach to tackling data biases such as imbalance and corrupted labels. Recent methods develop learning-based algorithms that learn re-weighting strategies jointly with model training, in light of reinforcement learning and meta-learning. However, the dependence on additional unbiased reward data is a known, undesirable limitation. Furthermore, existing learning-based sample-weighting methods maintain inner and outer optimization loops for the model and weighting parameters, respectively, which makes training expensive. This paper addresses these two problems and presents a new learning-based fast sample re-weighting (FSR) method that requires no reward data. The method is based on two key ideas: a) learning from history as dictionary fetch and b) feature sharing. Without the need to construct extra reward datasets, we can easily incorporate FSR with additionally proposed task-specific components and test on label-noise-robust and long-tailed recognition benchmarks. Our experiments show the proposed method achieves results competitive with state-of-the-art methods on the respective tasks with significantly improved training efficiency. Source code will be released.
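For context, a minimal sketch of the inner objective that learning-based re-weighting methods optimize: a weighted training loss over per-sample weights. In the bilevel baseline the abstract criticizes, these weights are updated in an outer loop against held-out reward data, which is exactly the dependency FSR removes (by fetching from training history instead); the sketch below covers only the shared inner step:

```python
import torch

def reweighted_loss(per_example_loss, weights):
    """Weighted training loss with non-negative per-sample weights,
    normalized to sum to one. How `weights` are obtained is the part
    that differs between the bilevel baseline and FSR."""
    w = torch.clamp(weights, min=0)
    w = w / (w.sum() + 1e-12)
    return (w * per_example_loss).sum()
```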
    Improved Transformer for High-Resolution GANs
    Ting Chen
    Dimitris N. Metaxas
    Han Zhang
    Advances in Neural Information Processing Systems (NeurIPS) (2021)
    Attention-based models, exemplified by the Transformer, can effectively model long-range dependencies, but suffer from the quadratic complexity of the self-attention operation, making them difficult to adopt for high-resolution image generation based on Generative Adversarial Networks (GANs). In this paper, we introduce two key ingredients to the Transformer to address this challenge. First, in low-resolution stages of the generative process, standard global self-attention is replaced with the proposed multi-axis blocked self-attention, which allows efficient mixing of local and global attention. Second, in high-resolution stages, we drop self-attention while keeping only multi-layer perceptrons reminiscent of implicit neural functions. To further improve performance, we introduce an additional self-modulation component based on cross-attention. The resulting model, denoted HiT, has nearly linear computational complexity with respect to the image size and thus scales directly to synthesizing high-definition images. We show in experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet 128x128 and FFHQ 256x256, respectively, with reasonable throughput. We believe the proposed HiT is an important milestone for generators in GANs that are completely free of convolutions.
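A minimal sketch of the two token groupings behind multi-axis blocked self-attention: the same feature map is reshaped so one attention branch mixes within each local block while the other mixes across blocks at the same relative position (a dilated, global axis). The attention computation itself (softmax(QK^T)V per group) is omitted, and the shapes are illustrative:

```python
def multi_axis_tokens(x, block):
    """Return the local-axis and global-axis token groupings of a
    (B, H, W, C) feature map: `local` attends within each block,
    `global_` attends across blocks at the same in-block position."""
    B, H, W, C = x.shape
    g = x.view(B, H // block, block, W // block, block, C)
    local = g.permute(0, 1, 3, 2, 4, 5).reshape(-1, block * block, C)
    global_ = g.permute(0, 2, 4, 1, 3, 5).reshape(-1, (H // block) * (W // block), C)
    return local, global_
```

Because each group has fixed size, the cost grows nearly linearly with image size instead of quadratically, which is the property the abstract highlights.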
    Deep neural networks (DNNs) yield poorly-calibrated confidence estimates when their raw predicted posteriors are considered. Towards obtaining perfectly-calibrated confidence estimates, we propose a novel framework, named Distance-Based Learning from Errors (DBLE). DBLE is based on two fundamental principles: (i) learning a representation space where distances correspond to the relatedness of samples, and (ii) efficient feedback from training errors to accurately model distances to ground-truth centroids. For (i), we adapt prototypical learning such that pairwise distances determine the predicted posteriors during training, and related samples, ideally from the same class, are grouped together. For (ii), we propose a simple yet effective solution: relying on updates from the samples that yielded inaccurate decisions during training, with the goal of efficiently fitting a model that represents the variance of predictions in the decision manifold. On four datasets, we demonstrate that DBLE significantly outperforms alternative single-DNN approaches in confidence calibration. DBLE is on par with ensemble approaches that contain multiple DNNs, without even doubling the training time and with only a negligible increase in the number of parameters.
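A minimal sketch of principle (i), prototypical distance-based posteriors: class probabilities come from a softmax over negative squared distances to per-class centroids, so a sample far from every centroid naturally receives low confidence. Centroid estimation and DBLE's error-driven updates from (ii) are omitted:

```python
import torch
import torch.nn.functional as F

def distance_posteriors(embeddings, centroids):
    """Predicted posteriors from distances: embeddings (batch, d) and
    per-class centroids (num_classes, d) yield (batch, num_classes)
    probabilities via a softmax over negative squared distances."""
    d2 = torch.cdist(embeddings, centroids).pow(2)
    return F.softmax(-d2, dim=-1)
```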
    Consistency Regularization for Generative Adversarial Networks
    Augustus Odena
    Han Zhang
    Honglak Lee
    International Conference on Learning Representations (2020)
    Generative Adversarial Networks are plagued by training instability, despite considerable research effort. Progress has been made on this topic, but many of the proposed interventions are complicated, computationally expensive, or both. In this work, we propose a simple and effective training stabilizer based on the notion of Consistency Regularization - a popular technique in the Semi-Supervised Learning literature. In particular, we augment data passing into the GAN discriminator and penalize the sensitivity of the penultimate layer of the discriminator to these augmentations. This regularization increases the robustness of the discriminator to input perturbations and demonstrably reduces memorization of the training data. We conduct a series of ablation studies to demonstrate that consistency regularization is compatible with various GAN architectures and loss functions. Finally, we show that applying consistency regularization to GANs improves state-of-the-art FID scores on the ImageNet-2012 data set. Our code is open-sourced (URL blinded for peer review).
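A minimal sketch of the regularizer as the abstract describes it: augment the discriminator's input and penalize the squared change in its penultimate-layer output; the loss weight and function names are illustrative assumptions:

```python
def consistency_regularization(d_penultimate, x, augment, lam=10.0):
    """Consistency cost on the discriminator: `d_penultimate` maps a
    batch of images to penultimate-layer features, `augment` is a
    semantics-preserving augmentation, and the term is added to the
    usual discriminator loss with weight `lam`."""
    return lam * (d_penultimate(x) - d_penultimate(augment(x))).pow(2).mean()
```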
    Active learning (AL) combines data labeling and model training to minimize the labeling cost by prioritizing the selection of high-value data that can best improve model performance. In pool-based active learning, accessible unlabeled data are not used for model training in most conventional methods. Here, we propose to unify unlabeled sample selection and model training towards minimizing labeling cost, and make two contributions towards that end. First, we exploit both labeled and unlabeled data using semi-supervised learning (SSL) to distill information from unlabeled data during the training stage. Second, we propose a consistency-based sample selection metric that is coherent with the training objective, such that the selected samples are effective at improving model performance. We conduct extensive experiments on image classification tasks. The experimental results on CIFAR-10, CIFAR-100 and ImageNet demonstrate the superior performance of our proposed method with limited labeled data, compared to the existing methods and the alternative AL and SSL combinations. We also study an important yet under-explored problem - "When can we start learning-based AL selection?". We propose a measure that is empirically correlated with the AL target loss and is potentially useful for determining the proper starting point of learning-based AL methods.
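A minimal sketch of a consistency-based acquisition score in the spirit of the abstract: score each unlabeled sample by the variance of the model's predictions across augmented views and label the most inconsistent ones; the paper's exact metric may differ, and all names here are illustrative:

```python
import torch
import torch.nn.functional as F

def selection_score(model, x, augment, n_views=4):
    """Higher score = more inconsistent predictions under augmentation,
    hence a better candidate for labeling. `x` is a batch of unlabeled
    images and `augment` a stochastic augmentation policy."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(augment(x)), dim=-1)
                             for _ in range(n_views)])
    return probs.var(dim=0).sum(dim=-1)  # one score per sample
```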
    Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model's performance. This domain has seen fast progress recently, at the cost of requiring more complex methods. In this paper we propose FixMatch, an algorithm that is a significant simplification of existing SSL methods. FixMatch first generates pseudo-labels using the model's predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image. Despite its simplicity, we show that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks, including 94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 labels - just 4 labels per class. We carry out an extensive ablation study to tease apart the experimental factors that are most important to FixMatch's success.
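The algorithm is simple enough to sketch directly from the abstract; a minimal version of the unlabeled-data objective, assuming `weak_aug` and `strong_aug` are the two augmentation policies and the confidence threshold is the standard 0.95:

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x_unlabeled, weak_aug, strong_aug,
                            threshold=0.95):
    """Pseudo-label each weakly-augmented image with the model's own
    argmax prediction, keep it only when confidence exceeds the
    threshold, and train the model to predict that label on a
    strongly-augmented view of the same image."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()
    logits_s = model(strong_aug(x_unlabeled))
    per_ex = F.cross_entropy(logits_s, pseudo, reduction="none")
    return (mask * per_ex).mean()
```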
    Image Augmentations for GAN Training
    Zhengli Zhao
    Ting Chen
    Sameer Singh
    Han Zhang
    arXiv preprint arXiv:2006.02595 (2020)
    Data augmentation has been widely studied to improve the accuracy and robustness of classifiers. However, few papers have thoroughly investigated the potential of image augmentation for improving GAN models for image synthesis. In this work, we systematically study the effectiveness of various existing augmentation techniques for GAN training in multiple settings. We provide insights and guidelines on how to augment images for both vanilla GANs and GANs with regularizations, substantially improving the fidelity of the generated images. Surprisingly, we find that merely augmenting real and generated images for GANs can result in generation quality on par with recent state-of-the-art results. We further compare this with commonly used regularization methods in which augmentations are the essential component. We observe that adding regularization on top of augmentation consistently improves image quality. We also achieve new state-of-the-art results for conditional generation on CIFAR-10 with consistency loss and contrastive loss as additional regularizations.
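A minimal sketch of the simplest recipe studied, applying the same augmentation to both real and generated images before they reach the discriminator; `loss_fn` stands for any standard GAN discriminator loss, and all names are illustrative:

```python
def discriminator_loss(d, g, real, z, augment, loss_fn):
    """Augment real and generated batches identically before the
    discriminator sees them; `loss_fn` takes (real_logits, fake_logits)."""
    fake = g(z).detach()
    return loss_fn(d(augment(real)), d(augment(fake)))
```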
    Collecting large-scale data with clean labels for supervised training of neural networks is practically challenging. Although noisy labels are usually cheap to acquire, existing methods suffer substantially from label noise. This paper targets the challenge of robust training in high label-noise regimes. The key insight to achieving this goal is to wisely leverage a small trusted set to estimate exemplar weights and pseudo labels for noisy data, so that they can be reused for supervised training. We present a holistic framework to train deep neural networks in a way that is highly invulnerable to label noise. Our method sets the new state of the art on various types of label noise and achieves leading performance on large-scale datasets with real-world label noise. For instance, on CIFAR100 with a 40% uniform noise ratio and only 10 trusted labeled data per class, our method achieves 80.2% classification accuracy, where the error rate is only 1.4% higher than that of a neural network trained without label noise. Moreover, when the noise ratio increases to 80%, our method still maintains a high accuracy of 75.5%, compared to the previous best accuracy of 48.2%.
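The abstract's key insight is reusing noisy data via estimated pseudo labels and exemplar weights; a minimal sketch of one plausible pseudo-label construction, assuming a per-sample trust score blends the given label with the model's prediction (estimating that score with the trusted set is the core of the paper and is not reproduced here):

```python
def noisy_label_target(noisy_onehot, model_probs, trust):
    """Blend each sample's (possibly wrong) given label with the model's
    prediction: `trust` in [0, 1] would be estimated per sample with the
    help of the small trusted set, high for likely-clean labels."""
    t = trust.unsqueeze(-1)                       # (batch, 1)
    return t * noisy_onehot + (1 - t) * model_probs
```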