Zizhao Zhang
Research Areas
Authored Publications
Sort By
CodecLM: Aligning Language Models with Tailored Synthetic Data
Chun-Liang Li
Jin Miao
NAACL 2024
Preview abstract
Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor and time cost to collect or annotate data by humans, researchers start to explore the use of LLMs to generate instruction-aligned synthetic data. Recent works focus on generating diverse instructions and applying LLM to increase instruction complexity, often neglecting downstream use cases. It remains unclear how to tailor high-quality data to elicit better instruction-following abilities in different target instruction distributions and LLMs. To this end, we introduce CodecLM, a general framework for adaptively generating high-quality synthetic data for LLM alignment with different downstream instruction distributions and LLMs. Drawing on the Encode-Decode principles, we use LLMs as codecs to guide the data generation process. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution, and then decode metadata to create tailored instructions. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples. Extensive experiments on four open-domain instruction following benchmarks validate the effectiveness of CodecLM over the current state-of-the-arts.
View details
QueryForm: A Simple Zero-shot Form Entity Query Framework
Jacob Devlin
Hao Zhang
Jennifer Dy
ACL (2023)
Preview abstract
Zero-shot transfer learning for document understanding is a crucial yet under-investigated scenario to help reduce the high cost involved in annotating document entities. We present a novel query-based framework, QueryForm, that extracts entity values from form-like documents in a zero-shot fashion. QueryForm contains a dual prompting mechanism that composes both the document schema and a specific entity type into a query, which is used to prompt a Transformer model to perform a single entity extraction task. Furthermore, we propose to leverage large-scale query-entity pairs generated from form-like webpages with weak HTML annotations to pre-train QueryForm. By unifying pre-training and fine-tuning into the same query-based framework, QueryForm enables models to learn from structured documents containing various entities and layouts, leading to better generalization to target document types without the need for target-specific training data. QueryForm sets new state-of-the-art average F1 score on both the XFUND (+4.6%~10.1%) and the Payment (+3.2%~9.5%) zero-shot benchmark, with a smaller model size and no additional image input.
View details
Preview abstract
While remarkable progress has been made in imbalanced supervised learning, less attention has been given to the setting of imbalanced semi-supervised learning (SSL) where not only are few labeled data provided, but the underlying data distribution can be severely imbalanced. Recent work requires both complicated sampling strategies of pseudo-labeled unlabeled data and distribution alignment of the pseudo-label distribution to accommodate this imbalance. We present a novel approach that relies only on a form of a distribution alignment but no sampling strategy where rather than aligning the pseudo-labels during inference, we move the distribution alignment component into the respective cross entropy loss computations for both the supervised and unsupervised losses. This alignment compensates for both imbalance in the data and the eventual distributional shift present during evaluation. Altogether, this provides a unified strategy that offers both significantly reduced training requirements and improved performance across both low and richly labeled regimes and over varying degrees of imbalance. In experiments, we validate the efficacy of our method on SSL variants of CIFAR10-LT, CIFAR100-LT, and ImageNet-127. On ImageNet-127, our method shows 1.6% accuracy improvement over CReST with an 80% training time reduction and is competitive with other SOTA methods.
View details
Learning Sub-Pseudo Labels from Weakly-Annotated Web Data for Video Action Recognition
Kunpeng Li
Guanhang Wu
Xuehan Xiong
Zhichao Lu
Yun Fu
AAAI 2022
Preview abstract
Learning visual knowledge from massive weakly-labeled web videos has attracted growing research interests thanks
to the large corpus of easily accessible video data on the Internet. However, for video action recognition, the action of interest might only exist in arbitrary clips of untrimmed web
videos, resulting in high label noises in the temporal space. To address this issue, we introduce a new method for pretraining video action recognition models using queried web
videos. Instead of trying to filter out, we propose to convert the potential noises in these queried videos to useful supervision signals by defining the concept of Sub-Pseudo
Label (SPL). Specifically, SPL spans out a new set of meaningful “middle ground” label space constructed by extrapolating the original weak labels during video querying and
the prior knowledge distilled from a teacher model. Consequently, SPL provides enriched supervision for video models to learn better representations. SPL is fairly simple and
orthogonal to popular teacher-student self-training frameworks without extra training cost. We validate the effectiveness of our method on four video action recognition
datasets and a weakly-labeled image dataset to study the generalization ability. Experiments show that SPL outperforms several existing pre-training strategies using pseudolabels and the learned representations lead to competitive results when fine-tuning on HMDB-51 and UCF-101 compared with recent pre-training methods
View details
Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding
Han Zhang
Ting Chen
AAAI Conference on Artificial Intelligence (AAAI), 2022
Preview abstract
Hierarchical structures are popular in recent vision transformers, however, they require sophisticated designs and massive datasets to work well.
In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way.
We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication.
This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer.
The benefits of the proposed judiciously-selected design are threefold:
(1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR;
(2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators; and
(3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available https://github.com/google-research/nested-transformer.
View details
Learning to prompt for continual learning
Han Zhang
Xiaoqi Ren
Jennifer Dy
CVPR2022
Preview abstract
The mainstream paradigm behind continual learning has been to adapt the model parameters to non-stationary data distributions, where catastrophic forgetting is the central challenge. Typical methods rely on a rehearsal buffer or known task identity at test time to retrieve learned knowledge and address forgetting, while this work presents a new paradigm for continual learning that aims to train a more succinct memory system without accessing task identity at test time. Our method learns to dynamically prompt (L2P) a pre-trained model to learn tasks sequentially under different task transitions. In our proposed framework, prompts are small learnable parameters, which are maintained in a memory space. The objective is to optimize prompts to instruct the model prediction and explicitly manage task-invariant and task-specific knowledge while maintaining model plasticity. We conduct comprehensive experiments under popular image classification benchmarks with different challenging continual learning settings, where L2P consistently outperforms prior state-ofthe-art methods. Surprisingly, L2P achieves competitive results against rehearsal-based methods even without a rehearsal buffer and is directly applicable to challenging taskagnostic continual learning. Source code is available at https://github.com/google-research/l2p.
View details
DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning
Han Zhang
Xiaoqi Ren
Jennifer Dy
ECCV 2022
Preview abstract
Continual learning aims to enable a single model to learn a sequence of tasks without catastrophic forgetting. Top-performing methods usually require a rehearsal buffer to store past pristine examples for experience replay, which, however, limits their practical value due to privacy and memory constraints. In this work, we present a simple yet effective framework, DualPrompt, which learns a tiny set of parameters, called prompts, to properly instruct a pre-trained model to learn tasks arriving sequentially without buffering past examples. DualPrompt presents a novel approach to attach complementary prompts to the pre-trained backbone, and then formulates the objective as learning task-invariant and task-specific "instructions". With extensive experimental validation, DualPrompt consistently sets state-of-the-art performance under the challenging class-incremental setting. In particular, DualPrompt outperforms recent advanced continual learning methods with relatively large buffer sizes. We also introduce a more challenging benchmark, Split ImageNet-R, to help generalize rehearsal-free continual learning research. Source code is available at https://github.com/google-research/l2p.
View details
Preview abstract
Re-weighting training samples has been shown as an effective and practical approach to tackle data biases such as imbalance and corrupted labels. Recent methods develop learning based algorithms
to learn re-weighting strategies jointly with model training, in light of reinforcement learning and meta learning. However, the dependence of additional unbiased reward data is known an undesirable limitation. Furthermore, existing learning based sample weighting methods maintain inner and outer optimization for model and weighting parameters, respectively, such that training requires expensive optimization. This paper aims to address these two problems and presents a new learning based fast sample re-weighting (FSR) method without reward data. The method is based on two key ideas, a) learning from history as dictionary fetch and b) feature sharing. Without the dependence of constructing extra reward datasets, we can easily incorporate FSR with additionally proposed task-specific components and test on label noise robust and long-tailed recognition benchmarks. Our experiments show the proposed method achieves competitive results to state-of-the-art methods in respective tasks and significantly improved training efficiency. Source code will be released.
View details
Improved Consistency Regularization for GANs
Zhengli Zhao
Sameer Singh
Honglak Lee
Augustus Odena
Han Zhang
Proceedings of the AAAI Conference on Artificial Intelligence (2021)
Preview abstract
Recent work (Zhang et al. 2020) has increased the performance of Generative Adversarial Networks (GANs) by enforcing a consistency cost on the discriminator. We improve on this technique in several ways. We first show that consistency regularization can introduce artifacts into the GAN samples and explain how to fix this issue. We then propose several modifications to the consistency regularization procedure designed to improve its performance. We carry out extensive experiments quantifying the benefit of our improvements. For unconditional image synthesis on CIFAR-10 and CelebA, our modifications yield the best known FID scores on various GAN architectures. For conditional image synthesis on CIFAR-10, we improve the state-of-the-art FID score from 11.48 to 9.21. Finally, on ImageNet-2012, we apply our technique to the original BigGAN (Brock, Donahue, and Simonyan 2019) model and improve the FID from 6.66 to 5.38, which is the best score at that model size.
View details
Improved Transformer for High-Resolution GANs
Ting Chen
Dimitris N. Metaxas
Han Zhang
Advances in Neural Information Processing Systems (NeurIPS) (2021)
Preview abstract
Attention-based models, exemplified by the Transformer, can effectively model long range dependency, but suffer from the quadratic complexity of self-attention operation, making them difficult to be adopted for high-resolution image generation based on Generative Adversarial Networks (GANs). In this paper, we introduce two key ingredients to Transformer to address this challenge. First, in low-resolution stages of the generative process, standard global self-attention is replaced with the proposed multi-axis blocked self-attention which allows efficient mixing of local and global attention. Second, in high-resolution stages, we drop self-attention while only keeping multi-layer perceptrons reminiscent of the implicit neural function. To further improve the performance, we introduce an additional self-modulation component based on cross-attention. The resulting model, denoted as HiT, has a nearly linear computational complexity with respect to the image size and thus directly scales to synthesizing high definition images. We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet 128x128 and FFHQ 256x256, respectively, with a reasonable throughput. We believe the proposed HiT is an important milestone for generators in GANs which are completely free of convolutions.
View details