Jump to Content
Kihyuk Sohn

Kihyuk Sohn

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    VideoPoet: A Large Language Model for Zero-Shot Video Generation
    Lijun Yu
    Xiuye Gu
    Rachel Hornung
    Hassan Akbari
    Ming-Chang Chiu
    Josh Dillon
    Agrim Gupta
    Meera Hahn
    Anja Hauth
    David Hendon
    Alonso Martinez
    Grant Schindler
    Huisheng Wang
    Jimmy Yan
    Xuan Yang
    Lu Jiang
    arxiv Preprint (2023) (to appear)
    Preview abstract We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/ View details
    Preview abstract Semi-supervised anomaly detection is a common problem, as often the datasets containing anomalies are partially labeled. We propose a canonical framework: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling (SPADE) that isn't limited by the assumption that labeled and unlabeled data come from the same distribution. Indeed, the assumption is often violated in many applications -- for example, the labeled data may contain only anomalies unlike unlabeled data, or unlabeled data may contain different types of anomalies, or labeled data may contain only `easy-to-label' samples. SPADE utilizes an ensemble of one class classifiers as the pseudo-labeler to improve the robustness of pseudo-labeling with distribution mismatch. Partial matching is proposed to automatically select the critical hyper-parameters for pseudo-labeling without validation data, which is crucial with limited labeled data. SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios with distribution mismatch in both tabular and image domains. In some common real-world settings such as model facing new types of unlabeled anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC in average. View details
    Preview abstract Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens to the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompt to the image token sequence, and introduce a new prompt design for our task. We study on a variety of visual domains, including visual task adaptation benchmark, with varying amount of training images, and show effectiveness of knowledge transfer and a significantly better image generation quality over existing works. View details
    Preview abstract Vision-language contrastive learning suggests a new learning paradigm by leveraging a large amount of image-caption-pair data. The caption supervision excels at providing wide coverage in vocabulary that enables strong zero-shot image recognition performance. On the other hand, label supervision offers to learn more targeted visual representations that are label-oriented and can cover rare categories. To gain the complementary advantages of both kinds of supervision for contrastive image-caption pre-training, recent works have proposed to convert class labels into a sentence with pre-defined templates called prompts. However, a naive unification of the real caption and the prompt sentences could lead to a complication in learning, as the distribution shift in text may not be handled properly in the language encoder. In this work, we propose a simple yet effective approach to unify these two types of supervision using prefix tokens that inform a language encoder of the type of the input sentence (e.g., caption or prompt) at training time. Our method is generic and can be easily integrated into existing VL pre-training objectives such as CLIP or UniCL. In experiments, we show that this simple technique dramatically improves the performance in zero-shot image recognition accuracy of the pre-trained model. View details
    Preview abstract While remarkable progress has been made in imbalanced supervised learning, less attention has been given to the setting of imbalanced semi-supervised learning (SSL) where not only are few labeled data provided, but the underlying data distribution can be severely imbalanced. Recent work requires both complicated sampling strategies of pseudo-labeled unlabeled data and distribution alignment of the pseudo-label distribution to accommodate this imbalance. We present a novel approach that relies only on a form of a distribution alignment but no sampling strategy where rather than aligning the pseudo-labels during inference, we move the distribution alignment component into the respective cross entropy loss computations for both the supervised and unsupervised losses. This alignment compensates for both imbalance in the data and the eventual distributional shift present during evaluation. Altogether, this provides a unified strategy that offers both significantly reduced training requirements and improved performance across both low and richly labeled regimes and over varying degrees of imbalance. In experiments, we validate the efficacy of our method on SSL variants of CIFAR10-LT, CIFAR100-LT, and ImageNet-127. On ImageNet-127, our method shows 1.6% accuracy improvement over CReST with an 80% training time reduction and is competitive with other SOTA methods. View details
    Preview abstract Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, a masked image generation method that allows spatial conditioning of the generation result, using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked image generator, requires no model training or paired supervision, and works with input sketches of different levels of abstraction. We propose a novel parallel sampling scheme that leverages the structural information encoded in the intermediate self-attention maps of a masked generative transformer, such as scene layout and object shape. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as generic image-to-image translation approaches. View details
    Preview abstract Despite the remarkable progress in deep generative models, synthesizing highresolution and temporally coherent videos still remains a challenge due to their highdimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation inefficiency for generation and thus limit the scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion model (PVDM), a probabilistic diffusion model which learns a video distribution in a low-dimensional latent space. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video as 2D-shaped latent vectors that factorize the complex cubic structure of video pixels and (b) a diffusion model architecture specialized for our new factorized latent space and the training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on various benchmarks demonstrate the effectiveness of PVDM compared with previous video generation methods; e.g., PVDM obtains the FVD score of 548.1 on UCF-101, a 61.7% improved result compared with 1431.0 of the prior state-of-the-art. View details
    Preview abstract The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend the mask language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size. View details
    Preview abstract This paper introduces a Masked Generative Video Transformer, named MAGVIT, for multi-task video generation. We train a single MAGVIT model and apply it to multiple video generation tasks at inference time. To this end, two new designs are proposed: an improved 3D tokenizer model to quantize a video into spatial-temporal visual tokens, and a novel technique to embed conditions inside the mask to facilitate multi-task training. We conduct extensive experiments to demonstrate the compelling quality, efficiency, and flexibility of the proposed model. First, MAGVIT radically improves the previous best fidelity on two video generation tasks. In terms of efficiency, MAGVIT offers leading video generation speed at inference time, which is estimated to be one or two orders-of-magnitudes faster than other models. As for flexibility, we verified that a single trained MAGVIT is able to generically perform 8+ tasks at several video benchmarks from drastically different visual domains. We will open source our framework and models. View details
    Preview abstract We study anomaly clustering, grouping data into coherent clusters of anomaly types. This is different from anomaly detection that aims to divide anomalies from normal data.Unlike object-centered image clustering, anomaly clustering is particularly challenging as anomalous patterns are subtle and local. We present a simple yet effective clustering framework using a patch-based pretrained deep embeddings and off-the-shelf clustering methods. We define a distance function between images, each of which is represented as a bag of embeddings, by the Euclidean distance between weighted averaged embeddings. The weight defines the importance of instances (i.e., patch embeddings) in the bag, which may highlight defective regions. We compute weights in an unsupervised way or in a semi-supervised way when labeled normal data is available. Extensive experimental studies show the effectiveness of the proposed clustering framework along with a novel distance function upon existing multiple instance or deep clustering frameworks. Overall, our framework achieves 0.451 and 0.674 normalized mutual information scores on MVTec object and texture categories and further improve with a few labeled normal data(0.577, 0.669), far exceeding the baselines (0.244, 0.273)or state-of-the-art deep clustering methods (0.176, 0.277). View details
    Preview abstract In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmark, CIRR and Fashion-IQ. View details
    Preview abstract We extend semi-supervised learning to the problem of domain adaptation to learn significantly higher-accuracy models that train on one data distribution and test on a different one. With the goal of generality, we introduce AdaMatch, a method that unifies the tasks of unsupervised domain adaptation (UDA), semi-supervised learning (SSL), and semi-supervised domain adaptation (SSDA). In an extensive experimental study, we compare its behavior with respective state-of-the-art techniques from SSL, SSDA, and UDA on vision classification tasks. We find AdaMatch either matches or significantly exceeds the state-of-the-art in each case using the same hyper-parameters regardless of the dataset or task. For example, AdaMatch nearly doubles the accuracy compared to that of the prior state-of-the-art on the UDA task for DomainNet and even exceeds the accuracy of the prior state-of-the-art obtained with pre-training by 6.4% when AdaMatch is trained completely from scratch. Furthermore, by providing AdaMatch with just one labeled example per class from the target domain (i.e., the SSDA setting), we increase the target accuracy by an additional 6.1%, and with 5 labeled examples, by 13.6%. View details
    Preview abstract Anomaly detection (AD), separating anomalies from normal data, has many applications across domains, from security to healthcare. While most previous works were shown to be effective for cases with fully or partially labeled data, that setting is in practice less common due to labeling being particularly tedious for this task. In this paper, we focus on fully unsupervised AD, in which the entire training dataset, containing both normal and anomalous samples, is unlabeled. To tackle this problem effectively, we propose to improve the robustness of one-class classification trained on self-supervised representations using a data refinement process. Our proposed data refinement approach is based on an ensemble of one-class classifiers (OCCs), each of which is trained on a disjoint subset of training data. Representations learned by self-supervised learning on the refined data are iteratively updated as the data refinement improves. We demonstrate our method on various unsupervised AD tasks with image and tabular data. With a 10% anomaly ratio on CIFAR-10 image data / 2.5% anomaly ratio on Thyroid tabular data, the proposed method outperforms the state-of-the-art one-class classifier by 6.3 AUC and 12.5 average precision / 22.9 F1-score. View details
    Preview abstract We present a two-stage framework for deep one-class classification, where in the first stage, we learn self-supervised deep representations from one-class data, and then we build a classifier using generative or discriminative models on learned representations. In particular, we present a novel distribution-augmented contrastive learning by extending training distributions via data augmentation to obstruct the uniformity of vanilla contrastive representations, yielding more suitable representations for one-class classification. Moreover, we argue that classifiers inspired by the statistical perspective in generative or discriminative ways are more effective than existing approaches, such as an average of normality scores from a surrogate classifier. In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks. Not only learning a better representation, the proposed framework permits building one-class classifiers more faithful to the target task. Finally, we present visual explanations, confirming that the decision making process of our deep one-class classifier is human-intuitive. View details
    i-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning
    Kibok Lee
    Yian Zhu
    Chun-Liang Li
    Jinwoo Shin
    Honglak Lee
    ICLR 2021 (to appear)
    Preview abstract Contrastive representation learning has shown to be effective to learn representations from unlabeled data. However, much progress has been made in vision domains relying on data augmentations carefully designed using domain knowledge. In this work, we propose i-Mix, a simple yet effective domain-agnostic regularization strategy for improving contrastive representation learning. We cast contrastive learning as training a non-parametric classifier by assigning a unique virtual class to each data in a batch. Then, data instances are mixed in both the input and virtual label spaces, providing more augmented data during training. In experiments, we demonstrate that i-Mix consistently improves the quality of learned representations across domains, including image, speech, and tabular data. Furthermore, we confirm its regularization effect via extensive ablation studies across model and dataset sizes. The code is available at https://github.com/kibok90/imix. View details
    Preview abstract We propose a novel training method that integrates rules into deep learning, in a way the strengths of the rules are controllable at inference. Deep Neural Networks with Controllable Rule Representations (DeepCTRL) incorporates a rule encoder into the model coupled with a rule-based objective, enabling a shared representation for decision making. DeepCTRL is agnostic to data type and model architecture. It can be applied to any kind of rule defined for inputs and outputs. The key aspect of DeepCTRL is that it does not require retraining to adapt the rule strength -- at inference, the user can adjust it based on the desired operation point on accuracy vs. rule verification ratio. In real-world domains where incorporating rules is critical -- such as Physics, Retail and Healthcare -- we show the effectiveness of DeepCTRL in teaching rules for deep learning. DeepCTRL improves the trust and reliability of the trained models by significantly increasing their rule verification ratio, while also providing accuracy gains at downstream tasks. Additionally, DeepCTRL enables novel use cases such as hypothesis testing of the rules on data samples, and unsupervised adaptation based on shared rules between datasets. View details
    Preview abstract In this work, we aim at constructing a high performance model for defect detection that detects unknown anomalous patterns of an image without anomalous data. To this end, we propose a simple two-stage framework for building anomaly detectors using normal training data only, where we first learn self-supervised deep representations and then build a generative one-class classifier on learned representations. We learn representations by classifying normal data from the CutPaste, a simple data augmentation strategy that cuts an image patch and pastes at random location of a large image. Our empirical study on MVTec anomaly detection database demonstrates the proposed algorithm is general to detecting various types of real-world defects. We bring the improvement upon previous arts by 3 AUCs when learning representations from scratch. By transfer learning representations from an ImageNet pretrained model, we achieve a new state-of-the-art 96.6 AUC. Lastly, we extend the framework to learn and extract representations from patches to allow localization of defective areas without the need of annotation. View details
    Preview abstract Semi-supervised Learning (SSL) on unbalanced data has been under-studied. In this work, we observe standard SSL algorithms are biased towards majority classes and produces low recall on minority classes. However, they can generate highly accurate pseudo-labels on minority classes that are not fully utilized yet. This motivates us to propose a simple yet effective algorithm, named Class-Rebalancing Self-Training (CReST), to improve existing SSL algorithms on unbalanced data. The proposed CReST algorithm iteratively retrains a baseline SSL model with a dynamic labeled set expanded by adding pseudo-labeled samples from unlabeled set, where pseudo-labeled samples from minority classes are added more frequently based on the estimated class distribution. The SSL model is also equipped with an adaptive distribution alignment strategy. We show that CReST improves the state-of-the-art FixMatch on various unbalanced datasets by as much as 11.8%, and outperforms other popular rebalancing algorithms consistently. CReST is an easy-to-use component that can be plugged into any SSL algorithms to improve their capability of handling data imbalance issues. View details
    Preview abstract We improve the recently-proposed ``MixMatch'' semi-supervised learning algorithm by introducing two new techniques: distribution alignment and augmentation anchoring. Distribution alignment encourages the marginal distribution of predictions on unlabeled data to be close to the marginal distribution of groundtruth labels. Augmentation anchoring feeds multiple strongly augmented versions of an input into the model and encourages each output to be close to the prediction for a weakly-augmented version of the same input. To produce strong augmentations, we propose a variant of AutoAugment which learns the augmentation policy while the model is being trained. Our new algorithm, dubbed ReMixMatch, is significantly more data-efficient than prior work, requiring between 5x and 16x less data to reach the same accuracy. For example, on CIFAR10 with 250 labeled examples we reach 93.73% accuracy (compared to MixMatch’s accuracy of 93.58% with 4,000 examples) and a median accuracy of 84.92% with just four labels per class. We make our code and data open-source at https://github.com/google-research/remixmatch. View details
    Preview abstract Semi-supervised learning (SSL) provides an effective means of leveraging unlabeled data to improve a model’s performance. This domain has seen fast progress recently, at the cost of requiring more complex methods. In this paper we proposeFixMatch, an algorithm that is a significant simplification of existing SSL methods.FixMatch first generates pseudo-labels using the model’s predictions on weakly-augmented unlabeled images. For a given image, the pseudo-label is only retained if the model produces a high-confidence prediction. The model is then trained to predict the pseudo-label when fed a strongly-augmented version of the same image. Despite its simplicity, we show that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks, including 94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 – just 4 labels per class. We carry out an extensive ablation study to tease apart the experimental factors that are most important to FixMatch’s success View details
    No Results Found