David Minnen
Authored Publications
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Nitesh Bharadwaj Gundavarapu
Luca Versari
Kihyuk Sohn
Agrim Gupta
Xiuye Gu
Alex Hauptmann
Boqing Gong
Lu Jiang
ICLR (2024)
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Xiuye Gu
Jonathan Huang
Grant Schindler
Rachel Hornung
Vighnesh Birodkar
Jimmy Yan
Ming-Chang Chiu
Hassan Akbari
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Kihyuk Sohn
Xuan Yang
Huisheng Wang
Lu Jiang
ICML (2024)
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
The rate-distortion performance of neural image compression models has exceeded the state-of-the-art of non-learned codecs, but neural codecs are still far from widespread deployment and adoption. The largest obstacle is having efficient models that are feasible on a wide variety of consumer hardware. Comparative research and evaluation is difficult because of the lack of standard benchmarking platforms and because of variations in hardware architectures and test environments. Through our rate-distortion-computation (RDC) study we demonstrate that neither floating-point operations (FLOPs) nor runtime are sufficient on their own to accurately rank neural compression methods. We also explore the RDC frontier, which leads to a family of model architectures with the best empirical trade-off between computational requirements and RD performance. Finally, we identify a novel neural compression architecture that yields state-of-the-art RD performance with rate savings of 23.1% over BPG (7.0% over VTM and 3.0% over ELIC) without requiring significantly more FLOPs than other learning-based codecs.
VCT: A Video Compression Transformer
Sung Jin Hwang
NeurIPS (2022)
We show how transformers can be used to vastly simplify neural video compression. Previous methods have been relying on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression data sets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research.
Neural Video Compression using GANs for Detail Synthesis and Propagation
European Conference on Computer Vision (2022)
We present the first neural video compression method based on generative adversarial networks (GANs). Our approach significantly outperforms previous neural and non-neural video compression methods in a user study, setting a new state-of-the-art in visual quality for neural methods. We show that the GAN loss is crucial to obtain this high visual quality. Two components make the GAN loss effective: we i) synthesize detail by conditioning the generator on a latent extracted from the warped previous reconstruction to then ii) propagate this detail with high-quality flow. We find that user studies are required to compare methods, i.e., none of our quantitative metrics were able to predict all studies. We present the network design choices in detail, and ablate them with user studies.
Nonlinear Transform Coding
Philip A. Chou
Sung Jin Hwang
IEEE Trans. on Special Topics in Signal Processing, 15 (2021) (to appear)
We review a class of methods that can be collected under the name nonlinear transform coding (NTC), which over the past few years have become competitive with the best linear transform codecs for images, and have superseded them in terms of rate–distortion performance under established perceptual quality metrics such as MS-SSIM. We assess the empirical rate–distortion performance of NTC with the help of simple example sources, for which the optimal performance of a vector quantizer is easier to estimate than with natural data sources. To this end, we introduce a novel variant of entropy-constrained vector quantization. We provide an analysis of various forms of stochastic optimization techniques for NTC models; review architectures of transforms based on artificial neural networks, as well as learned entropy models; and provide a direct comparison of a number of methods to parameterize the rate–distortion trade-off of nonlinear transforms, introducing a simplified one.
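The entropy-constrained vector quantization mentioned above can be illustrated with a minimal sketch of the classic formulation, not the novel variant the paper introduces. It assumes a squared-error distortion and a cost of the form distortion + lam * rate; the function name `ecvq_assign`, the toy codebook, and the probabilities are all hypothetical.

```python
import math

def ecvq_assign(x, codebook, probs, lam):
    """Pick the codebook index minimizing distortion + lam * rate.

    Classic ECVQ rule (an assumption here, not the paper's variant):
    distortion is squared error, rate is the ideal code length -log2(p).
    """
    best_index, best_cost = -1, math.inf
    for i, (c, p) in enumerate(zip(codebook, probs)):
        distortion = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        rate = -math.log2(p)  # bits needed to signal index i
        cost = distortion + lam * rate
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

# Toy setup: two code vectors, the first far more probable.
codebook = [(0.0, 0.0), (1.0, 1.0)]
probs = [0.9, 0.1]
x = (0.6, 0.6)

nearest = ecvq_assign(x, codebook, probs, lam=0.0)  # pure nearest neighbour
biased = ecvq_assign(x, codebook, probs, lam=1.0)   # rate term included
```

With `lam=0.0` the input goes to the nearest code vector; with a nonzero `lam` the cheaper-to-code vector can win even though it is farther away, which is the trade-off the rate term encodes.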
Denoising-based Image Compression for Connectomics
Alex Shapson-Coe
Richard L. Schalek
Jeff W. Lichtman
bioRxiv (2021)
Connectomic reconstruction of neural circuits relies on nanometer resolution microscopy which produces on the order of a petabyte of imagery for each cubic millimeter of brain tissue. The cost of storing such data is a significant barrier to broadening the use of connectomic approaches and scaling to even larger volumes. We present an image compression approach that uses machine learning-based denoising and standard image codecs to compress raw electron microscopy imagery of neuropil up to 17-fold with negligible loss of reconstruction accuracy.
Scale-Space Flow for End-to-End Optimized Video Compression
Sung Jin Hwang
2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)
Despite considerable progress on end-to-end optimized deep networks for image compression, video coding remains a challenging task. Recently proposed methods for learned video compression use optical flow and bilinear warping for motion compensation and show competitive rate-distortion performance relative to hand-engineered codecs like H.264 and HEVC. However, these learning-based methods rely on complex architectures and training schemes including the use of pre-trained optical flow networks, sequential training of sub-networks, adaptive rate control, and buffering intermediate reconstructions to disk during training. In this paper, we show that a generalized warping operator that better handles common failure cases, e.g., disocclusions and fast motion, can provide competitive compression results with a greatly simplified model and training procedure. Specifically, we propose scale-space flow, an intuitive generalization of optical flow that adds a scale parameter to allow the network to better model uncertainty. Our experiments show that a low-latency video compression model (no B-frames) using scale-space flow for motion compensation can outperform analogous state-of-the-art learned video compression models while being trained using a much simpler procedure and without any pre-trained optical flow networks.
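The idea of adding a scale parameter to optical flow can be sketched in one dimension. This is a toy illustration under stated assumptions, not the paper's implementation: the previous signal is blurred at a few hypothetical Gaussian widths (`sigmas`) to form a scale-space volume, and each output sample is read at position `i + flow[i]` and scale `scale[i]` with linear interpolation along both axes. All function names are hypothetical.

```python
import math

def gaussian_blur_1d(signal, sigma):
    """Blur with a normalized Gaussian kernel, clamping at the borders."""
    if sigma <= 0:
        return list(signal)
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    norm = sum(kernel)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)
            acc += w * signal[j]
        out.append(acc / norm)
    return out

def lerp(a, b, t):
    return a + (b - a) * t

def sample_row(row, pos):
    """Linearly interpolate `row` at fractional position `pos`."""
    pos = min(max(pos, 0.0), len(row) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(row) - 1)
    return lerp(row[lo], row[hi], pos - lo)

def scale_space_warp_1d(prev, flow, scale, sigmas=(0.0, 1.0, 2.0)):
    """Warp `prev` by `flow`, reading from a blurrier level as `scale` grows."""
    volume = [gaussian_blur_1d(prev, s) for s in sigmas]
    out = []
    for i in range(len(prev)):
        s = min(max(scale[i], 0.0), len(sigmas) - 1.0)
        s_lo = int(math.floor(s))
        s_hi = min(s_lo + 1, len(sigmas) - 1)
        pos = i + flow[i]
        a = sample_row(volume[s_lo], pos)
        b = sample_row(volume[s_hi], pos)
        out.append(lerp(a, b, s - s_lo))
    return out

prev = [0.0, 0.0, 1.0, 0.0, 0.0]
flow = [1.0] * 5  # every output sample reads one step to the right
sharp = scale_space_warp_1d(prev, flow, scale=[0.0] * 5)
blurred = scale_space_warp_1d(prev, flow, scale=[1.0] * 5)
```

With a scale of zero the operation reduces to ordinary linear warping; a larger scale reads from a blurred copy, which is one way to express low confidence in the motion estimate.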
We consider the problem of using variational latent-variable models for data compression. For such models to produce a compressed binary sequence, which is the universal data representation in a digital world, the latent representation needs to be subjected to entropy coding. Range coding as an entropy coding technique is optimal, but it can fail catastrophically if the computation of the prior differs even slightly between the sending and the receiving side. Unfortunately, this is a common scenario when floating point math is used and the sender and receiver operate on different hardware or software platforms, as numerical round-off is often platform dependent. We propose using integer networks as a universal solution to this problem, and demonstrate that they enable reliable cross-platform encoding and decoding of images using variational models.
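The round-off problem described above can be seen in a small sketch. It only illustrates why floating-point results can depend on evaluation order while integer arithmetic does not; it is not the integer network architecture from the paper, and the fixed-point scale of 2**16 is an arbitrary assumption.

```python
def accumulate(values, start):
    """Sum values strictly left to right, as a simple stand-in for a dot product."""
    total = start
    for v in values:
        total += v
    return total

values = [0.1, 0.2, 0.3]

# Floating point: the same numbers summed in a different order can differ
# in the last bits, which is enough to desynchronize an entropy coder.
float_forward = accumulate(values, 0.0)
float_backward = accumulate(reversed(values), 0.0)

# Fixed point: quantize once to integers (hypothetical scale of 2**16),
# then every platform and every summation order gives the same result.
SCALE = 1 << 16
ints = [round(v * SCALE) for v in values]
int_forward = accumulate(ints, 0)
int_backward = accumulate(reversed(ints), 0)

print(float_forward == float_backward)  # order-dependent under IEEE 754 doubles
print(int_forward == int_backward)      # exact integer arithmetic
```

If the sender and receiver derive the prior from sums like the floating-point ones, a last-bit mismatch changes the coding intervals and decoding fails; computing the prior with integers removes that source of divergence.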
Towards a Semantic Perceptual Image Metric
Sung Jin Hwang
Sergey Ioffe
Sean O'Malley
Charles Rosenberg
2018 25th IEEE Int. Conf. on Image Processing (ICIP)
We present a full-reference, perceptual image metric based on VGG-16, an artificial neural network trained on object classification. We fit the metric to a new database based on 140k unique images annotated with ground truth by human raters who received minimal instruction. The resulting metric shows competitive performance on TID 2013, a database widely used to assess image quality assessment methods. More interestingly, it shows strong responses to objects potentially carrying semantic relevance such as faces and text, which we demonstrate using a visualization technique and ablation experiments. In effect, the metric appears to model a higher influence of semantic context on judgements, which we observe particularly in untrained raters. As the vast majority of users of image processing systems are unfamiliar with Image Quality Assessment (IQA) tasks, these findings may have significant impact on real-world applications of perceptual metrics.