Bryan Seybold
Authored Publications
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk
Xiuye Gu
Jonathan Huang
Grant Schindler
Rachel Hornung
Vighnesh Birodkar
Jimmy Yan
Ming-Chang Chiu
Hassan Akbari
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Kihyuk Sohn
Xuan Yang
Huisheng Wang
Lu Jiang
ICML (2024)
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
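As a rough illustration of the modeling recipe described above (discrete tokens from every modality fed to a single decoder-only transformer trained with next-token prediction), here is a minimal PyTorch sketch. The vocabulary size, tokenization, and layer counts are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (not the released VideoPoet model): a decoder-only transformer
# that autoregressively predicts the next token over a shared multimodal
# vocabulary. Sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalDecoder(nn.Module):
    def __init__(self, vocab_size=16384, d_model=512, n_layers=6, n_heads=8, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq) integer ids over a shared text/image/video/audio vocabulary
        seq = tokens.size(1)
        x = self.tok_emb(tokens) + self.pos_emb(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        x = self.blocks(x, mask=mask)   # causal self-attention
        return self.head(x)             # next-token logits

# Pretraining objective: next-token cross-entropy over mixed-modality sequences.
model = MultimodalDecoder()
tokens = torch.randint(0, 16384, (2, 128))
logits = model(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 16384), tokens[:, 1:].reshape(-1))
```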
Learning Audio-Video Modalities from Image Captions
Paul Hongsuck Seo
Anja Hauth
Santiago Manen
European Conference on Computer Vision (2022)
There has been a recent explosion of large-scale image-text datasets, as images with alt-text captions can be easily obtained online. Obtaining large-scale, high-quality data for video in the form of text-video and text-audio pairs, however, is more challenging. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state-of-the-art results for the task of audio retrieval.
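To make the mining idea concrete, here is a small sketch of one way such a caption-transfer step could look: captioned images and video frames are compared in a shared embedding space, and frames that match a caption closely enough yield pseudo-labelled clips. The function name, similarity threshold, and clip length are hypothetical; the paper's pipeline may differ in detail.

```python
# Minimal sketch of caption transfer, assuming precomputed visual embeddings
# (e.g., from any off-the-shelf image encoder). Thresholds are illustrative.
import numpy as np

def mine_clips(caption_image_embs, frame_embs, frame_times, sim_threshold=0.8, clip_seconds=10.0):
    """For each captioned image, find video frames whose embedding is close
    enough and emit a (start, end, caption_index) pseudo-labelled clip."""
    # cosine similarity between every captioned image and every video frame
    a = caption_image_embs / np.linalg.norm(caption_image_embs, axis=1, keepdims=True)
    b = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = a @ b.T                                  # (n_captions, n_frames)
    clips = []
    for ci, fi in zip(*np.where(sims >= sim_threshold)):
        t = frame_times[fi]
        clips.append((max(0.0, t - clip_seconds / 2), t + clip_seconds / 2, int(ci)))
    return clips

# Example with random embeddings standing in for real encoder outputs.
clips = mine_clips(np.random.randn(4, 512), np.random.randn(100, 512), np.linspace(0, 300, 100))
```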
Instance Embedding Transfer to Unsupervised Video Object Segmentation
Siyang Li
Alexey Vorobyov
Qin Huang
C.-C. Jay Kuo
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
In this work, we propose an unsupervised video object segmentation method that transfers the knowledge of image-based instance embedding networks. The instance embedding networks produce an embedding for each pixel and identify all pixels belonging to the same object. We observe that instance embeddings trained on static images are stable over consecutive video frames. Thus, we apply the trained networks to video object segmentation without model retraining or online fine-tuning, and combine them with objectness scores from an instance segmentation model and optical flow features. We analyze the stability of the instance embeddings and study how to mitigate instability. Our method outperforms state-of-the-art unsupervised segmentation methods on the DAVIS dataset and is competitive on the SegTrack-v2 dataset.
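A minimal sketch of the transfer idea, under the assumption that per-pixel embeddings, an objectness map, and optical-flow magnitude are already available for a frame: pick a high-scoring seed pixel and group pixels by embedding distance to it. The scoring and threshold below are illustrative, not the paper's exact procedure.

```python
# Illustrative sketch: combine objectness and motion to pick a foreground seed,
# then segment by distance in the image-trained instance embedding space.
import numpy as np

def segment_frame(embeddings, objectness, flow_mag, dist_threshold=0.5):
    # embeddings: (H, W, D); objectness, flow_mag: (H, W)
    score = objectness * flow_mag                  # pixels that look object-like and move
    seed = np.unravel_index(np.argmax(score), score.shape)
    seed_emb = embeddings[seed]
    dist = np.linalg.norm(embeddings - seed_emb, axis=-1)
    return dist < dist_threshold                   # boolean foreground mask

mask = segment_frame(np.random.rand(64, 64, 8), np.random.rand(64, 64), np.random.rand(64, 64))
```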
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Jia Deng
Yu-Wei Chao
CVPR (2018)
We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS’14 detection benchmark and competitive performance on the ActivityNet challenge.
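Two of the ingredients named above can be sketched compactly: multi-scale temporal anchors whose span varies with action duration, and late fusion of separately scored RGB and flow streams. The scales and fusion weight below are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of multi-scale 1-D anchors and late two-stream fusion.
import numpy as np

def temporal_anchors(num_frames, scales=(16, 32, 64, 128, 256)):
    """Return (center, length) anchors tiled over the clip, one set per scale."""
    anchors = []
    for s in scales:
        for c in range(s // 2, num_frames, s // 2):
            anchors.append((c, s))
    return anchors

def late_fuse(rgb_scores, flow_scores, w=0.5):
    # score each stream independently, then average the per-anchor scores
    return w * rgb_scores + (1 - w) * flow_scores

anchors = temporal_anchors(1000)
fused = late_fuse(np.random.rand(len(anchors)), np.random.rand(len(anchors)))
```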
Self-Supervised Learning of Structure and Motion from Video
Aikaterini Fragkiadaki
arXiv (2017)
We propose SfM-Net, a geometry-aware neural network for motion estimation in videos that decomposes frame-to-frame pixel motion in terms of scene and object depth, camera motion, and 3D object rotations and translations. Given a sequence of frames, SfM-Net predicts depth, segmentation, camera and rigid object motions, converts those into a dense frame-to-frame motion field (optical flow), differentiably warps frames in time to match pixels, and backpropagates. The model can be trained with various degrees of supervision: 1) completely unsupervised, 2) supervised by ego-motion (camera motion), 3) supervised by depth (e.g., as provided by RGBD sensors), 4) supervised by ground-truth optical flow. We show that SfM-Net successfully estimates segmentation of the objects in the scene, even though such supervision is never provided. It extracts meaningful depth estimates or infills depth of RGBD sensors and successfully estimates frame-to-frame camera displacements. SfM-Net achieves state-of-the-art optical flow performance. Our work is inspired by the long history of research in geometry-aware motion estimation, Simultaneous Localization and Mapping (SLAM) and Structure from Motion (SfM). SfM-Net is an important first step towards providing a learning-based approach for such tasks. A major benefit over the existing optimization approaches is that our proposed method can improve itself by processing more videos, and by learning to explicitly model moving objects in dynamic scenes.
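The geometric core described above (depth plus camera motion determining a dense motion field) can be sketched as follows: back-project pixels with predicted depth, apply the predicted camera motion, re-project, and read off the induced flow. The intrinsics and toy inputs are illustrative; the full model also predicts object masks and per-object motions and warps frames differentiably inside the network.

```python
# Illustrative sketch: rigid flow induced by depth and camera motion (R, t)
# under a pinhole camera model with intrinsics K.
import numpy as np

def rigid_flow(depth, K, R, t):
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous pixel coords
    rays = pix @ np.linalg.inv(K).T                       # back-project to unit-depth rays
    pts = rays * depth[..., None]                         # 3-D points in camera frame
    pts2 = pts @ R.T + t                                  # apply camera motion
    proj = pts2 @ K.T
    proj = proj[..., :2] / proj[..., 2:3]                 # re-project to pixels
    return proj - np.stack([xs, ys], axis=-1)             # flow = new pixel - old pixel

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
flow = rigid_flow(np.full((64, 64), 5.0), K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```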
CNN Architectures for Large-Scale Audio Classification
Jort F. Gemmeke
Devin Platt
Malcolm Slaney
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2017)
Convolutional Neural Networks (CNNs) have proven very effective in image classification and have shown promise for audio classification. We apply various CNN architectures to audio and investigate their ability to classify videos with a very large-scale dataset of 70M training videos (5.24 million hours) with 30,871 labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We explore the effects of training with differently sized subsets of the 70M training videos. Additionally, we report the effect of training over different subsets of the 30,871 labels. While our dataset contains video-level labels, we are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on AudioSet [5]. We find that derivatives of image classification networks do well on our audio classification task, that increasing the number of labels we train on provides some improvement over training on subsets of labels, that model performance improves as we increase training set size, and that a model using embeddings learned from the video-level task does much better than a baseline on the AudioSet classification task.
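As a sketch of the general recipe this line of work builds on (classifying fixed-size log-mel spectrogram patches with a CNN and pooling patch scores into a video-level prediction), here is a small, self-contained example. The patch shape, label count, and network are illustrative assumptions, not the architectures evaluated in the paper.

```python
# Illustrative sketch: a tiny CNN over log-mel spectrogram patches with
# multi-label sigmoid outputs, averaged across patches for a video-level score.
import torch
import torch.nn as nn

class SmallAudioCNN(nn.Module):
    def __init__(self, num_labels=3000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_labels)

    def forward(self, patches):
        # patches: (batch, 1, time_frames, mel_bins), e.g. ~1 s log-mel patches
        x = self.features(patches).flatten(1)
        return torch.sigmoid(self.classifier(x))   # per-label scores

model = SmallAudioCNN()
patch_scores = model(torch.randn(10, 1, 96, 64))    # 10 patches from one video
video_scores = patch_scores.mean(dim=0)             # video-level prediction
```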