Weilong Yang

Weilong Yang is a senior staff software engineer and a Tech Lead / Manager in Machine Perception at Google Research. He and his team are exploring new methods of applying AI to content creation, such as design creation, image and video ad generation, video highlight selection, and thumbnail image generation. Dr. Yang was involved in organizing the AI for Content Creation workshop at CVPR 2020, and he has served on the program committees of CVPR, ECCV, ICCV, and NIPS for many years. His research interests include deep learning, generative models, and video classification.

Authored Publications
    Recent years have witnessed rapid progress in generative adversarial networks (GANs). However, the success of GAN models hinges on a large amount of training data. This work proposes a regularization approach for training robust GAN models on limited data. We theoretically show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data. Extensive experiments on several benchmark datasets demonstrate that the proposed regularization scheme 1) improves the generalization performance and stabilizes the learning dynamics of GAN models under limited training data, and 2) complements recent data augmentation methods. These properties facilitate training GAN models to achieve state-of-the-art performance when only limited training data of the ImageNet benchmark is available.
    Non-linear video editing requires composing footage using visual framing and temporal effects, which can be a time-consuming process. Often, editors borrow effects from existing creations and develop personal editing styles. In this paper, we propose an automatic approach that extracts editing styles from a source video and applies the edits to matched footage for video creation. Our computer-vision-based techniques detect the framing, content type, playback speed, and lighting of each input video segment. By applying a combination of these features, we demonstrate an effective method that transfers the visual and temporal styles from professionally edited videos to unseen raw footage. Our experiments with real-world input videos received positive feedback from survey participants.
    Graphic design is essential for visual communication, with layouts being fundamental to composing attractive designs. Layout generation differs from pixel-level image synthesis and is unique in its requirement of mutual relations among the desired components. We propose a method for design layout generation that can satisfy user-specified constraints. The proposed neural design network (NDN) consists of three modules. The first module predicts a graph with complete relations from a graph with user-specified relations. The second module generates a layout from the predicted graph. Finally, the third module fine-tunes the predicted layout. Quantitative and qualitative experiments demonstrate that the generated layouts are visually similar to real design layouts. We also construct real designs based on predicted layouts for a better understanding of the visual quality. Finally, we demonstrate a practical application to layout recommendation.
    Image generation from a scene description is a cornerstone technique for controlled generation, which benefits applications such as content creation and image editing. In this work, we aim to synthesize images from a scene description with retrieved patches as reference. We propose a differentiable retrieval module. With the differentiable retrieval module, we can (1) make the entire pipeline end-to-end trainable, enabling the learning of better feature embeddings for retrieval; and (2) encourage the selection of mutually compatible patches with additional objective functions. We conduct extensive quantitative and qualitative experiments to demonstrate that the proposed method can generate realistic and diverse images, where the retrieved patches are reasonable and mutually compatible.
    Performing controlled experiments on noisy data is essential to understanding deep learning across noise levels. Due to the lack of suitable datasets, previous research has only examined deep learning on controlled synthetic label noise, and real-world label noise has never been studied in a controlled setting. This paper makes three contributions. First, we establish the first benchmark of controlled real label noise (obtained from image search). This new benchmark will enable us to study image search label noise in a controlled setting for the first time. The second contribution is a simple but highly effective method to overcome both synthetic and real noisy labels. We show that our method achieves the best result on our dataset as well as on two public benchmarks (CIFAR and WebVision). Third, we conduct the largest study so far into understanding deep neural networks trained on noisy labels across different noise levels, noise types, network architectures, methods, and training settings. We will release the data and code to reproduce our results.
    We present a method for learning an embedding that places images of humans in similar poses nearby. This embedding can be used as a direct method of comparing images based on human pose, avoiding potential challenges of estimating body joint positions. Pose embedding learning is formulated under a triplet-based distance criterion. A deep architecture is used to allow learning of a representation capable of making distinctions between different poses. Experiments on human pose matching and retrieval from video data demonstrate the potential of the method.
    We consider the problem of content-based automated tag learning. In particular, we address semantic variations (sub-tags) of the tag. Each video in the training set is assumed to be associated with a sub-tag label, and we treat this sub-tag label as latent information. A latent learning framework based on LogitBoost is proposed which jointly considers both the tag label and the latent sub-tag label. The latent sub-tag information is exploited in our framework to assist the learning of our end goal, i.e., tag prediction. We use the cowatch information to initialize the learning process. In experiments, we show that the proposed method achieves significantly better results than baselines on a large-scale testing video set which contains about 50 million YouTube videos.