- Vijaya Teja Rayavarapu
- Bharath Bhat
- Myra Nam
- Vikas Bahirwani
- Shobha Diwakar
Abstract
An ads ecosystem needs robust, scalable mechanisms to safeguard users from low-quality ads. Contemporary ad creatives typically contain different combinations of modalities such as text, images, and video, and as such, any system that flags low-quality ad content needs a holistic multimodal representation of the ad. In this paper, we demonstrate that modern Transformer-based neural network models are effective multimodal learners. We report significant performance gains on the task of content quality prediction for YouTube video ads by transitioning from simpler feed-forward neural networks to Transformer-based models. We provide ablation studies to understand the impact of each input modality, and compare various flavors of Transformer architectures. We hope that our experiments help practitioners looking to incorporate these powerful multimodal models into other parts of the ads ecosystem.
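To make the general pattern the abstract describes concrete, below is a minimal sketch of one common way to fuse per-modality embeddings with a shared Transformer encoder for a quality-classification task. This is not the paper's architecture: the class name, embedding dimensions, and hyperparameters are illustrative assumptions, and it presumes pre-computed text, image, and video embeddings are available as input sequences.

```python
import torch
import torch.nn as nn

class MultimodalAdQualityClassifier(nn.Module):
    """Hypothetical sketch: project text/image/video embeddings into a
    shared space, fuse them with a Transformer encoder, and classify
    ad content quality from a learned [CLS]-style token."""

    def __init__(self, text_dim=768, image_dim=1024, video_dim=1152,
                 d_model=512, num_layers=4, num_heads=8, num_classes=2):
        super().__init__()
        # Per-modality projections into a shared token space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.image_proj = nn.Linear(image_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Learned classification token prepended to the fused sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, text_emb, image_emb, video_emb):
        # Shapes (assumed): (batch, seq_len_modality, modality_dim).
        batch = text_emb.size(0)
        tokens = torch.cat([
            self.cls_token.expand(batch, -1, -1),
            self.text_proj(text_emb),
            self.image_proj(image_emb),
            self.video_proj(video_emb),
        ], dim=1)
        encoded = self.encoder(tokens)
        # Classify from the [CLS] position's output.
        return self.head(encoded[:, 0])

# Usage with random stand-in embeddings:
model = MultimodalAdQualityClassifier()
logits = model(torch.randn(2, 16, 768),   # text tokens
               torch.randn(2, 4, 1024),   # image frames/crops
               torch.randn(2, 32, 1152))  # video segments
```

Concatenating projected modality tokens into one sequence lets self-attention model cross-modal interactions directly, which is the kind of holistic multimodal representation the abstract contrasts with simpler feed-forward fusion.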