Jump to Content
Ben Caine

Ben Caine

Ben is a Research Engineer on the Google Brain team working on machine learning, computer vision, and autonomous driving.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract The development of language models have moved from encoder-decoder to decoder-only designs. In addition, the common knowledge has it that the two most popular multimodal tasks, the generative and contrastive tasks, tend to conflict with one another, are hard to accommodate in one architecture, and further need complex adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning of these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach. View details
    Improving 3D Object Detection through Progressive Population Based Augmentation
    Shuyang Cheng
    Zhaoqi Leng
    Barret Richard Zoph
    Chunyan Bai
    Jiquan Ngiam
    Vijay Vasudevan
    Jon Shlens
    Drago Anguelov
    Preview abstract Data augmentation has been widely adopted for object detection in 3-D point clouds. All efforts have focused on manually designing specific data augmentation methods for individual architectures, however no work has attempted to automate the design of data augmentation in 3-D detection problems -- as is common in 2-D camera-based computer vision. In this work, we present a first attempt to automate the design of data augmentation policies for 3-D object detection. We describe an algorithm termed Progressive Population Based Augmentation (PPBA). PPBA learns to optimize augmentation strategies by narrowing down the search space, and adopting the best parameters discovered in previous iterations. On the KITTI test set, PPBA improves the StarNet by substantial margins on the moderate difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset, a 20x larger dataset compared to KITTI, indicate that PPBA continues to effectively improve 3D object detection. The magnitude of the improvements may be comparable to advances in 3-D perception architectures, yet data augmentation incurs no cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient on baseline 3D detection models without augmentation, highlighting that 3D detection models may achieve competitive accuracy with far fewer labeled examples. View details
    No Results Found