Sebastian Goodman

I am a Staff Software Engineer at Google AI working on a team that conducts language research, with an emphasis on the intersection of vision and language tasks. Most recently, my focus has been on multimodal research efforts such as PaLI, which leverages large unimodal models to achieve state-of-the-art performance on multimodal tasks such as captioning, visual question-answering, and scene-text understanding.
Authored Publications
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets with scene text obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
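One way to picture an OCR-aware pre-training objective of this kind is as a data-construction step: pair each image with the text an off-the-shelf OCR system reads from it, show the model part of that text, and ask it to generate the rest. The sketch below is only an illustration of that idea; the function name, prompt string, and split-at-a-random-point scheme are simplifications I introduce here, not necessarily the paper's exact recipe.

```python
import random

def make_ocr_pretraining_example(image_path, ocr_tokens, rng=random.Random(0)):
    """Illustrative only: build one (input, target) pair that asks the model to
    read scene text. `ocr_tokens` is whatever an off-the-shelf OCR system
    returned for the image; the random split point is a simplification."""
    cut = rng.randrange(len(ocr_tokens) + 1)
    prompt = "Text in image: " + " ".join(ocr_tokens[:cut])
    target = " ".join(ocr_tokens[cut:])
    return {"image": image_path, "input_text": prompt, "target_text": target}

example = make_ocr_pretraining_example(
    "storefront.jpg", ["OPEN", "7", "DAYS", "A", "WEEK"])
print(example["input_text"], "->", example["target_text"])
```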
Effective scaling and a flexible task interface enable large-capacity language models to excel at many tasks. PaLI (Pathways Language and Image model) extends these ideas to the joint modeling of language and vision. PaLI is a model that generates text based on visual and textual inputs. Using this interface, PaLI is able to perform many vision, language, and multimodal tasks, across many languages. We train PaLI with two main principles: reuse of pretrained unimodal components, and joint scaling of modalities. Using large-capacity pretrained language models and vision models allows us to capitalize on their existing capabilities, while leveraging the substantial cost of training them. We scale PaLI models across three axes: the language component, the vision component, and the training data that fuses them. For the vision component, we train the largest and best-performing Vision Transformer (ViT) to date. For the data, we build an image-text training set over 10B images and covering over 100 languages. PaLI inherits and enhances language-understanding capabilities, and achieves state-of-the-art in multiple vision and language tasks (image classification, image captioning, visual question-answering, scene-text understanding, etc.), based on a simple, modular, and reuse-friendly platform for modeling and scaling.
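A rough way to picture the reuse of pretrained unimodal components is that the vision model's patch embeddings are projected into the text model's embedding space and prepended to the text tokens, so a pretrained encoder-decoder can consume both. The numpy sketch below illustrates only that interface; all dimensions, arrays, and names are invented stand-ins, not PaLI's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions (not from the paper): 16 ViT patch embeddings of
# width 768, projected into a 512-wide text-model embedding space.
num_patches, vit_dim, text_dim, vocab = 16, 768, 512, 1000

# Stand-ins for a pretrained ViT's outputs and a pretrained text model's pieces.
vit_patch_embeddings = rng.normal(size=(num_patches, vit_dim))
text_token_ids = np.array([5, 42, 7])                       # e.g. a tokenized prompt
text_embedding_table = rng.normal(size=(vocab, text_dim))

# The newly trained piece in this sketch: a linear projection that maps
# visual features into the text model's embedding space.
visual_projection = rng.normal(size=(vit_dim, text_dim)) * 0.02

visual_tokens = vit_patch_embeddings @ visual_projection    # (16, 512)
text_tokens = text_embedding_table[text_token_ids]          # (3, 512)

# Concatenate along the sequence axis; the combined sequence is what the
# pretrained encoder-decoder would consume to generate the output text.
encoder_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(encoder_input.shape)  # (19, 512)
```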
Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing meta-learning theorems to explain the performance improvements in the few-shot learning setting, where the number of samples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of samples in the observed tasks and the number of samples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theorems. Furthermore, we derive a new computationally efficient PAC-Bayesian algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.
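For context on one of the algorithms the bounds recover, the Reptile meta-update simply moves the meta-parameters a fraction of the way toward parameters adapted to a sampled task. Below is a minimal numpy sketch of that standard update on an invented toy regression family; it illustrates Reptile itself, not the paper's new PAC-Bayesian algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Toy task family (a stand-in, not a benchmark from the paper):
    regress y = a * x + b for task-specific a, b."""
    a, b = rng.uniform(-2, 2, size=2)
    x = rng.uniform(-1, 1, size=(20, 1))
    y = a * x + b
    return x, y

def inner_sgd(w, x, y, lr=0.1, steps=5):
    """A few steps of plain SGD on squared error for one task."""
    x1 = np.hstack([x, np.ones_like(x)])          # add bias feature
    for _ in range(steps):
        grad = 2 * x1.T @ (x1 @ w - y) / len(x)
        w = w - lr * grad
    return w

# Reptile meta-update: nudge the meta-parameters toward the task-adapted ones.
w_meta = np.zeros((2, 1))
meta_lr = 0.5
for _ in range(1000):
    x, y = sample_task()
    w_task = inner_sgd(w_meta.copy(), x, y)
    w_meta = w_meta + meta_lr * (w_task - w_meta)

print(w_meta.ravel())   # an initialization adapted to this family of tasks
```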
This paper introduces TeaForN, an extension of the teacher-forcing method to N-grams. Sequence generation models trained with teacher-forcing suffer from problems such as exposure bias and lack of differentiability across timesteps. TeaForN addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model-parameter updates based on N prediction steps. Unlike other approaches, TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts model quality and beam-efficiency against several sequence generation benchmarks.
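The "stack of N decoders along a secondary time axis" can be pictured as follows: decoder 1 reads the gold (teacher-forced) prefix and predicts the next token, decoder 2 reads decoder 1's output states and predicts one token further ahead, and so on, with a loss at each of the N offsets. The numpy sketch below only mirrors that structure; the linear "decoder" layers, sizes, and absence of attention or training code are drastic simplifications, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes and the linear "decoders" are stand-ins).
seq_len, d_model, vocab, N = 6, 8, 20, 3
gold = rng.integers(vocab, size=seq_len)            # gold target tokens
embed = rng.normal(size=(vocab, d_model))           # shared embedding table
decoders = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(N)]
out_proj = rng.normal(size=(d_model, vocab)) * 0.1  # shared output projection

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

total_loss = 0.0
# Decoder 1 reads the teacher-forced prefix; each later decoder in the stack
# reads the previous decoder's output states, so its prediction sits one more
# step away from the gold prefix.
states = embed[gold[:-N]]                           # teacher-forced inputs
for n, W in enumerate(decoders, start=1):
    states = np.tanh(states @ W)                    # stand-in for a decoder layer
    probs = softmax(states @ out_proj)              # predict tokens n steps ahead
    targets = gold[n : len(gold) - N + n]
    total_loss += -np.log(probs[np.arange(len(targets)), targets]).mean()

print(total_loss / N)   # loss averaged over the N prediction offsets
```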
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 for image-feature extraction and Transformer for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.
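To make "extracting and filtering image caption annotations from billions of webpages" concrete, a filtering step might look like the toy heuristics below. These specific rules are illustrative only; the actual Conceptual Captions pipeline applies a much richer set of filters and transformations.

```python
import re

def keep_alt_text(alt):
    """Illustrative heuristics only, not the dataset's actual filters."""
    if not alt:
        return False
    words = alt.split()
    if not (3 <= len(words) <= 50):            # discard very short or very long captions
        return False
    if re.search(r"https?://|www\.", alt):     # drop captions containing URLs
        return False
    if sum(w.isupper() for w in words) > len(words) // 2:
        return False                           # mostly-uppercase boilerplate
    return True

print(keep_alt_text("a dog catching a frisbee on the beach"))   # True
print(keep_alt_text("CLICK HERE TO BUY NOW www.example.com"))   # False
```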
We describe a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identify the most suitable text describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing "keywords" (or key-phrases) and their corresponding visual concepts, and instead requires an alignment between the representations of the two modalities that achieves a visually-grounded "understanding" of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-understood metric: the accuracy in detecting the true target among the decoys. The paper makes several contributions: a generic mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; results of a human evaluation on this dataset, thus providing a performance ceiling; and several baseline and competitive learning approaches that illustrate the utility of the proposed framework in advancing both image and language machine comprehension. In particular, there is a large gap between human performance and state-of-the-art learning methods, suggesting a fruitful direction for future research.
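The "accuracy in detecting the true target among the decoys" metric is straightforward to compute once a model assigns a score to each (image, caption) pair. The sketch below uses random placeholder embeddings and a dot-product scorer purely to show the bookkeeping; it is not tied to any particular model from the paper.

```python
import numpy as np

def accuracy_among_decoys(image_vecs, caption_vecs, true_index):
    """For each image, score the true caption and its decoys with a dot
    product and count how often the true caption scores highest. The
    embeddings are placeholders for whatever model is being evaluated."""
    scores = np.einsum("id,icd->ic", image_vecs, caption_vecs)
    return float(np.mean(scores.argmax(axis=1) == true_index))

rng = np.random.default_rng(0)
images = rng.normal(size=(100, 64))        # 100 images, 64-d embeddings
captions = rng.normal(size=(100, 5, 64))   # 1 true caption + 4 decoys per image
captions[:, 0] = images + 0.1 * rng.normal(size=(100, 64))  # slot 0 is the true one
print(accuracy_among_decoys(images, captions, true_index=0))
```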