 
Anelia Angelova

Anelia Angelova is a Principal Scientist at Google DeepMind working in the area of computer vision. She leads the Vision and Language team and previously led the Robot Vision team in Brain Robotics at Google Brain. Her most recent research focuses on vision-language and multimodal models, video understanding, semantic and 3D scene understanding, robotics perception, and real-time algorithms. She has integrated her work into production systems, including Waymo, Google Maps, Google Cloud, X, and Bard, and she currently contributes to Gemini. Anelia received her MS and PhD degrees in Computer Science from the California Institute of Technology.
          
        
Authored Publications
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Carlos Riquelme, Sebastian Goodman, Yi Tay, Siamak Shakeri, Daniel Salz, Michael Tschannen, Hexiang (Frank) Hu, Mandar Joshi, Matthias Minderer, Filip Pavetić, Gang Li, Lucas Beyer, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Xiaohua Zhai, Neil Houlsby
Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most of the vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
              
  
        
          
            
Joint Adaptive Representations for Image-Language Learning
Transformers for Vision (T4V) Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Image-language transformer models have achieved tremendous success, but they come at high computational cost. Here we propose joint adaptive image-language representation learning, which adaptively and iteratively fuses the multimodal features. This consistently reduces model cost and size, allows the model to scale without a large increase in FLOPs or memory, and outperforms bigger and much more expensive models. With only 40M training examples and 39 GFLOPs, our model outperforms models many times its size, some requiring as much as 800 GFLOPs.
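
As a rough illustration of the adaptive, iterative fusion described above (all function and parameter names below are hypothetical stand-ins, not the paper's), the sketch scores tokens from both modalities, keeps a shrinking subset at every round, and mixes the survivors with a simple attention-style weighting:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def adaptive_fuse(img_tokens, txt_tokens, rounds=3, keep_ratio=0.5):
        """Iteratively fuse image/text tokens while shrinking the token set.
        The token-scoring rule and keep schedule are illustrative assumptions."""
        tokens = np.concatenate([img_tokens, txt_tokens], axis=0)  # (N, D)
        for _ in range(rounds):
            # Adaptive step: keep only the highest-scoring tokens this round.
            scores = np.linalg.norm(tokens, axis=-1)
            k = max(1, int(len(tokens) * keep_ratio))
            tokens = tokens[np.argsort(scores)[-k:]]
            # Fusion step: attention-style mixing across the surviving tokens.
            attn = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[-1]))
            tokens = attn @ tokens
        return tokens.mean(axis=0)  # pooled joint image-language representation

    img = rng.normal(size=(196, 64))   # e.g. ViT patch tokens
    txt = rng.normal(size=(32, 64))    # e.g. text tokens
    print(adaptive_fuse(img, txt).shape)  # (64,)

Because the token set shrinks each round, the cost of the later fusion rounds drops accordingly, which is the source of the FLOPs savings described in the abstract.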
              
  
        
          
            
Mechanical Search on Shelves with Efficient Stacking and Destacking of Objects
Huang Huang, Letian Fu, Michael Danielczuk, Chung Min Kim, Zachary Tam, Jeff Ichnowski, Brian Ichter, Ken Goldberg
International Symposium on Robotics Research (ISRR) (2023)
Stacking increases storage efficiency on shelves, but the lack of visibility and accessibility makes the mechanical search problem of revealing and extracting target objects difficult for robots. In this paper, we extend the lateral-access mechanical search problem to shelves with stacked items and introduce two novel policies -- Distribution Area Reduction for Stacked Scenes (DARSS) and Monte Carlo Tree Search for Stacked Scenes (MCTSSS) -- that use destacking and restacking actions. MCTSSS improves on prior lookahead policies by considering future states after each potential action. Experiments in 1200 simulated and 18 physical trials with a Fetch robot equipped with a blade and suction cup suggest that destacking and restacking actions can reveal the target object with 82--100% success in simulation and 66--100% in physical experiments, and are critical for searching densely packed shelves. In the simulation experiments, both policies outperform a baseline and achieve similar success rates, but take more steps than an oracle policy with full state information. In simulation and physical experiments, DARSS outperforms MCTSSS in the median number of steps to reveal the target, but MCTSSS has a higher success rate in the physical experiments, suggesting robustness to perception noise.
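
The distribution-area-reduction idea can be sketched as a greedy policy over a toy occupancy grid; the simulator, action set, and grid values below are stand-ins for illustration only, not the paper's environment:

    import numpy as np

    def occluded_area(grid):
        """Total shelf area where the hidden target could still be."""
        return int(grid.sum())

    def simulate(grid, action):
        """Toy stand-in for a destack/restack action: clears one cell."""
        new = grid.copy()
        new[action] = 0
        return new

    def darss_step(grid, actions):
        """Greedy choice: pick the action that most reduces the target's
        possible-location area (the 'distribution area')."""
        return min(actions, key=lambda a: occluded_area(simulate(grid, a)))

    # Per-cell occluded area behind each stacked item (3 levels x 5 slots).
    grid = np.array([[3, 1, 2, 5, 1],
                     [1, 1, 4, 1, 2],
                     [2, 6, 1, 1, 1]])
    actions = [(r, c) for r in range(3) for c in range(5)]
    print(darss_step(grid, actions))   # (2, 1): the cell hiding the most area

MCTSSS, as described above, would instead roll this simulation forward over sequences of actions and pick the action whose future states look best, rather than greedily minimizing area one step at a time.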
              
  
        
        
          
We present a simple approach that turns a ViT encoder into an efficient video model and works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model can perform training and inference on both. It is easily scalable, can be adapted to large-scale pre-trained ViTs without requiring full finetuning, and achieves state-of-the-art results.
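
A minimal sketch of the sparse-sampling idea: a single image is treated as one frame, a video contributes only a few sparsely sampled frames, and both reuse the same ViT-style encoder. The encoder here is a random-projection stand-in, not the actual pre-trained model:

    import numpy as np

    rng = np.random.default_rng(0)
    W_patch = rng.normal(size=(16 * 16 * 3, 256))    # stand-in patch embedding

    def vit_encode(frame):
        """Stand-in ViT encoder: patchify a 224x224x3 frame and embed patches."""
        patches = frame.reshape(14, 16, 14, 16, 3).transpose(0, 2, 1, 3, 4)
        patches = patches.reshape(196, -1)           # 196 patches of 16x16x3
        return patches @ W_patch                     # (196, 256) tokens

    def encode_input(frames, num_samples=4):
        """Images and videos share one encoder; videos are sparsely sampled."""
        idx = np.linspace(0, len(frames) - 1, min(num_samples, len(frames))).astype(int)
        tokens = np.concatenate([vit_encode(frames[i]) for i in idx], axis=0)
        return tokens.mean(axis=0)                   # pooled representation

    image = rng.random((1, 224, 224, 3))             # a single image = 1 frame
    video = rng.random((64, 224, 224, 3))            # a 64-frame clip
    print(encode_input(image).shape, encode_input(video).shape)  # (256,) (256,)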
              
  
        
          
            
Dynamic Pre-training of Vision-Language Models
Wei Li
ICLR 2023 Workshop on Multimodal Representation Learning (2023)
Vision-language pretraining aims to learn universal cross-modal representations and to create models with broad capabilities. In this paper, we propose a novel dynamic pretraining resampling scheme over a variety of pretraining tasks. Unlike recent large-scale vision-language approaches, we show that a set of diverse self- and weakly-supervised pretraining tasks, dynamically sampled according to task difficulty, provides strong performance. Further, the approach is sample-efficient, using much less data and compute to address a range of downstream tasks. We show that a single 330M-parameter model, pretrained using only smaller and publicly accessible datasets, achieves competitive or state-of-the-art performance on three diverse groups of tasks: visual question answering, text-based image localization by referring expressions, and video question answering.
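
One way to read "dynamically sampled according to task difficulty" is to resample pretraining tasks in proportion to their current losses. The sketch below is that interpretation only, with made-up task names and loss values:

    import numpy as np

    rng = np.random.default_rng(0)

    def task_probs(recent_losses, temperature=1.0):
        """Sampling probabilities proportional to task difficulty, estimated
        here by each task's recent average loss (an assumption)."""
        difficulty = np.array(list(recent_losses.values())) / temperature
        p = difficulty / difficulty.sum()
        return dict(zip(recent_losses.keys(), p))

    # Hypothetical pretraining mixture with running-average losses.
    losses = {"captioning": 2.1, "masked_lm": 1.3, "image_text_match": 0.6}

    for step in range(5):
        probs = task_probs(losses)
        task = rng.choice(list(probs.keys()), p=list(probs.values()))
        # ... run one pretraining step for `task`, then update its loss ...
        losses[task] *= 0.95      # pretend the sampled task got a bit easier
        print(step, task, {k: round(v, 2) for k, v in probs.items()})

As a task's loss falls, its sampling probability falls with it, so the mixture keeps shifting compute toward whichever tasks are currently hardest.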
              
  
        
          
            
MaMMUT: A Simple Vision-Encoder Text-Decoder Architecture for MultiModal Tasks
Xiyang Luo, Wei Li, Abhijit Ogale, Luowei Zhou, Zhifeng Chen
Transactions on Machine Learning Research (2023)
The development of language models has moved from encoder-decoder to decoder-only designs. In addition, conventional wisdom holds that the two most popular multimodal task families, generative and contrastive, tend to conflict with one another, are hard to accommodate in one architecture, and further require complex adaptations for downstream tasks. We propose a novel training paradigm with a decoder-only model for multimodal tasks, which is surprisingly effective for jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and accommodates contrastive and generative learning through a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes weight sharing across tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks while being modest in capacity. It achieves the state of the art on image-text and text-image retrieval, video question answering, and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundation models. It shows very competitive results on VQA and video captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.
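
A compact sketch of how one shared text decoder could serve both objectives with two passes, per the description above: a contrastive pass that skips cross-attention and pools a text embedding, and a generative pass that cross-attends to the image features. The masking, pooling, and layer shapes here are assumptions rather than the paper's exact configuration:

    import numpy as np

    rng = np.random.default_rng(0)
    D = 64

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(q, k, v, causal=False):
        scores = q @ k.T / np.sqrt(q.shape[-1])
        if causal:
            scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
        return softmax(scores) @ v

    def decoder(text_tokens, image_tokens=None):
        """One shared text decoder block, used in two different passes."""
        x = text_tokens + attention(text_tokens, text_tokens, text_tokens, causal=True)
        if image_tokens is not None:                 # generative pass only
            x = x + attention(x, image_tokens, image_tokens)
        return x

    image_tokens = rng.normal(size=(196, D))         # from the vision encoder
    text_tokens = rng.normal(size=(12, D))

    # Pass 1 (contrastive): no cross-attention; pool a text embedding.
    text_embed = decoder(text_tokens).mean(axis=0)
    image_embed = image_tokens.mean(axis=0)
    similarity = text_embed @ image_embed            # fed to a contrastive loss

    # Pass 2 (generative): cross-attend to image tokens for captioning / VQA.
    decoded = decoder(text_tokens, image_tokens)     # then project to logits
    print(similarity.shape, decoded.shape)

The weight sharing mentioned in the abstract comes from the fact that both passes run through the same decoder parameters; only the presence of cross-attention differs.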
              
  
        
          
            
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen, Piotr Padlewski, Daniel Salz, Sebastian Alexander Goodman, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Keran Rong, Hassan Akbari, Linting Xue, James Bradbury, Chao Jia, Carlos Riquelme, Xiaohua Zhai, Neil Houlsby
International Conference on Learning Representations (ICLR) (2023)
Effective scaling and a flexible task interface enable large-capacity language models to excel at many tasks. PaLI (Pathways Language and Image model) extends these ideas to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs; using this interface, it can perform many vision, language, and multimodal tasks, across many languages. We train PaLI with two main principles: reuse of pretrained unimodal components, and joint scaling of the modalities. Using large-capacity pretrained language models and vision models allows us to capitalize on their existing capabilities, while leveraging the substantial cost of training them. We scale PaLI models across three axes: the language component, the vision component, and the training data that fuses them. For the vision component, we train the largest and best-performing Vision Transformer (ViT) to date. For the data, we build an image-text training set of over 10B images covering over 100 languages.
PaLI inherits and enhances language-understanding capabilities, and achieves state-of-the-art results on multiple vision and language tasks (image classification, image captioning, visual question answering, scene-text understanding, etc.), based on a simple, modular, and reuse-friendly platform for modeling and scaling.
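
In the spirit of "reuse of pretrained unimodal components", the sketch below shows only the general interface: image patches go through a stand-in ViT, get projected into the text model's embedding space, and are consumed together with the text prompt by a stand-in generator. All weights, dimensions, and the decoding loop are random placeholders, not PaLI's actual components:

    import numpy as np

    rng = np.random.default_rng(0)
    D_VIT, D_TXT, VOCAB = 128, 64, 1000

    vit_w = rng.normal(size=(16 * 16 * 3, D_VIT))     # stand-in "pretrained" ViT
    proj = rng.normal(size=(D_VIT, D_TXT))            # visual -> text space
    embed = rng.normal(size=(VOCAB, D_TXT))           # text token embeddings
    lm_head = rng.normal(size=(D_TXT, VOCAB))

    def encode_image(image):
        patches = image.reshape(14, 16, 14, 16, 3).transpose(0, 2, 1, 3, 4)
        return patches.reshape(196, -1) @ vit_w @ proj    # (196, D_TXT)

    def generate(image, prompt_ids, steps=5):
        """Placeholder for the encoder-decoder text model: pool the fused
        visual + textual input and greedily emit output tokens."""
        ctx = np.concatenate([encode_image(image), embed[prompt_ids]], axis=0)
        state = ctx.mean(axis=0)
        out = []
        for _ in range(steps):
            next_id = int(np.argmax(state @ lm_head))
            out.append(next_id)
            state = state + embed[next_id]            # crude state update
        return out

    image = rng.random((224, 224, 3))
    print(generate(image, prompt_ids=np.array([1, 2, 3])))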
              
  
        
          
            
Diversifying Joint Vision-Language Tokenization Learning
Vardaan Pahuja
Transformers for Vision (T4V) Workshop at the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from both modalities but also be diverse for better generalization performance. To this end, we propose joint vision-language representation learning that diversifies the tokenization learning process, enabling tokens that are sufficiently disentangled from each other to be learned from both modalities. We observe that our approach outperforms the baseline models in a majority of settings and is competitive with state-of-the-art methods.
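
The "sufficiently disentangled tokens" goal could be encouraged with a decorrelation penalty over the learned joint tokens; the regularizer below is one such possibility, offered as an assumption rather than the paper's exact loss:

    import numpy as np

    rng = np.random.default_rng(0)

    def diversity_penalty(tokens):
        """Penalize similarity between distinct learned tokens so that each
        captures a different aspect of the joint image-text input."""
        t = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
        sim = t @ t.T                                  # pairwise cosine similarity
        off_diag = sim - np.diag(np.diag(sim))
        return (off_diag ** 2).sum() / (len(tokens) * (len(tokens) - 1))

    joint_tokens = rng.normal(size=(8, 64))            # learned vision-language tokens
    print(diversity_penalty(joint_tokens))             # added to the task loss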
              
  
        
          
            
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. In the pretraining phase, we propose to randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches the use of positional embeddings at the region level in the detection finetuning phase. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points, in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well, achieving the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches with larger models.
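
The core pretraining trick reads almost directly as code: instead of using the full-image positional embedding grid, randomly crop a region of it and resize the crop back to the full grid. The nearest-neighbor resize and crop distribution below keep the sketch dependency-free and are simplifying assumptions, not RO-ViT's exact choices:

    import numpy as np

    rng = np.random.default_rng(0)

    def cropped_positional_embedding(pos_embed, min_ratio=0.1):
        """pos_embed: (H, W, D) grid of ViT positional embeddings.
        Randomly crop a region and resize it back to (H, W, D)."""
        H, W, D = pos_embed.shape
        ch = rng.integers(max(1, int(H * min_ratio)), H + 1)   # crop height
        cw = rng.integers(max(1, int(W * min_ratio)), W + 1)   # crop width
        top = rng.integers(0, H - ch + 1)
        left = rng.integers(0, W - cw + 1)
        crop = pos_embed[top:top + ch, left:left + cw]
        # Nearest-neighbor resize of the crop back to the full grid size.
        rows = (np.arange(H) * ch / H).astype(int)
        cols = (np.arange(W) * cw / W).astype(int)
        return crop[rows][:, cols]

    pos_embed = rng.normal(size=(14, 14, 768))     # ViT-B/16 style grid
    print(cropped_positional_embedding(pos_embed).shape)   # (14, 14, 768)

During pretraining the model therefore only ever sees "region-like" positional embeddings, which is what makes them a better match for the region crops used at detection finetuning time.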
              
  
        
        
          
We present F-VLM, a simple open-vocabulary object detection method built upon frozen vision and language models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-ups and compute savings. Code will be released at https://sites.google.com/corp/view/f-vlm/home.
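
A sketch of the inference-time combination: the frozen VLM scores each region against the category text embeddings, and that score is blended with the trained detector head's score. The geometric-mean blend and the weight below are illustrative assumptions, not necessarily F-VLM's published formula:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def combined_scores(det_logits, region_feats, text_embeds, alpha=0.35):
        """Blend the detector head's class scores with the frozen VLM's
        region-vs-text similarity scores (geometric mean, an assumption)."""
        det = softmax(det_logits)                        # (R, C) detector head
        vlm = softmax(region_feats @ text_embeds.T)      # (R, C) frozen VLM classifier
        return det ** (1 - alpha) * vlm ** alpha

    det_logits = rng.normal(size=(5, 20))     # 5 regions, 20 categories
    region_feats = rng.normal(size=(5, 64))   # pooled frozen-VLM region features
    text_embeds = rng.normal(size=(20, 64))   # category-name text embeddings
    print(combined_scores(det_logits, region_feats, text_embeds).shape)  # (5, 20)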
              
  