Yongqin Xian
I am a research scientist at Google Zurich. Prior to that, I was a postdoctoral researcher with Luc Van Gool in the Computer Vision Lab at ETH Zurich. I completed my PhD summa cum laude at the Max Planck Institute for Informatics under the supervision of Bernt Schiele and Zeynep Akata. My research focuses on vision-language model pretraining and its applications to vision tasks.
Authored Publications
Image-text pretraining on web-scale image-caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective focuses only on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we propose SILC: a simple addition of local-to-global correspondence learning by self-distillation as an auxiliary objective to contrastive pretraining. We show that distilling local image features from an exponential-moving-average (EMA) teacher model significantly improves model performance on classification, retrieval, and especially segmentation. We further show that SILC scales better than the baselines for the same training duration. Our improved SILC sets a new state of the art in zero-shot classification, few-shot classification, image retrieval, zero-shot segmentation, and open-vocabulary segmentation.
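The abstract does not spell out the loss formulation, so the sketch below is an illustrative reconstruction of the two-objective recipe it describes: a CLIP-style symmetric contrastive loss for image-text alignment, plus a DINO-style self-distillation loss that matches a student's local-crop features to an EMA teacher's global-view features. All function names, temperatures, dimensions, and the weight `lam` are assumptions for illustration, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings (CLIP-style)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)         # image -> text
                  + F.cross_entropy(logits.t(), targets))  # text -> image

def self_distillation_loss(student_local, teacher_global,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher's global-view distribution (soft target,
    no gradient) and the student's local-crop distribution, DINO-style."""
    teacher_probs = F.softmax(teacher_global / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_local / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Keep the teacher as an exponential moving average of the student's weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Toy usage: batch of 8, 512-d embeddings, 4096 projection prototypes (all assumed).
img, txt = torch.randn(8, 512), torch.randn(8, 512)
s_local, t_global = torch.randn(8, 4096), torch.randn(8, 4096)
lam = 1.0  # weighting between the two objectives; illustrative only
total = clip_contrastive_loss(img, txt) + lam * self_distillation_loss(s_local, t_global)
```

Because the teacher sees the full image while the student sees only local crops, minimizing the distillation term pushes local features toward globally consistent representations, which is exactly what the contrastive objective alone does not incentivise for dense prediction.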