Proper Reuse of Image Classification Features Improves Object Detection

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2022), pp. 13628-13637


A widely accepted practice in transfer learning is to pre-train a model on a data-abundant upstream task and use the pre-trained weights to initialize the model on the downstream task. Specifically, in Object Detection (OD) it is common to initialize the feature backbone with pre-trained ImageNet classifier weights and fine-tune those weights along with the other detection model parameters. Recent work has shown that this practice is not strictly necessary and that it is possible to train an object detector from scratch by training for much longer. In this work we investigate the opposite end of the training spectrum and keep the feature backbone frozen during object detection training, preserving the classifier initialization. Contrary to the common belief that object detectors benefit from end-to-end training, we conjecture that the weight initialization obtained from training a classifier contains useful knowledge that is forgotten by fine-tuning or avoided entirely when training from scratch, with negative consequences for long-tail classes. As an immediate contribution of our findings, we show that it is possible to train an off-the-shelf object detection model with similar if not superior performance while significantly reducing the need for computational resources, in terms of both memory and computation (FLOPs). The performance benefits of the proposed upstream task knowledge preservation are even clearer when stratifying results by class and by number of available annotations. Our results on MSCOCO, LVIS and Pascal VOC show that our extreme formulation of model reuse has a clear positive impact on full-shot object detection and also on typical hard cases, such as classes with few annotations, as found in long-tail object recognition and few-shot learning.
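
To make the idea of a frozen backbone concrete, the following is a minimal toy sketch in plain Python, not the paper's implementation. It assumes a hypothetical two-parameter model in which a scalar "backbone" weight stands in for the pre-trained classifier features and a scalar "head" weight stands in for the detection-specific parameters. All names (`train_head_only`, `pretrained_backbone`, the synthetic data) are illustrative assumptions.

```python
# Toy illustration only: a two-parameter model y = head * (backbone * x).
# Every name here is hypothetical and does not come from the paper.

def train_head_only(backbone_w, head_w, data, lr=0.01, epochs=200):
    """Run SGD on squared error, updating only the head weight.

    The backbone weight plays the role of a frozen, pre-trained feature
    extractor: it is used in the forward pass but never updated, so no
    gradient is computed or stored for it.
    """
    for _ in range(epochs):
        for x, y in data:
            feature = backbone_w * x                 # frozen forward pass
            pred = head_w * feature                  # trainable head
            grad_head = 2.0 * (pred - y) * feature   # d(error^2)/d(head_w)
            head_w -= lr * grad_head                 # only the head moves
    return backbone_w, head_w


pretrained_backbone = 2.0   # stands in for pre-trained classifier weights
data = [(x, 6.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # synthetic target y = 6x

backbone_after, head_after = train_head_only(pretrained_backbone, 0.0, data)
```

In this sketch the backbone value is returned unchanged while the head adapts to the task. Skipping the backbone's backward pass and update is where the memory and compute savings come from, and in a deep learning framework the analogous step would be excluding the backbone parameters from gradient computation and from the optimizer.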