Consistency-based active learning: Minimizing the labeling budget
Abstract
Active learning (AL) combines data labeling and model training to minimize the labeling cost by prioritizing the selection of high-value data that can best improve model performance. In pool-based active learning, accessible unlabeled data are not used for model training in
most conventional methods. Here, we propose to unify unlabeled sample
selection and model training towards minimizing labeling cost, and make
two contributions towards that end. First, we exploit both labeled and
unlabeled data using semi-supervised learning (SSL) to distill information from unlabeled data during the training stage. Second, we propose a
consistency-based sample selection metric that is coherent with the training objective such that the selected samples are effective at improving
model performance. We conduct extensive experiments on image classification tasks. The experimental results on CIFAR-10, CIFAR-100 and
ImageNet demonstrate the superior performance of our proposed method
with limited labeled data, compared to existing methods and alternative AL and SSL combinations. Additionally, we study an important yet under-explored problem: "When can we start learning-based
AL selection?”. We propose a measure that is empirically correlated with
the AL target loss and is potentially useful for determining the proper
starting point of learning-based AL methods.
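
The abstract describes the consistency-based selection metric only at a high level. As a hedged illustration, the sketch below shows one common way such an acquisition score can be computed: score each unlabeled image by the variance of its softmax predictions under several random augmentations, then query labels for the least consistent samples. The names model, augment, and unlabeled_loader, and the choice of variance as the disagreement measure, are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def consistency_scores(model, unlabeled_loader, augment,
                           num_augs=5, device="cpu"):
        # Score each unlabeled sample by how much its predictions
        # disagree across random augmentations (higher = less consistent).
        # Assumes the loader yields (images, _) batches and `augment` is
        # any stochastic transform on a batch of images (hypothetical names).
        model.eval()
        scores = []
        with torch.no_grad():
            for x, _ in unlabeled_loader:
                x = x.to(device)
                # Softmax predictions for num_augs augmented views:
                # shape (num_augs, batch_size, num_classes).
                probs = torch.stack(
                    [F.softmax(model(augment(x)), dim=1)
                     for _ in range(num_augs)]
                )
                # Per-sample disagreement: variance across views,
                # summed over classes.
                scores.append(probs.var(dim=0).sum(dim=1).cpu())
        return torch.cat(scores)

    # Query labels for the B least consistent samples, e.g.:
    #   idx = consistency_scores(model, loader, augment).topk(B).indices

Selecting high-variance samples targets exactly the inputs on which a consistency-regularized SSL objective is least satisfied, which is one way a selection metric can be made coherent with the training objective, as the abstract advocates.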