Pseudo label is better than human label

Zhouyuan Huo
INTERSPEECH 2022 (2022)


Human labeling is expensive, and labeling is the most painful step in ML production. It is widely believed that data is the new gold and that big tech companies have an unfair advantage. But does unlimited data really mean unlimited model performance? In this study, we show that 1k hours of human-labeled data is enough to train the best ASR model. A model trained with 1k hours of human labels and 26k hours of pseudo labels achieves better WERs than a model trained with 27k hours of human labels. Pseudo-label training improves the WER of the production model by a significant margin, from 5.9 to 5.1 on voice search. This means that pseudo-label quality is better than human-label quality. To obtain high-quality pseudo labels, we applied recent self-supervised and semi-supervised learning methods to a large ASR model.
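To put the reported numbers in perspective, the short sketch below works out the relative WER reduction on voice search and the share of training audio that carries pseudo labels. The figures come from the abstract. The helper name `relative_reduction` and the variable names are illustrative assumptions, not part of the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (hypothetical helper)."""
    return (baseline_wer - new_wer) / baseline_wer


# Voice search WERs reported in the abstract: 5.9 before, 5.1 after.
voice_search_gain = relative_reduction(5.9, 5.1)

# Training mix reported in the abstract: 1k hours human labels, 26k hours pseudo labels.
human_hours = 1_000
pseudo_hours = 26_000
pseudo_share = pseudo_hours / (human_hours + pseudo_hours)

print(f"Relative WER reduction on voice search: {voice_search_gain:.1%}")
print(f"Share of training audio with pseudo labels: {pseudo_share:.1%}")
```

Under these figures, the absolute drop of 0.8 WER points is a relative reduction of roughly 13.6%, obtained with about 96% of the training audio labeled by a model rather than by people.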