We present a simple but effective technique for deep semi-supervised learning. On labeled examples, the model is trained with standard cross-entropy loss. On an unlabeled example, the model first performs inference (acting as a “teacher”) and then learns from the resulting output distribution (acting as a “student”). We deviate from prior work by adding multiple auxiliary student softmax layers to the model. Each student layer takes as input the output of a sub-network of the full model that has a restricted view of the input (e.g., only seeing one region of an image). The students can learn from the teacher because the teacher sees more of each example. Concurrently, the students improve the representations used by the teacher as they learn to make predictions from limited input. We propose variants of our method for CNN image classifiers and BiLSTM sequence taggers. When combined with Virtual Adversarial Training, our method improves upon the current state of the art on semi-supervised CIFAR-10 and semi-supervised SVHN. We also apply it to train semi-supervised sequence taggers for four Natural Language Processing tasks using hundreds of millions of sentences of unlabeled data. The resulting models improve upon or are competitive with the current state of the art on every task.
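
As a rough illustration of the two training signals described above, the following Python sketch computes a supervised cross-entropy loss on a labeled example and a teacher-student loss on an unlabeled one. It is a minimal sketch under stated assumptions, not the implementation used in our experiments: the linear softmax layers, the use of KL divergence as the distance between teacher and student distributions, the treatment of the teacher's output as a fixed target, and the modeling of a "restricted view" as a subset of feature indices are all simplifications. All function and parameter names (`unlabeled_loss`, `views`, and so on) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, features):
    """Apply a weight matrix (list of rows, one per class) to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability assigned to the gold label."""
    return -math.log(probs[label] + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); an assumed choice of distance between distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def labeled_loss(teacher_w, features, label):
    """Standard supervised loss: the full-view softmax layer vs. the gold label."""
    return cross_entropy(softmax(linear(teacher_w, features)), label)


def unlabeled_loss(teacher_w, student_ws, views, features):
    """Average divergence between the teacher and each restricted-view student.

    `views` holds, for each student, the feature indices it may see
    (a stand-in for a sub-network with a restricted view of the input).
    """
    # Teacher: inference with the full view; used here as a fixed target.
    teacher_probs = softmax(linear(teacher_w, features))
    total = 0.0
    for student_w, view in zip(student_ws, views):
        restricted = [features[i] for i in view]
        student_probs = softmax(linear(student_w, restricted))
        total += kl_divergence(teacher_probs, student_probs)
    return total / len(student_ws)


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 4.0]
    teacher_w = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
    student_ws = [[[0.5, -0.5], [-0.5, 0.5]], [[0.2, 0.1], [0.1, 0.2]]]
    views = [[0, 1], [2, 3]]  # each student sees half of the features
    print("labeled loss:", labeled_loss(teacher_w, features, label=0))
    print("unlabeled loss:", unlabeled_loss(teacher_w, student_ws, views, features))
```

In a real model the students and the teacher would share the lower layers, so minimizing the unlabeled loss would also update the representations the teacher relies on; the independent weight matrices here omit that sharing for brevity.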