Improving the speed of neural networks on CPUs

Vincent Vanhoucke; Andrew Senior; Mark Z. Mao

Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011

Abstract

Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions, which provide a 3X improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10X speedup over an unoptimized baseline and a 4X speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.
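To make the fixed-point idea concrete, the following is a minimal sketch of an 8-bit quantized dot product in plain Python. It is not the paper's implementation: the abstract does not specify a quantization scheme, so the choices here (signed 8-bit weights with one shared scale, unsigned 8-bit activations assumed to lie in [0, 1], a wide integer accumulator) and all function names are illustrative assumptions. The inner loop uses only integer multiply-adds, which is the kind of work that SIMD byte-multiply instructions can process many elements at a time.

```python
# Illustrative sketch only: the scheme and all names below are assumptions,
# not taken from the paper.

def quantize_weights(weights):
    """Map float weights to signed 8-bit integers using one shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = 127.0 / max_abs
    return [int(round(w * scale)) for w in weights], scale


def quantize_activations(activations):
    """Map activations (assumed to lie in [0, 1]) to unsigned 8-bit integers."""
    clipped = [min(max(a, 0.0), 1.0) for a in activations]
    return [int(round(a * 255.0)) for a in clipped], 255.0


def fixed_point_dot(q_weights, q_activations):
    """Integer-only dot product; a Python int stands in for a 32-bit accumulator."""
    acc = 0
    for w, a in zip(q_weights, q_activations):
        acc += w * a
    return acc


def dequantize(acc, w_scale, a_scale):
    """Convert the integer accumulator back to a floating-point value."""
    return acc / (w_scale * a_scale)


weights = [0.5, -0.25, 0.125, -1.0]
activations = [0.2, 0.9, 0.5, 0.1]

q_w, w_scale = quantize_weights(weights)
q_a, a_scale = quantize_activations(activations)

approx = dequantize(fixed_point_dot(q_w, q_a), w_scale, a_scale)
exact = sum(w * a for w, a in zip(weights, activations))
print("fixed-point:", approx, "floating-point:", exact)
```

The two printed values should agree to within the quantization error, which is small for this toy input. Whether accuracy holds on a real network depends on the value ranges of its weights and activations; the abstract reports no loss in accuracy for the speech recognition system it describes.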
