As state-of-the-art network models routinely grow to architectures with billions or even trillions of learnable parameters, efficiently storing these models and loading them into working memory becomes an increasingly pronounced bottleneck. This is felt most severely in efforts to port models to personal devices, such as consumer cell phones, which now commonly include GPU and TPU processors designed to handle the enormous computational burdens associated with deep networks. In this paper, we present novel techniques for dramatically reducing the number of free parameters in deep network models, with the explicit goals of (1) compressing models with little or no decompression overhead at inference time and (2) reducing the number of free parameters in arbitrary models without requiring any modifications to the architecture. We examine four techniques that build on one another, and provide insight into when and how each technique operates. Accuracy as a function of free parameters is measured on two very different deep networks: ResNet and Vision Transformer. On the latter, we find that we can reduce the number of parameters by 20\% with no loss in accuracy.