A ``bigger is better'' explosion in the number of parameters in deep neural networks has made it increasingly challenging to make state-of-the-art networks accessible in compute-restricted environments.
Compression techniques have taken on renewed importance as a way to bridge this gap. However, evaluation of the trade-offs incurred by popular compression techniques such as pruning and quantization has overwhelmingly centered on high-resource datasets. In this work, we instead consider the impact of compression in a data-limited regime.
We introduce the term ``low-resource double bind'' to refer to the frequent co-occurrence of data limitations and compute resource constraints. In practice, this is a common setting for NLP technologies for low-resource languages, yet the performance trade-offs in this regime are poorly studied and understood.
We conduct large-scale experiments and show that in low-resource regimes, sparsity preserves performance on frequent sentences but has a disparate impact on infrequent phrases. Our work offers surprising insights into the relationship between capacity and generalization in data-limited regimes. We show that sparsity provides notable gains in robustness to out-of-distribution shifts, which are particularly marked for datasets that are very distinct from the training distribution. Our findings suggest sparsity can play a beneficial role in curbing memorization that impedes broader generalization.