Trevor Gale
Trevor joined the Brain Team as part of the 2018 Brain Residency Program after doing his undergrad at Northeastern University. He started his PhD at Stanford University in 2019 and works with Matei Zaharia in the Future Data Systems group. In 2022, Trevor re-joined the Brain Team in the Brain Computer Architecture Research group. Trevor is broadly interested in next-generation machine learning models and their consequences for computer systems.
At Northeastern, Trevor worked with David Kaeli on high-performance computing, general-purpose graphics processing units, and workload characterization, and with Jennifer Dy on deep learning and medical imaging. Trevor previously worked on large-scale distributed deep learning and computer vision at Samsung Research, and on the deep learning frameworks team at Nvidia, where he created the DALI data loading and augmentation framework and prototypes for the NVJPEG library.
Trevor is originally from Maine, and enjoys eating food, surfing, skiing, running, swimming, and cats. His favorite type of tea is mint. He loves coffee and, since joining Google, has developed enough of a caffeine tolerance to drink it without getting jittery. He does not enjoy motorcycles that are too loud, and is no good at basketball. Previously, Trevor had been eating lots of lentils.
Authored Publications
Rigging The Lottery: Making All Tickets Winners
Jacob Menick
Erich Elsen
International Conference on Machine Learning (2020)
Abstract
Recent work (Kalchbrenner et al., 2018) has demonstrated that sparsity in the parameters of neural networks leads to more parameter- and floating-point-operation (FLOP) efficient networks, and that these gains also translate into inference time reductions. There is a large body of work (Molchanov et al., 2017; Zhu & Gupta, 2017; Louizos et al., 2017; Li et al., 2016; Guo et al., 2016) on various ways of pruning networks that require dense training but yield sparse networks for inference. This limits the size of the largest trainable model to the largest trainable dense model. Concurrently, other work (Mocanu et al., 2018; Mostafa & Wang, 2019; Bellec et al., 2017) has introduced dynamic sparse reparameterization training methods that allow a network to be trained while always sparse. However, these methods either do not reach the accuracy of pruning, or do not have a fixed FLOP cost due to parameter re-allocation during training. This work introduces a new method for end-to-end sparse training that does not require parameter re-allocation and that matches and even exceeds the accuracy of dense-to-sparse methods. We show that this method requires fewer FLOPs to achieve a given level of accuracy than previous methods. We also provide some insights into why static sparse training fails to find good minima and dynamic reparameterization succeeds.
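The core idea in the abstract is easiest to see as a drop/grow step on a fixed-size sparse mask. Below is a minimal NumPy sketch of that idea; the function name `drop_and_grow`, the `update_fraction` parameter, and the zero re-initialization of regrown weights are illustrative assumptions for exposition, not the authors' released implementation.

```python
import numpy as np

def drop_and_grow(weights, mask, grads, update_fraction=0.3):
    """One illustrative fixed-sparsity update step.

    Drops the active connections with the smallest weight magnitude and
    regrows the same number of previously inactive connections where the
    dense gradient magnitude is largest, so the number of non-zero
    parameters (and hence the per-step FLOP cost) stays constant.
    """
    flat_w = weights.ravel().copy()
    flat_mask = mask.ravel().copy()
    flat_g = grads.ravel()

    active = np.flatnonzero(flat_mask == 1)
    inactive = np.flatnonzero(flat_mask == 0)
    k = int(update_fraction * active.size)
    if k == 0:
        return weights, mask

    # Drop: deactivate the k active connections with the smallest |weight|.
    drop = active[np.argsort(np.abs(flat_w[active]))[:k]]
    flat_mask[drop] = 0
    flat_w[drop] = 0.0

    # Grow: activate the k inactive connections with the largest |gradient|,
    # initialized to zero so subsequent training decides their values.
    grow = inactive[np.argsort(-np.abs(flat_g[inactive]))[:k]]
    flat_mask[grow] = 1
    flat_w[grow] = 0.0

    return flat_w.reshape(weights.shape), flat_mask.reshape(mask.shape)
```

In this sketch the drop and grow counts are equal by construction, which is what keeps the FLOP cost fixed throughout training, in contrast to methods that re-allocate parameters across layers.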
Abstract
We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results. Additionally, we replicate the experiments performed by (Frankle & Carbin, 2018) and (Liu et al., 2018) at scale and show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization. Together, these results highlight the need for large-scale benchmarks in the field of model compression. We open-source our code, top performing model checkpoints, and results of all hyperparameter configurations to establish rigorous baselines for future work on compression and sparsification.
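As a point of reference for the "simple magnitude pruning" baseline discussed above, here is a minimal NumPy sketch of global magnitude pruning; the function name `magnitude_prune` and the single-shot thresholding are illustrative assumptions, not the code released with the paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Return a binary mask that zeroes out the `sparsity` fraction of
    weights with the smallest magnitude (ties may prune slightly more).

    In practice this is applied gradually over the course of training so
    the surviving weights can adapt after each pruning step.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                  # number of weights to remove
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Example: prune 75% of a random weight matrix.
w = np.random.randn(64, 64).astype(np.float32)
w_sparse = w * magnitude_prune(w, sparsity=0.75)
```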