In modern machine learning tasks, the presence of categorical features with extremely large vocabularies is a reality. This becomes a bottleneck when using an ML model, which generally grows at least linearly with the vocabulary size, affecting the memory, training and inference costs, as well as overfitting risk. In this work, we seek to compress the vocabulary by maximizing the mutual information between the compressed categorical feature and the target binary labels. We note the relationship of this problem to that of quantization in a discrete memoryless channel, where there exists a quadratic-time algorithm to solve the problem. Unfortunately, such an algorithm does not scale to data sets with massive vocabularies and, in this paper, we develop a distributed quasi-linear O(n log n) algorithm with provable approximation guarantees. We first observe that although entropy is a submodular function, this is not the case for mutual information between a categorical feature and label. To tackle this problem, we define a set function over a different space, which still contains the optimal solution, and prove this function is submodular. We also provide a query oracle to the submodular function that runs in amortized logarithmic time, and is easy to compute in a distributed fashion. Combining these results with a greedy algorithm allows us to achieve a (1-1/e)-approximation in quasi-linear time. Finally, we compare our proposed algorithm to several existing approaches using the large-scale Criteo learning task and demonstrate better performance in retaining mutual information and also verify the learning performance of the compressed vocabulary.