Magika: Deep Learning Meets Content Type Detection
Abstract
Content-type detection, the task of determining the data type encoded by a byte stream, has a long history in computing and is now a key primitive for critical automated pipelines. The first program developed to perform this task, "file", shipped with Bell Labs UNIX over five decades ago. Since then, a number of additional tools have been developed but, despite their importance, it remains unclear how well these approaches perform and whether modern techniques can improve on the state of the art.
This paper sheds light on this overlooked area. We collect a dataset of more than 26M samples and perform the first large-scale evaluation of existing content-type detection tools. We then introduce Magika, a new content-type detection tool based on deep learning. Magika is designed to be fast (5 ms inference time), even on a single CPU, making it a viable replacement for existing command-line tools and suitable for large-scale automated pipelines.
Magika achieves 99%+ average precision and recall, a double-digit percentage improvement in accuracy (in absolute terms) over the state of the art.
As a testament to its real-world utility, we are working with a large email provider and with Visual Studio Code developers on integrating Magika to be their reference content-type detector. To ease reproducibility, we release all our artifacts, including the tool, the model, the training pipeline, the dataset collection codebase, and details about our dataset.