Leveraging Organizational Resources to Adapt Models to New Data Modalities
Abstract
As applications in large organizations grow and evolve, the machine learning (ML) models that power them must adapt to new data modalities that arise over the application life cycle (e.g., a new video content launch in a social media application requires existing models to apply to video).
To solve this problem, organizations typically create ML pipelines from scratch.
However, this approach fails to utilize the large volume of organizational resources they possess in the form of existing services and models operating over related tasks, prior data modalities, aggregate statistics, and knowledge bases.
In this paper, we demonstrate how organizational resources can help construct a common feature space that enables teams across an organization to share data and resources for new tasks across different data modalities.
This allows teams to apply methods for training data curation (e.g., weak supervision) and model training (e.g., forms of transfer learning) across data modalities.
We demonstrate how this improves end-model performance and time-to-deployment when creating cross-modal pipelines.
This serves as a case study in building a system to leverage resources from across an organization for each step of the ML pipeline, including feature generation, training data curation, and model training.
While techniques to use organizational resources at each step have been studied in isolation, we consider whether and how they compose at scale in a production setting.
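
As a rough illustration of the common feature space idea, the sketch below maps items of different modalities into shared features and applies labeling functions written once over those features. It is a minimal sketch under stated assumptions, not the system described in this paper: the service (topic_service), the feature names, the labeling functions, and the thresholds are all hypothetical, and a simple majority vote stands in for a weak supervision label model.

```python
from collections import Counter

ABSTAIN = -1  # a labeling function returns this when it has no opinion


def topic_service(item):
    # Hypothetical stand-in for an existing organizational topic model.
    # Here it only reads a precomputed field, regardless of modality.
    return item["topics"]


def to_common_features(item):
    # Map a text, image, or video item into the same feature dictionary.
    return {
        "topics": set(topic_service(item)),
        "reports": item.get("user_reports", 0),
    }


# Labeling functions are defined over the common features, so the same
# functions apply unchanged to a newly launched modality.
def lf_flagged_topic(f):
    return 1 if "spam" in f["topics"] else ABSTAIN


def lf_many_reports(f):
    return 1 if f["reports"] >= 3 else ABSTAIN


def lf_benign_topic(f):
    benign = {"sports", "cooking"}
    return 0 if f["topics"] and f["topics"] <= benign else ABSTAIN


LABELING_FUNCTIONS = [lf_flagged_topic, lf_many_reports, lf_benign_topic]


def weak_label(item):
    # Majority vote over non-abstaining labeling functions; ties abstain.
    features = to_common_features(item)
    votes = [lf(features) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    if list(counts.values()).count(top) > 1:
        return ABSTAIN
    return label


text_post = {"modality": "text", "topics": ["spam"], "user_reports": 5}
video_post = {"modality": "video", "topics": ["cooking"], "user_reports": 0}
print(weak_label(text_post), weak_label(video_post))
```

Because the labeling functions never inspect raw text or video, supporting a new modality in this sketch only requires that the hypothetical services can produce the same features for it.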