CST5: Code-Switched Semantic Parsing using T5
Abstract
Extending semantic parsers to code-mixed input has been a challenging problem, primarily due to the lack of labeled data for supervision. In this work, we introduce CST5, a new data augmentation technique that fine-tunes a T5 model on a small seed set ($\approx$100 examples) to generate code-mixed utterances from English utterances, allowing us to overcome the scarcity of labeled data. We release over 10K annotated code-switched (CS) utterances alongside over 170K augmented CS utterances. Furthermore, we demonstrate the effectiveness of the augmentation technique by comparing baseline models trained without data augmentation to models trained with augmented data for varying amounts of training data.