Google Research

CST5: Code-Switched Semantic Parsing using T5

arXiv (2022)


Extending semantic parsers to code-mixed input has been a challenging problem, primarily due to the lack of labeled data for supervision. In this work, we introduce CST5, a new data augmentation technique that fine-tunes a T5 model on a small seed set ($\approx$100 examples) to generate code-mixed utterances from English utterances, allowing us to overcome the scarcity of labeled data. We release over 10K annotated code-switched (CS) utterances alongside over 170K augmented CS utterances. Furthermore, we demonstrate the effectiveness of the augmentation technique by comparing baseline models trained without data augmentation to models trained with augmented data, for varying amounts of training data.
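The augmentation idea in the abstract can be sketched in a few lines. This is only an illustrative outline under stated assumptions, not the authors' implementation: the task prefix `PREFIX`, the helper names `make_seed_examples` and `augment`, and the example utterance are all hypothetical, and the `generate` callable stands in for a T5 model that has already been fine-tuned on the seed set.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical task prefix; the abstract does not specify the input format.
PREFIX = "english to code-switched: "


def make_seed_examples(seed_pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Format (English, code-switched) seed pairs as text-to-text examples.

    Each example has an "input" string (prefix + English utterance) and a
    "target" string (the code-switched utterance), the usual shape for
    fine-tuning a sequence-to-sequence model such as T5.
    """
    return [
        {"input": PREFIX + english.strip(), "target": code_switched.strip()}
        for english, code_switched in seed_pairs
    ]


def augment(
    english_utterances: Iterable[str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Produce (English, generated code-switched) pairs.

    `generate` is an assumed stand-in for the fine-tuned model: it maps one
    prefixed input string to one output string. Empty outputs are skipped.
    """
    pairs = []
    for english in english_utterances:
        output = generate(PREFIX + english.strip()).strip()
        if output:
            pairs.append((english, output))
    return pairs


if __name__ == "__main__":
    # Made-up illustrative seed pair, not taken from the released dataset.
    seed = [("set an alarm for 7 am", "7 am ke liye alarm set karo")]
    print(make_seed_examples(seed))
```

In practice `generate` would wrap a call to the fine-tuned model, and the seed set would hold roughly 100 pairs, as stated in the abstract.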
