Google Research

InFormal Dataset


InFormal is a formality style transfer dataset for four Indic languages: Hindi, Telugu, Kannada and Bengali. Each example consists of a pair of sentences and gold labels that identify the more formal sentence of the pair and rate the semantic similarity between the two. To produce these labels, annotators are asked to choose the more formal sentence and to rate the semantic similarity of the pair on a 3-point scale. The dataset is intended to be used as a test set for style transfer tasks in Indic languages and is part of the evaluation suite proposed in the ACL paper.
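
As a rough illustration of how the description above could map onto an evaluation loop, the following Python sketch models one example and measures how often a formality scorer agrees with the gold label. The file format is not described here, so every name in it (InFormalExample, its fields, the "a"/"b" label encoding, the 1 to 3 similarity encoding, and formality_accuracy) is a hypothetical assumption and not the released schema or the paper's metric.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class InFormalExample:
    """One sentence pair with its gold labels.

    All field names and encodings are assumptions made for illustration;
    the released files may use a different layout.
    """
    language: str      # assumed language tag, e.g. "hi", "te", "kn", "bn"
    sentence_a: str
    sentence_b: str
    more_formal: str   # assumed encoding: "a" or "b"
    similarity: int    # 3-point scale, assumed to be encoded as 1, 2 or 3


def formality_accuracy(
    examples: Iterable[InFormalExample],
    scorer: Callable[[str], float],
) -> float:
    """Fraction of pairs where the scorer ranks the gold-formal sentence higher.

    `scorer` is any function that maps a sentence to a formality score
    (higher means more formal). Ties are resolved in favour of sentence_a.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("formality_accuracy needs at least one example")
    correct = 0
    for ex in examples:
        predicted = "a" if scorer(ex.sentence_a) >= scorer(ex.sentence_b) else "b"
        if predicted == ex.more_formal:
            correct += 1
    return correct / len(examples)
```

A scorer here might be, for instance, the output of a formality classifier. Because the pairs also carry a similarity rating, the same structure could be used to restrict evaluation to pairs that annotators judged close in meaning.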