Google Research

Re-contextualizing Fairness in NLP for India Data


This repository contains data resources for the paper Re-contextualizing Fairness in NLP: The Case of India, accepted at AACL-IJCNLP 2022.

The paper provides a holistic research agenda for re-contextualizing fairness research in the specific geo-cultural context of India. We also present empirical evidence that India-specific biases are present in natural language processing (NLP) corpora and models. This data allows reproduction of our analysis of biases in corpora and models along the dimensions that are relevant to the Indian context.

The dataset contains tuples of the form [identity term, attribute] (e.g., [gujarati, entrepreneur]). Human raters annotated each tuple to indicate whether the attribute is commonly associated with the identity term as a stereotype. The tuples were created with a combination of dictionary-driven (relying on previous literature for lists of characteristics and identity terms) and corpus-driven (filtering based on occurrence in IndicCorp-en) approaches.
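As a rough illustration of how the per-annotator labels might be summarized, the sketch below counts stereotype votes for each [identity term, attribute] tuple. The record layout, the field order, and the vote values are assumptions made up for this example; they are not the schema or contents of the released files, so adapt the loading step to the actual data.

```python
from collections import defaultdict

# Hypothetical in-memory records, NOT the released file schema:
# (identity_term, attribute, annotator_id, is_stereotype).
# The vote values below are placeholders for illustration only.
annotations = [
    ("gujarati", "entrepreneur", "a1", True),
    ("gujarati", "entrepreneur", "a2", True),
    ("gujarati", "entrepreneur", "a3", False),
]


def count_stereotype_votes(records):
    """Return {(identity_term, attribute): (yes_votes, total_votes)}."""
    votes = defaultdict(lambda: [0, 0])
    for identity, attribute, _annotator_id, is_stereotype in records:
        key = (identity, attribute)
        votes[key][0] += int(is_stereotype)  # count "yes" votes
        votes[key][1] += 1                   # count all votes
    return {key: tuple(counts) for key, counts in votes.items()}


summary = count_stereotype_votes(annotations)
for (identity, attribute), (yes, total) in summary.items():
    print(f"{identity}, {attribute}: {yes}/{total} annotators marked as stereotype")
```

Keeping both the "yes" count and the total, rather than a single majority label, leaves the choice of agreement threshold to the reader.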

Refer to Section 5 of the paper for data curation and annotation details. We also retain individual annotations with anonymized annotator IDs and each annotator's self-identified gender and geographic region. Along with the annotated tuples, we release the list of identity terms, the list of proxy identity terms (first names with prototypical gender associations, as obtained from Wikipedia), and the list of templates used to perform the analysis of NLP models in the paper.
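One way the templates and identity terms could be combined into model inputs is sketched below. The `[IDENTITY]` slot marker, the example template, and the function name `fill_templates` are all hypothetical; the released templates may use a different placeholder convention, so check the files before reusing this.

```python
def fill_templates(templates, identity_terms, slot="[IDENTITY]"):
    """Substitute each identity term into each template.

    Returns a list of (template, identity_term, sentence) triples.
    The slot marker is an assumption, not the released format.
    """
    filled = []
    for template in templates:
        if slot not in template:
            # Skip templates without the assumed slot instead of
            # silently emitting identical sentences.
            continue
        for term in identity_terms:
            filled.append((template, term, template.replace(slot, term)))
    return filled


# Hypothetical template and terms, for illustration only.
example_templates = ["The [IDENTITY] person went to the market."]
example_terms = ["gujarati", "tamil"]

for _template, term, sentence in fill_templates(example_templates, example_terms):
    print(term, "->", sentence)
```

Returning the template and term alongside each sentence makes it straightforward to group model outputs by identity term afterwards.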