TimeDial is a crowdsourced English challenge set for temporal commonsense reasoning, formulated as a multiple-choice cloze task over ~1.5k carefully curated dialogs. The dataset is derived from DailyDialog (Li et al., 2017), a multi-turn dialog corpus.
In ~1.1k of the dialogs, more than one of the options for a masked span is correct. This makes the task more challenging for models, since recognizing all the correct choices requires a more comprehensive understanding of the context. We also guarantee two incorrect answers for each masked span. Some incorrect options are selected to be spuriously correlated with the dialog context. For example, we include temporal spans that appear in the dialog context as negative options, which challenges models that rely primarily on shallow pattern matching without correct temporal reasoning.
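To make the task format concrete, here is a minimal sketch of one masked span with several correct options and a distractor copied from the context. The `ClozeInstance` layout, the helper names, and the example dialog are all invented for illustration; they are assumptions, not the dataset's actual schema or contents (see the README for the real format).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ClozeInstance:
    """Hypothetical layout for one masked span (not the real schema)."""
    context: str        # dialog text with the span replaced by <MASK>
    options: List[str]  # candidate fillers for <MASK>
    correct: Set[int]   # indices of the options that are valid fillers


def exact_match(instance: ClozeInstance, predicted: Set[int]) -> bool:
    """True only if the prediction contains every correct option and no incorrect one."""
    return set(predicted) == instance.correct


def shallow_baseline(instance: ClozeInstance) -> Set[int]:
    """Naive pattern matcher: select options whose text already occurs in the context."""
    context = instance.context.lower()
    return {i for i, opt in enumerate(instance.options) if opt.lower() in context}


# Made-up example: two correct fillers, one distractor taken from the
# dialog context, and one implausible option.
example = ClozeInstance(
    context=(
        "A: The meeting starts at half past three. How long until we leave?\n"
        "B: It is three o'clock now, so we have <MASK> left."
    ),
    options=["thirty minutes", "half an hour", "half past three", "two days"],
    correct={0, 1},
)

print(exact_match(example, {0, 1}))                     # both correct options selected
print(exact_match(example, {0}))                        # one correct option missed
print(exact_match(example, shallow_baseline(example)))  # baseline picks the copied span
```

Under this illustrative all-or-nothing scoring, selecting only one of the two correct options counts as a miss, and the pattern-matching baseline fails because the only option it selects is the distractor copied from the context.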
See README for more information.