TeaForN: Teacher-Forcing with N-grams
Abstract
This paper introduces TeaForN, an extension of the teacher-forcing method to N-grams.
Sequence generation models trained with teacher-forcing suffer from problems such as exposure bias and lack of differentiability across timesteps.
TeaForN addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model-parameter updates based on N prediction steps.
Unlike other approaches, TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup.
Empirically, we show that TeaForN boosts model quality and beam-efficiency on several sequence generation benchmarks.
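To make the decoder-stack idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): a stack of N decoders in which the first decoder is teacher-forced on the gold tokens and each subsequent decoder consumes a differentiable soft embedding of the previous decoder's predictions, so the training loss reflects N prediction steps per position. The module names, GRU decoders, and loss weighting are illustrative assumptions only.

```python
# Hypothetical sketch of an N-decoder stack in the spirit of TeaForN.
# Assumes sequence length > n_decoders; padding/masking is omitted for brevity.

import torch
import torch.nn as nn


class TeaForNSketch(nn.Module):
    def __init__(self, vocab_size, hidden_size, n_decoders):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # One decoder per level of the secondary time axis.
        self.decoders = nn.ModuleList(
            nn.GRU(hidden_size, hidden_size, batch_first=True)
            for _ in range(n_decoders)
        )
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, gold_ids):
        """gold_ids: (batch, time) teacher-forcing targets."""
        losses = []
        inputs = self.embed(gold_ids)  # level 1 sees the gold tokens
        for level, decoder in enumerate(self.decoders):
            hidden, _ = decoder(inputs)
            logits = self.out(hidden)  # position t predicts token t + level + 1
            target = gold_ids[:, level + 1:]
            step_logits = logits[:, : target.size(1)]
            losses.append(
                nn.functional.cross_entropy(
                    step_logits.reshape(-1, step_logits.size(-1)),
                    target.reshape(-1),
                )
            )
            # The next level consumes a soft embedding of this level's
            # predictions rather than gold tokens, keeping the pipeline
            # differentiable across the extra prediction steps.
            inputs = torch.softmax(logits, dim=-1) @ self.embed.weight
        return torch.stack(losses).sum()
```

Feeding soft (expected) embeddings to the deeper decoders, instead of sampled tokens, is one way to keep gradients flowing across the N prediction steps, which is the differentiability issue the abstract refers to; the actual method may differ in how predictions are propagated and how the per-level losses are weighted.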