Synthetic Datasets for Neural Program Synthesis

Richard Shin
Neel Kant
Kavi Gupta
Chris Bender
Brandon Trabucco
Rishabh Singh
Dawn Song
ICLR (2019)

Abstract

The goal of program synthesis is to automatically generate programs in a particular
language from corresponding specifications, e.g., input-output behavior. Many
current approaches achieve impressive results after training on randomly generated
I/O examples in limited domain-specific languages (DSLs), as with string
transformations in RobustFill. However, we empirically discover that applying
test-input generation techniques to languages with control flow and rich input
spaces causes deep networks to generalize poorly to certain data distributions.
To correct this, we propose a new methodology for controlling and evaluating the
bias of synthetic data distributions over both programs and specifications. We
demonstrate, using the Karel DSL and a small Calculator DSL, that training deep
networks on these distributions leads to improved cross-distribution
generalization performance.
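To make the abstract's notion of "randomly generated" synthetic data concrete, the following is a minimal Python sketch, not the paper's actual generator: it naively samples programs from a toy Calculator DSL of integer arithmetic and pairs each with its output. All names here (OPS, sample_expr, sample_dataset, and the sampling parameters) are hypothetical illustrations, chosen only to show how uniform random sampling implicitly fixes a distribution over programs and specifications, the bias the paper proposes to control and evaluate.

    import random

    # Hypothetical operator set for a toy Calculator DSL (illustration only).
    OPS = ["+", "-", "*"]

    def sample_expr(depth: int) -> str:
        """Recursively sample an arithmetic expression up to `depth` levels deep."""
        if depth == 0 or random.random() < 0.3:
            return str(random.randint(0, 9))  # leaf: a single digit
        op = random.choice(OPS)
        return f"({sample_expr(depth - 1)} {op} {sample_expr(depth - 1)})"

    def sample_dataset(n: int, depth: int = 3):
        """Generate n (program, output) pairs as a naive synthetic dataset."""
        data = []
        for _ in range(n):
            prog = sample_expr(depth)
            # eval is safe here: prog contains only digits, parens, and + - *.
            data.append((prog, eval(prog)))
        return data

    if __name__ == "__main__":
        for prog, out in sample_dataset(3):
            print(prog, "=>", out)

Even in this tiny sketch, choices like the leaf probability (0.3) and the depth bound silently skew the resulting program distribution, which is exactly the kind of uncontrolled bias the paper argues harms cross-distribution generalization.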