We propose an end-to-end GAN-based model for unconditional synthesis of complex scenes. Our model first synthesizes a realistic segmentation layout and then synthesizes a realistic scene conditioned on that layout. For the former, we use an unsupervised progressive segmentation generation network that captures the distribution of realistic semantic scene layouts. For the latter, we use a conditional segmentation-to-image synthesis network that captures the distribution of photo-realistic images conditioned on the semantic layout. We show that our end-to-end model outperforms state-of-the-art generative models in unsupervised image synthesis on two challenging domains, as measured by the Fréchet Inception Distance and user-study evaluations. Moreover, we demonstrate that the generated segmentation maps can be used as additional training data to substantially improve recent segmentation-to-image synthesis networks.
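
The two-stage sampling order described above (layout first, then an image conditioned on that layout) can be illustrated with a toy sketch. This is not the proposed networks: `sample_layout`, `render_image`, `sample_scene`, `NUM_CLASSES`, `SIZE`, and `PALETTE` are hypothetical names and values introduced only for illustration, and simple random stand-ins replace the learned generators. The coarse-to-fine upsampling in stage 1 only loosely echoes the idea of progressive generation.

```python
import random

NUM_CLASSES = 4  # hypothetical number of semantic classes
SIZE = 8         # hypothetical layout resolution (SIZE x SIZE)

# Hypothetical per-class base colours standing in for a learned renderer.
PALETTE = [(70, 130, 180), (34, 139, 34), (128, 128, 128), (210, 180, 140)]


def sample_layout(rng):
    """Stage 1 stand-in: turn random noise into a SIZE x SIZE grid of class labels."""
    # Sample a coarse 2x2 layout, then upsample it by nearest neighbour.
    coarse = [[rng.randrange(NUM_CLASSES) for _ in range(2)] for _ in range(2)]
    scale = SIZE // 2
    return [[coarse[y // scale][x // scale] for x in range(SIZE)]
            for y in range(SIZE)]


def render_image(layout, rng):
    """Stage 2 stand-in: produce an RGB image conditioned on the layout."""
    image = []
    for row in layout:
        out_row = []
        for label in row:
            base = PALETTE[label]
            # Jitter the class colour slightly so pixels of one class differ.
            pixel = tuple(min(255, max(0, c + rng.randint(-10, 10))) for c in base)
            out_row.append(pixel)
        image.append(out_row)
    return image


def sample_scene(seed):
    """Unconditional sampling: draw a layout, then an image given that layout."""
    rng = random.Random(seed)
    layout = sample_layout(rng)
    image = render_image(layout, rng)
    return layout, image
```

Calling `sample_scene(seed)` returns both the label grid and the image, which mirrors how a sampled layout and its rendered scene form a paired example that could, in principle, be reused as extra training data for a segmentation-to-image model.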