We present a state-of-the-art non-autoregressive Text-To-Speech model. The model called Parallel Tacotron 2 learns to synthesize speech with good quality without supervised duration signals and other assumptions about the token-to-frame mapping. Specifically, we introduce a novel learned attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping. We show that this new unsupervised model outperforms the baselines in naturalness in several diverse multi speaker evaluations. Further, we show that the explicit duration model that the model has learned can be used to control the synthesized speech.