We present an end-to-end neural network that translates images containing text from one language to another. Traditionally, this problem is solved with a cascade of optical character recognition (OCR) followed by neural machine translation (NMT). However, the cascaded approach compounds OCR and NMT errors, incurs higher latency, and performs poorly on multiline input. Our simplified approach combines OCR and NMT into a single end-to-end model. The architecture follows the encoder-decoder paradigm, with a convolutional encoder and an autoregressive Transformer decoder. Trained end-to-end, the proposed model yields significant improvements along multiple dimensions: it (i) achieves higher translation accuracy by reducing error propagation, (ii) incurs lower inference latency due to its smaller network size, (iii) translates multiline paragraphs while respecting the reading order of the lines, and (iv) eliminates the source-side vocabulary. We train several variations of encoders and decoders on a synthetic corpus of 120M+ English-French images and show that our approach outperforms the cascaded approach by a large margin in both automatic metrics and a detailed side-by-side human evaluation.
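The inference flow described above (image features in, target-language tokens out, with no intermediate source text) can be illustrated with a minimal sketch of greedy autoregressive decoding. This is not the model itself: `encode` is a hypothetical stand-in that average-pools pixel rows instead of applying learned convolutions, `toy_step` is a hypothetical placeholder for a trained Transformer decoder step, and the `BOS`/`EOS` token ids are assumed for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical special-token ids in the target vocabulary (assumed for illustration).
BOS, EOS = 0, 1

Features = List[List[float]]


def encode(image: Sequence[Sequence[float]], width: int = 2) -> Features:
    """Stand-in for a convolutional encoder: average-pool each pixel row
    over non-overlapping windows of `width` pixels."""
    return [
        [sum(row[i:i + width]) / width
         for i in range(0, len(row) - width + 1, width)]
        for row in image
    ]


def greedy_decode(
    features: Features,
    step: Callable[[Features, List[int]], List[float]],
    max_len: int = 32,
) -> List[int]:
    """Greedy autoregressive decoding: each step sees the image features and
    the target tokens emitted so far, and appends the highest-scoring token."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = step(features, tokens)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens[1:]  # drop BOS; only target-side token ids are ever produced


def toy_step(features: Features, prefix: List[int]) -> List[float]:
    """Hypothetical decoder step: emits one token per feature row, chosen from
    the row's mean intensity, then EOS. A trained decoder would instead attend
    over all features and the full prefix."""
    vocab_size = 5
    scores = [0.0] * vocab_size
    position = len(prefix) - 1  # number of tokens generated so far
    if position < len(features):
        row = features[position]
        mean = sum(row) / len(row)
        scores[2 + min(2, int(mean * 3))] = 1.0
    else:
        scores[EOS] = 1.0
    return scores


if __name__ == "__main__":
    image = [[0.0, 0.0, 0.2, 0.2], [0.9, 0.9, 1.0, 1.0]]
    print(greedy_decode(encode(image), toy_step))
```

The decoding loop never materializes a source-language string, so only a target vocabulary is needed; in a cascade, the OCR output would have to be tokenized with a separate source vocabulary before translation could start.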