Music Transformer is a recently developed generative model that uses self-attention with relative positional information to achieve state-of-the-art music generation. However, adapting the trained model to user preferences has proven cumbersome. In this work, we propose several techniques that enable more fine-grained control over generation through user input. Specifically, we condition on performance and melody inputs to learn musical representations that generalize well across a variety of musical tasks. Empirically, we demonstrate the effectiveness of our method on the MAESTRO dataset and on an internal dataset of more than 10,000 hours of YouTube piano performances, achieving improvements in both log-likelihood and mean listening scores.