Recent developments in large-scale machine learning have created a tempting picture: that by properly scaling up data, model size, and training time, one can obtain a model that can be used successfully in few-shot settings on all downstream tasks. In this work, we investigate this premise empirically and provide a strong case against it. In particular, we consider the image recognition task with large-scale models (Vision Transformers) trained on the largest scale of available data (JFT). We show that as we improve performance on the upstream task, either by scaling up or through hyper-parameter and architectural choices, performance on many downstream tasks eventually plateaus. We also showcase an even more extreme scenario in which upstream and downstream performance are at odds with each other, i.e., in order to achieve better downstream performance, we need to hurt upstream accuracy. We delve deeper into the reasons that give rise to these phenomena by designing interventions and investigating different components of the models, which gives us crude yet useful insights into the mechanisms behind these observations.