Retraining modern deep learning systems can lead to variations in model performance even when trained using the same data and hyperparameters by simply using different random seeds. This phenomenon is known as model churn or model jitter. This issue is often exacerbated in real world settings, where noise may be introduced in the data collection process. In this work we tackle the problem of stable retraining with a novel focus on structured prediction for conversational semantic parsing. We first quantify the model churn by introducing metrics for agreement between predictions across multiple re-trainings. Next, we devise realistic scenarios for noise injection and demonstrate the effectiveness of various churn reduction techniques such as ensembling and distillation. Lastly, we discuss practical tradeoffs between such techniques and show that codistillation provides a sweet spot in terms of churn reduction with only a modest increase in resource usage.