In instruction-conditioned navigation, agents interpret natural language and their surroundings to navigate in an environment. Datasets for such tasks typically pair instructions with reference trajectories, but currently popular evaluation metrics fail to properly account for the fidelity of agents to those trajectories. To address this, we introduce the normalized Dynamic Time Warping (nDTW) metric. nDTW softly penalizes deviations from the reference path, is naturally sensitive to the order of the nodes composing each path, is suited for both continuous and graph-based evaluations, and can be efficiently calculated. Further, we define SDTW, which constrains nDTW to only successful episodes and effectively captures both success and fidelity. We collect human similarity judgments for simulated paths and find that our DTW metrics correlate better with human rankings than all other metrics. We also show that using nDTW as a reward signal for reinforcement learning agents improves performance on both the Room-to-Room and Room-for-Room datasets.
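
To make the two metrics concrete, the following is a minimal Python sketch rather than a reference implementation. The abstract does not state the formulas, so the sketch assumes a common formulation: nDTW is taken as exp(-DTW(R, Q) / (|R| * d_th)), where R is the reference path, Q is the agent's path, and d_th is a success threshold distance, and SDTW is taken as nDTW when the agent stops within d_th of the goal and zero otherwise. The function names, the Euclidean distance, and the default threshold value are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def dtw(reference: Sequence[Point], query: Sequence[Point]) -> float:
    """Classic O(|R| * |Q|) dynamic-programming DTW cost between two paths."""
    n, m = len(reference), len(query)
    # cost[i][j] = minimal cumulative cost of aligning the first i reference
    # points with the first j query points.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])  # assumed: Euclidean
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance along the reference only
                cost[i][j - 1],      # advance along the query only
                cost[i - 1][j - 1],  # advance along both
            )
    return cost[n][m]


def ndtw(reference: Sequence[Point], query: Sequence[Point],
         threshold: float = 3.0) -> float:
    """Assumed normalization: exp(-DTW / (|R| * d_th)), a score in (0, 1]."""
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


def sdtw(reference: Sequence[Point], query: Sequence[Point],
         threshold: float = 3.0) -> float:
    """Assumed form: nDTW if the episode is successful, otherwise 0."""
    success = math.dist(reference[-1], query[-1]) <= threshold
    return ndtw(reference, query, threshold) if success else 0.0


if __name__ == "__main__":
    ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    print(ndtw(ref, ref, threshold=1.0))        # identical paths
    print(ndtw(ref, ref[::-1], threshold=1.0))  # same nodes, reversed order
    print(sdtw(ref, ref[::-1], threshold=1.0))  # agent stops far from the goal
```

Under these assumptions, a path that visits the same nodes in reverse order scores lower than the reference itself, which illustrates the order sensitivity described above, and the exponential keeps the penalty for deviations soft rather than thresholded.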