In this paper, we compare a variety of causal inference methods through simulation. We examine their sensitivity to correlation between (heterogeneous) treatment effect size and propensity to be treated, their asymptotic behavior in the presence of such correlation, and their robustness to model misspecification. We limit our focus to well-established methods relevant to the estimation of sales lift, which initially motivated this paper and serves as an illustrative example throughout. We demonstrate that popular matching methods often fail to adequately debias lift estimates, and that even doubly robust estimators, when naively implemented, fail to deliver statistically valid confidence intervals. The culprit is inadequate standard error estimators, which often yield insufficient confidence interval coverage because they fail to account for uncertainty arising at early stages of the causal model. As an alternative, we discuss a more reliable approach: a doubly robust point estimator paired with a sandwich standard error estimator.