Tim Au
Tim is currently a Data Science manager at Google, where he focuses on building statistical models for Google's ads business. Prior to Google, he received a BA with Distinction in Mathematics and Economics from Cornell University in 2010, and a PhD in Statistics from Duke University in 2014.
Authored Publications
Evaluating the incremental return on ad spend (iROAS) of a prospective online marketing strategy (i.e., the ratio of the strategy's causal effect on some response metric of interest relative to its causal effect on the ad spend) has become increasingly important. Although randomized "geo experiments" are frequently employed for this evaluation, obtaining reliable estimates of iROAS can be challenging, as oftentimes only a small number of highly heterogeneous units are used. Moreover, advertisers frequently impose budget constraints on their ad spend, which further complicates causal inference by introducing interference between the experimental units. In this paper we formulate a novel statistical framework for inferring the iROAS of online advertising from randomized paired geo experiments, which further motivates and provides new insights into Rosenbaum's arguments on instrumental variables, and we propose and develop "Trimmed Match", a robust, distribution-free, and interpretable estimator, as well as a data-driven choice of the tuning parameter which may be of independent interest. We investigate the sensitivity of Trimmed Match to some violations of its assumptions and show on simulated data that it can be more efficient than some alternative estimators. We then demonstrate its practical utility with real case studies.
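As a toy illustration of the iROAS quantity defined in the abstract, the sketch below pools pair-level treatment-minus-control differences from a paired geo experiment and applies a crude symmetric trim. This is not the paper's Trimmed Match estimator or its data-driven tuning; the data, the trimming rule, and the trim fraction are invented for illustration:

```python
# Toy sketch of iROAS in a paired geo experiment (illustrative only;
# NOT the Trimmed Match estimator from the paper).
# delta_response / delta_spend hold the treatment-minus-control
# differences for each matched pair of geos.

def pooled_iroas(delta_response, delta_spend):
    """Empirical iROAS: total incremental response / total incremental spend."""
    return sum(delta_response) / sum(delta_spend)

def trimmed_iroas(delta_response, delta_spend, trim_frac=0.1):
    """Drop the pairs with the most extreme pair-level ratios, then pool.
    A crude stand-in for a robust, trimming-based estimate."""
    pairs = sorted(zip(delta_response, delta_spend),
                   key=lambda p: p[0] / p[1])
    k = int(len(pairs) * trim_frac)
    kept = pairs[k:len(pairs) - k] if k else pairs
    return sum(r for r, _ in kept) / sum(s for _, s in kept)

# Hypothetical pair-level differences from 6 matched geo pairs;
# the fifth pair is an outlier that inflates the pooled ratio.
d_resp = [120.0, 95.0, 210.0, 80.0, 400.0, 60.0]
d_spend = [40.0, 30.0, 70.0, 25.0, 50.0, 20.0]

print(round(pooled_iroas(d_resp, d_spend), 3))                  # → 4.106
print(round(trimmed_iroas(d_resp, d_spend, trim_frac=0.2), 3))  # → 3.069
```

The outlier pair shifts the pooled estimate noticeably; trimming it (and, symmetrically, the lowest-ratio pair) yields a value closer to the bulk of the pairs, which is the intuition behind robust, trimming-based estimation in small, heterogeneous geo experiments.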
Although randomized controlled trials are regarded as the "gold standard" for causal inference, advertisers have been hesitant to embrace them as their primary method of experimental design and analysis due to technical difficulties in implementing them in the online advertising context. To help mitigate some of these challenges while still providing the rigor of a randomized controlled trial, Vaver and Koehler (2011) introduced the concept of a "geo experiment." However, it may not always be possible to rely on randomization when designing a geo experiment. For example, it may not be realistic to expect randomization to create balanced experimental groups when some of the geos are markedly different from all of the others or when there are only a few geos available for experimentation. In addition, randomization may not always be feasible given some of the specific requirements that advertisers often must impose on their experiments in practice---such as the need to run a smaller scale geo experiment within a given budget or the need to include certain geos in specific experimental groups. Consequently, advertisers may sometimes prefer to forgo some of the benefits of randomization, and in this paper we introduce a more systematic "matched markets" approach that, subject to the advertiser's constraints, greedily searches for experimental group assignments that appear to satisfy some of the critical assumptions of the "Time-Based Regression" (TBR) model for analyzing geo experiments that was introduced in Kerman et al. (2017). If the modeling assumptions of TBR do indeed hold, then the experimental designs that are recommended by our matched markets approach lead to straightforward causal estimates.
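The greedy matching idea described above can be sketched in miniature: repeatedly pair the two geos whose pre-period time series are most similar. This is an assumption-laden simplification (Euclidean distance on raw series, no advertiser constraints, no TBR diagnostics), not the actual matched markets algorithm; the geo names and series are hypothetical:

```python
# Minimal sketch of greedy "matched markets" pairing (a simplification,
# not the actual algorithm from the paper). Each geo has a pre-period
# response time series; we greedily match the two most similar
# remaining geos, then would randomly split each pair across arms.

import math

def distance(a, b):
    """Euclidean distance between two pre-period time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_pairs(series):
    """Greedily match geos: repeatedly take the closest remaining pair."""
    remaining = set(series)
    pairs = []
    while len(remaining) >= 2:
        best = min(
            ((g, h) for g in remaining for h in remaining if g < h),
            key=lambda gh: distance(series[gh[0]], series[gh[1]]),
        )
        pairs.append(best)
        remaining -= set(best)
    return pairs

# Hypothetical weekly pre-period sales for six geos.
series = {
    "geo_a": [10, 11, 12, 11],
    "geo_b": [10, 11, 13, 11],
    "geo_c": [50, 52, 51, 53],
    "geo_d": [49, 52, 50, 54],
    "geo_e": [100, 98, 101, 99],
    "geo_f": [101, 97, 102, 99],
}

print(sorted(greedy_pairs(series)))
# → [('geo_a', 'geo_b'), ('geo_c', 'geo_d'), ('geo_e', 'geo_f')]
```

Because the search is greedy rather than exhaustive, it scales to many geos but offers no optimality guarantee, which mirrors the trade-off of any constrained, systematic design search.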
Random Forests, Decision Trees, and Categorical Predictors: The "Absent Levels" Problem
Journal of Machine Learning Research, 19 (2018), pp. 1-30
One advantage of decision tree based methods like random forests is their ability to natively handle categorical predictors without having to first transform them (e.g., by using feature engineering techniques). However, in this paper, we show how this capability can lead to an inherent "absent levels" problem for decision tree based methods that has never been thoroughly discussed, and whose consequences have never been carefully explored. This problem occurs whenever there is an indeterminacy over how to handle an observation that has reached a categorical split which was determined at a time when that observation's level was absent from the training data. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler's random forests FORTRAN code and the randomForest R package (Liaw and Wiener, 2002) as motivating case studies, we examine how overlooking the absent levels problem can systematically bias a model. Furthermore, by using three real data examples, we illustrate how absent levels can dramatically alter a model's performance in practice, and we empirically demonstrate how some simple heuristics can be used to help mitigate the effects of the absent levels problem until a more robust theoretical solution is found.
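The indeterminacy the abstract describes can be shown with a toy categorical split. The split, levels, and leaf values below are hypothetical (this is not the FORTRAN or randomForest code studied in the paper); the point is only that an unseen level has no defined branch, so the fallback heuristic silently determines the prediction:

```python
# Toy demonstration of the "absent levels" problem (hypothetical split
# and leaf values; not the implementations studied in the paper).
# A trained categorical split sends some levels left and the rest right.
# A level that was absent at this node during training has no defined
# branch, so different fallback heuristics give different predictions.

import random

TRAINED_SPLIT = {"left": {"red", "blue"}, "right": {"green"}}
LEAF_PREDICTION = {"left": 0.9, "right": 0.1}

def route(level, absent_policy):
    """Route an observation through the split. absent_policy decides what
    to do with unseen levels: 'left', 'right', or 'random'."""
    if level in TRAINED_SPLIT["left"]:
        return "left"
    if level in TRAINED_SPLIT["right"]:
        return "right"
    # "yellow" was absent during training: the split defines no branch.
    if absent_policy == "random":
        return random.choice(["left", "right"])
    return absent_policy

# The same unseen level gets opposite predictions under different heuristics.
print(LEAF_PREDICTION[route("yellow", "left")])   # → 0.9
print(LEAF_PREDICTION[route("yellow", "right")])  # → 0.1
```

Because every absent-level observation falls to the same side under a fixed fallback, the resulting errors are systematic rather than random, which is why the choice of heuristic can bias a model rather than merely add noise.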