Yunting Sun
Authored Publications
Dealing with biased data samples is a common task across many statistical fields. In survey sampling, bias often arises from unrepresentative samples. In causal studies with observational data, the treated and untreated groups usually differ in their covariate distributions. Empirical calibration is a generic weighting method that presents a unified view on correcting or reducing the data biases in both settings. We provide a Python library, EC, to compute the empirical calibration weights. The problem is formulated as a convex optimization and solved efficiently in the dual form. Compared to existing software, EC is both more efficient and more robust. EC also accommodates different optimization objectives, supports weight clipping, and allows inexact calibration, which improves usability. We demonstrate its usage across various experiments with both simulated and real-world data.
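To make the dual formulation concrete, here is a minimal sketch of calibration weights under an entropy objective, solved with a damped Newton iteration on the dual variables. It is not the EC library's API: the function names, the choice of objective, and the solver details are assumptions made for illustration only, and it omits weight clipping and inexact calibration.

```python
import numpy as np

def _weights(lam, z):
    """Normalized weights w_i proportional to exp(lam . z_i)."""
    scores = z @ lam
    w = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return w / w.sum()

def _dual(lam, z):
    """Dual objective: log-sum-exp of lam . z_i, which is convex in lam."""
    scores = z @ lam
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def entropy_calibration_weights(covariates, target_mean, max_iter=100, tol=1e-8):
    """Hypothetical helper: weights summing to 1 whose weighted covariate
    mean matches target_mean, staying as close to uniform as possible
    in the entropy sense."""
    x = np.asarray(covariates, dtype=float)
    z = x - np.asarray(target_mean, dtype=float)  # per-row imbalance
    lam = np.zeros(z.shape[1])
    for _ in range(max_iter):
        w = _weights(lam, z)
        grad = w @ z  # weighted mean imbalance; zero at the solution
        if np.max(np.abs(grad)) < tol:
            break
        centered = z - grad
        hess = (centered * w[:, None]).T @ centered  # weighted covariance
        step = np.linalg.solve(hess, grad)
        # Backtracking line search keeps the Newton step from overshooting.
        f0, t = _dual(lam, z), 1.0
        while (_dual(lam - t * step, z) > f0 - 1e-4 * t * (grad @ step) + 1e-12
               and t > 1e-8):
            t *= 0.5
        lam = lam - t * step
    return _weights(lam, z)

# Illustrative usage on simulated data (values are made up).
rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 2))
weights = entropy_calibration_weights(sample, target_mean=[0.3, -0.2])
```

The optimization runs over one dual variable per covariate rather than one weight per sample, which is why the dual form stays cheap as the sample grows.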
Bias Correction For Paid Search In Media Mix Modeling
Jim Koehler
Mike Perry
research.google.com, Google Inc. (2018)
Evaluating the return on ad spend (ROAS), the causal effect of advertising
on sales, is critical to advertisers for understanding the performance of their
existing marketing strategy as well as how to improve and optimize it. Media
Mix Modeling (MMM) has been used as a convenient analytical tool to address
the problem using observational data. However, it is well recognized that MMM
suffers from various fundamental challenges: data collection, model specification
and selection bias due to ad targeting, among others (Chan & Perry 2017; Wolfe
2016).
In this paper, we study the challenge associated with measuring the impact
of search ads in MMM, namely the selection bias due to ad targeting. Using
causal diagrams of the search ad environment, we derive a statistically principled
method for bias correction based on the back-door criterion (Pearl 2013).
We use case studies to show that the method provides promising results by
comparison with results from randomized experiments. We also report a more
complex case study where the advertiser had spent on more than a dozen media
channels but results from a randomized experiment are not available. Both our
theory and empirical studies suggest that in some common, practical scenarios,
one may be able to obtain an approximately unbiased estimate of search ad
ROAS.
Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects
Jim Koehler
research.google.com, Google Inc., 76 Ninth Avenue, Google New York, NY 10011 (2017)
Media mix models are used by advertisers to measure the effectiveness of their advertising and to inform future budget allocation decisions. Advertising usually has lag effects and diminishing returns, which are hard to capture using linear regression. In this paper, we propose a media mix model with flexible functional forms to model the carryover and shape effects of advertising. The model is estimated using a Bayesian approach in order to make use of prior knowledge accumulated in previous or related media mix models. We illustrate how to calculate attribution metrics such as ROAS and mROAS from posterior samples on simulated data sets. Simulation studies show that the model can be estimated very well for large data sets, but prior distributions have a big impact on the posteriors when the sample size is small and may lead to biased estimates. We apply the model to data from a shampoo advertiser, and use the Bayesian Information Criterion (BIC) to choose the appropriate specification of the functional forms for the carryover and shape effects. We further illustrate that the optimal media mix based on the model has a large variance due to the variance of the parameter estimates.
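As a rough illustration of what carryover and shape transformations can look like, the sketch below uses a geometric adstock for the lag effect and a Hill curve for diminishing returns. These are common choices in the media mix literature, but the abstract does not specify the functional forms, so the function names, parameters, and normalization here are assumptions rather than the paper's exact specification.

```python
import numpy as np

def geometric_adstock(spend, decay, max_lag=8):
    """Carryover: weighted average of current and past spend, with weights
    decay**lag. Spend before the start of the series is treated as zero."""
    spend = np.asarray(spend, dtype=float)
    weights = decay ** np.arange(max_lag)  # lag 0 first
    out = np.zeros_like(spend)
    for t in range(len(spend)):
        lags = min(t + 1, max_lag)
        window = spend[t - lags + 1 : t + 1][::-1]  # most recent period first
        out[t] = window @ weights[:lags] / weights.sum()
    return out

def hill(x, half_saturation, slope):
    """Shape: saturating response in [0, 1) that equals 0.5 when
    x == half_saturation."""
    x = np.asarray(x, dtype=float)
    return x**slope / (x**slope + half_saturation**slope)

# Illustrative usage: a single burst of spend decays over later periods,
# then passes through the saturation curve (all numbers are made up).
spend = [100.0, 0.0, 0.0, 0.0, 0.0]
carried_over = geometric_adstock(spend, decay=0.5, max_lag=3)
response = hill(carried_over, half_saturation=50.0, slope=2.0)
```

Applying the carryover step before the shape step is one possible ordering; the reverse is also plausible, and which fits better depends on the spend pattern.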
One of the major problems in developing media mix models is that the data generally available to the modeler lacks sufficient quantity and information content to reliably estimate the parameters in a model of even moderate complexity. Pooling data from different brands within the same product category provides more observations and greater variability in media spend patterns. Depending on the data-sharing restrictions across brands, we either directly use the results from a hierarchical Bayesian model built on the category dataset, or pass the information learned from the category model to a brand-specific media mix model via informative priors within a Bayesian framework. We demonstrate using both simulation and real case studies that our category analysis can improve parameter estimation and reduce uncertainty of model prediction and extrapolation.
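The effect of an informative prior can be seen in a deliberately simplified setting: a single coefficient, a normal prior summarizing what a category model learned, and a normal approximation to the brand-level estimate. The function name and all numbers below are hypothetical, and a real media mix model would involve many parameters and a sampler rather than this closed form.

```python
def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Conjugate normal update: combine a normal prior with a normally
    distributed estimate by weighting each with its precision (1 / variance)."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / std_error**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * estimate) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd

# Hypothetical numbers: the category prior is fairly tight, while the
# brand's own data gives a noisy estimate of the same coefficient.
post_mean, post_sd = normal_posterior(
    prior_mean=1.0, prior_sd=0.5, estimate=2.0, std_error=0.5
)
```

The posterior standard deviation is always smaller than both the prior standard deviation and the brand-only standard error, which is the sense in which category information reduces uncertainty in this toy setting.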
Media mix modeling is a statistical analysis of historical data to measure the return on investment (ROI) on advertising and other marketing activities. Current practice usually utilizes data aggregated at a national level, which often suffers from small sample size and insufficient variation in the media spend. When sub-national data is available, we propose a geo-level Bayesian hierarchical media mix model (GBHMMM), and demonstrate that the method generally provides estimates with tighter credible intervals compared to a model with national-level data alone. This reduction in error is due to having more observations and useful variability in media spend, which can protect advertisers from unsound reallocation decisions. Under some weak conditions, the geo-level model can reduce the ad targeting bias. When geo-level data is not available for all the media channels, the geo-level model estimates generally deteriorate as more media variables are imputed using the national-level data.
Many empirical micro-economics studies rely on consumer panels. For example, TV and web metering panels track TV and online usage of individuals. Sometimes more than one panel is available, although these panels use different metering technologies and are subject to varying degrees of missingness. The problem we consider here is how to combine imputation based on two panels that have similar but not identical statistical characteristics. In the US, we have two two-screen panels, panel A (TV + desktop) and panel B (desktop + mobile), which are both calibrated to the US internet population. We want to estimate a count of ad impressions across all three screens. As desktop impressions are metered in both panels, we fit a joint imputation model by pooling observed desktop impression counts across panels. After imputation on panel B, we fit a truncated negative binomial hurdle regression of mobile impression count on desktop impression count, demographic information, etc. Then, for each panelist in panel A, we predict their mobile impression counts. In this way, we 'impute' mobile impressions in panel A to facilitate three-screen measurement.
Through a detailed analysis of logs of activity for all Google employees, this paper shows how the Google Docs suite (documents, spreadsheets and slides) enables and increases collaboration within Google. In particular, visualization and analysis of the evolution of Google’s collaboration network show that new employees have started collaborating more quickly and with more people as usage of Docs has grown. Over the last two years, the percentage of new employees who collaborate on Docs per month has risen from 70% to 90%, and the percentage who collaborate with more than two people has doubled from 35% to 70%. Moreover, the culture of collaboration has become more open, with public sharing within Google overtaking private sharing.