Jump to Content
Aiyou Chen

Aiyou Chen

Aiyou Chen received a B.Sc degree in mathematics from Wuhan University in 1997, M.Sc degree from Peking University in 2000, and a Ph.D. degree in statistics from UC Berkeley in 2004. His Ph.D. dissertation was on semiparametric inference for independent component analysis. He was a Member of Technical Staff at Bell Labs, Murray Hill, NJ, from Aug 2004 to Aug 2010 with research in statistical methodologies and applications mostly with network data. Since then, he works as a Statistician at Google.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Evaluating the incremental return on ad spend (iROAS) of a prospective online marketing strategy (i.e., the ratio of the strategy’s causal effect on some response metric of interest relative to its causal effect on the ad spend) has become increasingly more important. Although randomized “geo experiments” are frequently employed for this evaluation, obtaining reliable estimates of iROAS can be challenging, as oftentimes only a small number of highly heterogeneous units are used. Moreover, advertisers frequently impose budget constraints on their ad spends which further complicates causal inference by introducing interference between the experimental units. In this paper we formulate a novel statistical framework for inferring the iROAS of online advertising from randomized paired geo experiment, which further motivates and provides new insights into Rosenbaum’s arguments on instrumental variables, and we propose and develop a robust, distribution-free and interpretable estimator “Trimmed Match” as well as a data-driven choice of the tuning parameter which may be of independent interest. We investigate the sensitivity of Trimmed Match to some violations of its assumptions and show that it can be more efficient than some alternative estimators based on simulated data. We then demonstrate its practical utility with real case studies. View details
    Preview abstract How to measure the incremental return on Ad spend (iROAS) is a fundamental problem for the online advertising industry. A standard modern tool is to run randomized geo experiments, where experimental units are non-overlapping ad-targetable geographical areas (Vaver & Koehler 2011). However, how to design a reliable and cost-effective geo experiment can be complicated, for example: 1) the number of geos is often small, 2) the response metric (e.g. revenue) across geos can be very heavy-tailed due to geo heterogeneity, and furthermore 3) the response metric can vary dramatically over time. To address these issues, we propose a robust nonparametric method for the design, called Trimmed Match Design (TMD), which extends the idea of Trimmed Match (Chen & Au 2019) and furthermore integrates the techniques of optimal subset pairing and sample splitting in a novel and systematic manner. Some simulation and real case studies are presented. We also point out a few open problems for future research. View details
    Semiparametric Estimation of Treatment Effects in Randomized Experiments
    Susan Athey
    Peter J. Bickel
    Guido W. Imbens
    Michael Pollmann
    Google LLC (2021)
    Preview abstract We develop new semiparametric methods for estimating treatment effects. We focus on a setting where the outcome distributions may be thick tailed, where treatment effects are small, where sample sizes are large and where assignment is completely random. This setting is of particular interest in recent experimentation in tech companies. We propose using parametric models for the treatment effects, as opposed to parametric models for the full outcome distributions. This leads to semiparametric models for the outcome distributions. We derive the semiparametric efficiency bound for this setting, and propose efficient estimators. In the case with a constant treatment effect one of the proposed estimators has an interesting interpretation as a weighted average of quantile treatment effects, with the weights proportional to (minus) the second derivative of the log of the density of the potential outcomes. Our analysis also results in an extension of Huber’s model and trimmed mean to include asymmetry and a simplified condition on linear combinations of order statistics, which may be of independent interest. View details
    Preview abstract Evaluating the return on ad spend (ROAS), the causal effect of advertising on sales, is critical to advertisers for understanding the performance of their existing marketing strategy as well as how to improve and optimize it. Media Mix Modeling (MMM) has been used as a convenient analytical tool to address the problem using observational data. However it is well recognized that MMM suffers from various fundamental challenges: data collection, model specification and selection bias due to ad targeting, among others (Chan & Perry 2017; Wolfe 2016). In this paper, we study the challenge associated with measuring the impact of search ads in MMM, namely the selection bias due to ad targeting. Using causal diagrams of the search ad environment, we derive a statistically principled method for bias correction based on the back-door criterion (Pearl 2013). We use case studies to show that the method provides promising results by comparison with results from randomized experiments. We also report a more complex case study where the advertiser had spent on more than a dozen media channels but results from a randomized experiment are not available. Both our theory and empirical studies suggest that in some common, practical scenarios, one may be able to obtain an approximately unbiased estimate of search ad ROAS. View details
    Data Enriched Linear Regression
    Art Owen
    Electronic Journal of Statistics, vol. 9 (2015), pp. 1078-1112 (to appear)
    Preview abstract We present a linear regression method for predictions on a small data set making use of a second possibly biased data set that may be much larger. Our method fits linear regressions to the two data sets while penalizing the difference between predictions made by those two models. The resulting algorithm is a shrinkage method similar to those used in small area estimation. We find a Stein-type result for Gaussian responses: when the model has 5 or more coefficients and 10 or more error degrees of freedom, it becomes inadmissible to use only the small data set, no matter how large the bias is. We also present both plug-in and AICc-based methods to tune our penalty parameter. Most of our results use an L2 penalty, but we obtain formulas for L1 penalized estimates when the model is specialized to the location setting. Ordinary Stein shrinkage provides an inadmissibility result for only 3 or more coefficients, but we find that our shrinkage method typically produces much lower squared errors in as few as 5 or 10 dimensions when the bias is small and essentially equivalent squared errors when the bias is large. View details
    Preview abstract There is increasing interest in measuring the overlap and/or incremental reach of cross-media campaigns. The direct method is to use a cross-media panel but these are expensive to scale across all media. Typically, the cross-media panel is too small to produce reliable estimates when the interest comes down to subsets of the population. An alternative is to combine information from a small cross-media panel with a larger, cheaper but potentially biased single media panel. In this article, we develop a data enrichment approach specifically for incremental reach estimation. The approach not only integrates information from both panels that takes into account potential panel bias, but borrows strength from modeling conditional dependence of cross-media reaches. We demonstrate the approach with data from six campaigns for estimating YouTube video ad incremental reach over TV. In a simulation directly modeled on the actual data, we find that data enrichment yields much greater accuracy than one would get by either ignoring the larger panel, or by using it in a data fusion. View details
    Pseudo-likelihood methods for community detection in large sparse networks
    Arash A Amini
    Peter Bickel
    Liza Levina
    Annals of Statistics (2013), pp. 1-27
    Preview abstract Many algorithms have been proposed for fitting network models with communities but most of them do not scale well to large networks, and often fail on sparse networks. Here we propose a new fast pseudo-likelihood method for fitting the stochastic block model for networks, as well as a variant that allows for an arbitrary degree distribution by conditioning on degrees. We show that the algorithms perform well under a range of settings, including on very sparse networks, and illustrate on the example of a network of political blogs. We also propose spectral clustering with perturbations, a method of independent interest, which works well on sparse networks where regular spectral clustering fails, and use it to provide an initial value for pseudo-likelihood. View details
    Cluster forest
    Donghui Yan
    Michael I Jordan
    Computational Statistics and Data Analysis, vol. 66 (2013), pp. 178-192
    Preview abstract With inspiration from Random Forests (RF) in the context of classification, a new clustering ensemble method---Cluster Forests (CF) is proposed. Geometrically, CF randomly probes a high-dimensional data cloud to obtain "good local clusterings" and then aggregates via spectral clustering to obtain cluster assignments for the whole dataset. The search for good local clusterings is guided by a cluster quality measure kappa. CF progressively improves each local clustering in a fashion that resembles the tree growth in RF. Empirical studies on several real-world datasets under two different performance metrics show that CF compares favorably to its competitors. Theoretical analysis reveals that the kappa measure makes it possible to grow the local clustering in a desirable way---it is "noise-resistant". A closed-form expression is obtained for the mis-clustering rate of spectral clustering under a perturbation model, which yields new insights into some aspects of spectral clustering. View details
    Preview abstract Probability models on graphs are becoming increasingly important in many applications, but statistical tools for fitting such models are not yet well developed. Here we propose a general method of moments approach that can be used to fit a large class of probability models through empirical counts of certain patterns in a graph. We establish some general asymptotic properties of empirical graph moments and prove consistency of the estimates as the graph size grows for all ranges of the average degree including Ω(1). Additional results are obtained for the important special case of degree distributions. View details
    Efficient Spectral Neighborhood Blocking for Entity Resolution
    Liangcai Shu
    Ming Xiong
    Weiyi Meng
    International Conference on Data Engineering 2011 (ICDE), IEEE, pp. 1-12
    Preview abstract In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets. View details
    Distinct counting with a self-learning bitmap
    Jin Cao
    Larry Shepp
    Tuan Nguyen
    Journal of American Statistical Association, vol. 106 (2011), 879–890
    Preview abstract Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest to address the distinct counting problem in a data stream setting, where each incoming data can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature, that only require limited computing and memory resources. However, the performances of these methods are not scale invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is not desirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this article, we develop a novel approach, called self-learning bitmap (S-bitmap) that is scale invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale invariant. We demonstrate that to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and consumes similar or less operations than state-of-the-art methods for many common practice cardinality scales. Both simulation and experimental studies are reported. View details
    An Algorithm for Fast, Model-Free Tracking Indoors
    Christina Harko
    Diane Lambert
    P. A. Whiting
    ACM SIGMOBILE Mobile Computing and Communications Review (2007)
    Preview
    Network Tomography: Identifiability and Fourier Domain Estimation
    Jin Cao
    Tian Bu
    IEEE Transactions on Signal Processing, vol. 58 (12) (2010), pp. 6029-6039
    A Regression Paradox for Linear Models: Sufficient Conditions and Relation to Simpson's Paradox
    Thomas Bengtsson
    Tin Kam Ho
    The American Statistician, vol. 63 (2009), pp. 218-225
    Tracking Cardinality Distributions in Network Traffic
    Erran Li Li
    Jin Cao
    INFOCOM 2009, IEEE
    Identifying High Cardinality Internet Hosts
    Jin Cao
    Jin Yu
    Tian Bu
    Zhili Zhang
    Proceedings of the 28th Conference on Computer Communications, IEEE INFOCOM (2009)
    A nonparametric view of network models and Newman-Girvan and Other Modularities
    Peter Bickel
    Proceedings of the National Academy of Sciences of USA, vol. 106 (2009), pp. 21068-21073
    Detecting subscribers using NATed devices in wireless data networks
    Erran Li Li
    Tian Bu
    Scott Miller
    Bell Labs Technical Journals (2009), pp. 223-233
    A Quasi-Likelihood Approach for accurate traffic matrix estimation in a high speed network
    Jin Cao
    Proceedings of the 27th Conference on Computer Communications, IEEE INFOCOM (2008)
    A Fast and Compact Method for Unveiling Significant Patterns in High Speed Networks
    Tian Bu
    Jin Cao
    Patrick Lee
    Proceedings of the 26th IEEE International Conference on Computer Communications. IEEE INFOCOM (2007)
    A Simple and Efficient Estimation Method for Stream Expression Cardinalities
    Jin Cao
    Tian Bu
    Proceedings of the 33rd international conference on Very large data bases (VLDB) (2007)
    Network tomography based on 1-D projections
    Jin Cao
    Complex Datasets and Inverse Problems: Tomography, Networks and Beyond (Regina Liu, William Strawderman and Cun-Hui Zhang, eds.) (2007), pp. 45-61
    Design and Evaluation of a Fast and Robust Worm Detection Algorithm
    Tian Bu
    Scott Vander Wiel
    Thomas Woo
    INFOCOM 2006. 25th IEEE International Conference on Computer Communications. Proceedings
    Efficient independent component analysis
    Peter Bickel
    Annals of Statistics, vol. 34 (6) (2006), pp. 2825-2855
    Fast Kernel Density Independent Component Analysis
    Independent Component Analysis and Blind Signal Separation (Lecture Notes in Computer Science), Springer Berlin / Heidelberg (2006), pp. 24-31
    Consistent independent component analysis and prewhitening
    Peter J Bickel
    IEEE Transactions on Signal Processing, vol. 53 (10) (2005), pp. 3625 - 3632