Aiyou Chen

Aiyou Chen

Aiyou Chen received a B.Sc degree in mathematics from Wuhan University in 1997, M.Sc degree from Peking University in 2000, and a Ph.D. degree in statistics from UC Berkeley in 2004. His Ph.D. dissertation was on semiparametric inference for independent component analysis. He was a Member of Technical Staff at Bell Labs, Murray Hill, NJ, from Aug 2004 to Aug 2010 with research in statistical methodologies and applications mostly with network data. Since then, he works as a Statistician at Google.
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Evaluating the incremental return on ad spend (iROAS) of a prospective online marketing strategy (i.e., the ratio of the strategy’s causal effect on some response metric of interest relative to its causal effect on the ad spend) has become increasingly more important. Although randomized “geo experiments” are frequently employed for this evaluation, obtaining reliable estimates of iROAS can be challenging, as oftentimes only a small number of highly heterogeneous units are used. Moreover, advertisers frequently impose budget constraints on their ad spends which further complicates causal inference by introducing interference between the experimental units. In this paper we formulate a novel statistical framework for inferring the iROAS of online advertising from randomized paired geo experiment, which further motivates and provides new insights into Rosenbaum’s arguments on instrumental variables, and we propose and develop a robust, distribution-free and interpretable estimator “Trimmed Match” as well as a data-driven choice of the tuning parameter which may be of independent interest. We investigate the sensitivity of Trimmed Match to some violations of its assumptions and show that it can be more efficient than some alternative estimators based on simulated data. We then demonstrate its practical utility with real case studies. View details
    Preview abstract How to measure the incremental return on Ad spend (iROAS) is a fundamental problem for the online advertising industry. A standard modern tool is to run randomized geo experiments, where experimental units are non-overlapping ad-targetable geographical areas (Vaver & Koehler 2011). However, how to design a reliable and cost-effective geo experiment can be complicated, for example: 1) the number of geos is often small, 2) the response metric (e.g. revenue) across geos can be very heavy-tailed due to geo heterogeneity, and furthermore 3) the response metric can vary dramatically over time. To address these issues, we propose a robust nonparametric method for the design, called Trimmed Match Design (TMD), which extends the idea of Trimmed Match (Chen & Au 2019) and furthermore integrates the techniques of optimal subset pairing and sample splitting in a novel and systematic manner. Some simulation and real case studies are presented. We also point out a few open problems for future research. View details
    Semiparametric Estimation of Treatment Effects in Randomized Experiments
    Susan Athey
    Peter J. Bickel
    Guido W. Imbens
    Michael Pollmann
    Google LLC(2021)
    Preview abstract We develop new semiparametric methods for estimating treatment effects. We focus on a setting where the outcome distributions may be thick tailed, where treatment effects are small, where sample sizes are large and where assignment is completely random. This setting is of particular interest in recent experimentation in tech companies. We propose using parametric models for the treatment effects, as opposed to parametric models for the full outcome distributions. This leads to semiparametric models for the outcome distributions. We derive the semiparametric efficiency bound for this setting, and propose efficient estimators. In the case with a constant treatment effect one of the proposed estimators has an interesting interpretation as a weighted average of quantile treatment effects, with the weights proportional to (minus) the second derivative of the log of the density of the potential outcomes. Our analysis also results in an extension of Huber’s model and trimmed mean to include asymmetry and a simplified condition on linear combinations of order statistics, which may be of independent interest. View details
    Preview abstract Evaluating the return on ad spend (ROAS), the causal effect of advertising on sales, is critical to advertisers for understanding the performance of their existing marketing strategy as well as how to improve and optimize it. Media Mix Modeling (MMM) has been used as a convenient analytical tool to address the problem using observational data. However it is well recognized that MMM suffers from various fundamental challenges: data collection, model specification and selection bias due to ad targeting, among others (Chan & Perry 2017; Wolfe 2016). In this paper, we study the challenge associated with measuring the impact of search ads in MMM, namely the selection bias due to ad targeting. Using causal diagrams of the search ad environment, we derive a statistically principled method for bias correction based on the back-door criterion (Pearl 2013). We use case studies to show that the method provides promising results by comparison with results from randomized experiments. We also report a more complex case study where the advertiser had spent on more than a dozen media channels but results from a randomized experiment are not available. Both our theory and empirical studies suggest that in some common, practical scenarios, one may be able to obtain an approximately unbiased estimate of search ad ROAS. View details
    Data Enriched Linear Regression
    Art Owen
    Electronic Journal of Statistics, 9(2015), pp. 1078-1112 (to appear)
    Preview abstract We present a linear regression method for predictions on a small data set making use of a second possibly biased data set that may be much larger. Our method fits linear regressions to the two data sets while penalizing the difference between predictions made by those two models. The resulting algorithm is a shrinkage method similar to those used in small area estimation. We find a Stein-type result for Gaussian responses: when the model has 5 or more coefficients and 10 or more error degrees of freedom, it becomes inadmissible to use only the small data set, no matter how large the bias is. We also present both plug-in and AICc-based methods to tune our penalty parameter. Most of our results use an L2 penalty, but we obtain formulas for L1 penalized estimates when the model is specialized to the location setting. Ordinary Stein shrinkage provides an inadmissibility result for only 3 or more coefficients, but we find that our shrinkage method typically produces much lower squared errors in as few as 5 or 10 dimensions when the bias is small and essentially equivalent squared errors when the bias is large. View details
    Data enrichment for incremental reach estimation
    Jim Koehler
    Art Owen
    Nicolas Remy
    Google Inc.(2014), pp. 1-21 (to appear)
    Preview abstract There is increasing interest in measuring the overlap and/or incremental reach of cross-media campaigns. The direct method is to use a cross-media panel but these are expensive to scale across all media. Typically, the cross-media panel is too small to produce reliable estimates when the interest comes down to subsets of the population. An alternative is to combine information from a small cross-media panel with a larger, cheaper but potentially biased single media panel. In this article, we develop a data enrichment approach specifically for incremental reach estimation. The approach not only integrates information from both panels that takes into account potential panel bias, but borrows strength from modeling conditional dependence of cross-media reaches. We demonstrate the approach with data from six campaigns for estimating YouTube video ad incremental reach over TV. In a simulation directly modeled on the actual data, we find that data enrichment yields much greater accuracy than one would get by either ignoring the larger panel, or by using it in a data fusion. View details
    Pseudo-likelihood methods for community detection in large sparse networks
    Arash A Amini
    Peter Bickel
    Liza Levina
    Annals of Statistics(2013), pp. 1-27
    Preview abstract Many algorithms have been proposed for fitting network models with communities but most of them do not scale well to large networks, and often fail on sparse networks. Here we propose a new fast pseudo-likelihood method for fitting the stochastic block model for networks, as well as a variant that allows for an arbitrary degree distribution by conditioning on degrees. We show that the algorithms perform well under a range of settings, including on very sparse networks, and illustrate on the example of a network of political blogs. We also propose spectral clustering with perturbations, a method of independent interest, which works well on sparse networks where regular spectral clustering fails, and use it to provide an initial value for pseudo-likelihood. View details
    Cluster forest
    Donghui Yan
    Michael I Jordan
    Computational Statistics and Data Analysis, 66(2013), pp. 178-192
    Preview abstract With inspiration from Random Forests (RF) in the context of classification, a new clustering ensemble method---Cluster Forests (CF) is proposed. Geometrically, CF randomly probes a high-dimensional data cloud to obtain "good local clusterings" and then aggregates via spectral clustering to obtain cluster assignments for the whole dataset. The search for good local clusterings is guided by a cluster quality measure kappa. CF progressively improves each local clustering in a fashion that resembles the tree growth in RF. Empirical studies on several real-world datasets under two different performance metrics show that CF compares favorably to its competitors. Theoretical analysis reveals that the kappa measure makes it possible to grow the local clustering in a desirable way---it is "noise-resistant". A closed-form expression is obtained for the mis-clustering rate of spectral clustering under a perturbation model, which yields new insights into some aspects of spectral clustering. View details
    Distinct counting with a self-learning bitmap
    Jin Cao
    Larry Shepp
    Tuan Nguyen
    Journal of American Statistical Association, 106(2011), 879–890
    Preview abstract Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest to address the distinct counting problem in a data stream setting, where each incoming data can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature, that only require limited computing and memory resources. However, the performances of these methods are not scale invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is not desirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this article, we develop a novel approach, called self-learning bitmap (S-bitmap) that is scale invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale invariant. We demonstrate that to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and consumes similar or less operations than state-of-the-art methods for many common practice cardinality scales. Both simulation and experimental studies are reported. View details
    Efficient Spectral Neighborhood Blocking for Entity Resolution
    Liangcai Shu
    Ming Xiong
    Weiyi Meng
    International Conference on Data Engineering 2011 (ICDE), IEEE, pp. 1-12
    Preview abstract In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets. View details
    Preview abstract Probability models on graphs are becoming increasingly important in many applications, but statistical tools for fitting such models are not yet well developed. Here we propose a general method of moments approach that can be used to fit a large class of probability models through empirical counts of certain patterns in a graph. We establish some general asymptotic properties of empirical graph moments and prove consistency of the estimates as the graph size grows for all ranges of the average degree including Ω(1). Additional results are obtained for the important special case of degree distributions. View details
    An Algorithm for Fast, Model-Free Tracking Indoors
    Christina Harko
    Diane Lambert
    P. A. Whiting
    ACM SIGMOBILE Mobile Computing and Communications Review(2007)
    Preview
    Network Tomography: Identifiability and Fourier Domain Estimation
    Jin Cao
    Tian Bu
    IEEE Transactions on Signal Processing, 58 (12)(2010), pp. 6029-6039
    Preview abstract The statistical problem for network tomography is to infer the distribution of X, with mutually independent components, from a measurement model Y=AX, where A is a given binary matrix representing the routing topology of a network under consideration. The challenge is that the dimension of X is much larger than that of Y and thus the problem is often ill-posed. This paper studies some statistical aspects of network tomography. We first develop a unifying theory on the identifiability of the distribution of X. We then focus on an important instance of network tomography—network delay tomography, where the problem is to infer internal link delay distributions using end-to-end delay measurements. We propose a novel mixture model for link delays and develop a fast algorithm for estimation based on the General Method of Moments. Through extensive model simulations and real Internet trace driven simulation, the proposed approach is shown to be favorable to previous methods using simple discretization for inferring link delays in a heterogeneous network. View details
    Tracking Cardinality Distributions in Network Traffic
    Erran Li Li
    Jin Cao
    INFOCOM 2009, IEEE
    A Regression Paradox for Linear Models: Sufficient Conditions and Relation to Simpson's Paradox
    Thomas Bengtsson
    Tin Kam Ho
    The American Statistician, 63(2009), pp. 218-225
    Preview abstract An analysis of customer survey data using direct and reverse linear regression leads to inconsistent conclusions with respect to the effect of a group variable. This counterintuitive phenomenon, called the regression paradox, causes seemingly contradictory group effects when the predictor and regressand are interchanged. Using analytical developments as well as geometric arguments, we describe sufficient conditions under which the regression paradox will appear in linear Gaussian models. The results show that the phenomenon depends on a distribution shift between the groups relative to the predictability of the model. As a consequence, the paradox can appear naturally in certain distributions, and may not be caused by sampling error or incorrectly specified models. Simulations verify that the paradox may appear in more general, non-Gaussian settings. An interesting, geometric connection to Simpsons paradox is provided. View details
    A nonparametric view of network models and Newman-Girvan and Other Modularities
    Peter Bickel
    Proceedings of the National Academy of Sciences of USA, 106(2009), pp. 21068-21073
    Preview abstract Prompted by the increasing interest in networks in many fields, we present an attempt at unifying points of view and analyses of these objects coming from the social sciences, statistics, probability and physics communities. We apply our approach to the Newman-Girvan modularity, widely used for "community" detection, among others. Our analysis is asymptotic but we show by simulation and application to real examples that the theory is a reasonable guide to practice. View details
    Identifying High Cardinality Internet Hosts
    Jin Cao
    Jin Yu
    Tian Bu
    Zhili Zhang
    Proceedings of the 28th Conference on Computer Communications, IEEE INFOCOM(2009)
    Detecting subscribers using {NAT}ed devices in wireless data networks
    Erran Li Li
    Tian Bu
    Scott Miller
    Bell Labs Technical Journals(2009), pp. 223-233
    A Quasi-Likelihood Approach for accurate traffic matrix estimation in a high speed network
    Jin Cao
    Proceedings of the 27th Conference on Computer Communications, IEEE INFOCOM(2008)
    A Fast and Compact Method for Unveiling Significant Patterns in High Speed Networks
    Tian Bu
    Jin Cao
    Patrick Lee
    Proceedings of the 26th IEEE International Conference on Computer Communications. IEEE INFOCOM(2007)
    Preview abstract Identification of significant patterns in network traffic, such as IPs or flows that contribute large volume (heavy hitters) or introduce large changes (heavy changers), has many applications in accounting and network anomaly detection. As network speed and the number of flows grow rapidly, tracking per-IP or per-flow statistics becomes infeasible due to both the computational overhead and memory requirements. In this paper, we propose a novel sequential hashing scheme that requires only O(H log N) both in memory and computational overhead that are close to being optimal, where N is the the number of all possible keys (e.g., flows, IPs) and H is the maximum number of heavy keys. Moreover, the generalized sequential hashing scheme makes it possible to trade off among memory, update cost, and detection cost in a large range that can be utilized by different computer architectures for optimizing the overall performance. In addition, we also propose statistically efficient algorithms for estimating the values of heavy hitters and heavy changers. Using both theoretical analysis and experimental studies of Internet traces, we demonstrate that our approach can achieve the same accuracy as the existing methods do but using much less memory and computational overhead. View details
    Network tomography based on 1-D projections
    Jin Cao
    Complex Datasets and Inverse Problems: Tomography, Networks and Beyond (Regina Liu, William Strawderman and Cun-Hui Zhang, eds.)(2007), pp. 45-61
    A Simple and Efficient Estimation Method for Stream Expression Cardinalities
    Jin Cao
    Tian Bu
    Proceedings of the 33rd international conference on Very large data bases (VLDB)(2007)
    Preview abstract Estimating the cardinality (i.e. number of distinct elements) of an arbitrary set expression defined over multiple distributed streams is one of the most fundamental queries of interest. Earlier methods based on probabilistic sketches have focused mostly on the sketching algorithms. However, the estimators do not fully utilize the information in the sketches and thus are not statistically efficient. In this paper, we develop a novel statistical model and an efficient yet simple estimator for the cardinalities based on a continuous variant of the well known Flajolet-Martin sketches. Specifically, we show that, for two streams, our estimator has almost the same statistical efficiency as the Maximum Likelihood Estimator (MLE), which is known to be optimal in the sense of Cramer-Rao lower bounds under regular conditions. Moreover, as the number of streams gets larger, our estimator is still computationally simple, but the MLE becomes intractable due to the complexity of the likelihood. Let N be the cardinality of the union of all streams, and |S| be the cardinality of a set expression S to be estimated. For a given relative standard error δ, the memory requirement of our estimator is O(δ-2|S|-1 N log log N), which is superior to state-of-the-art algorithms, especially for large N and small |S|/N where the estimation is most challenging. View details
    Efficient independent component analysis
    Peter Bickel
    Annals of Statistics, 34 (6)(2006), pp. 2825-2855
    Preview abstract Independent component analysis (ICA) has been widely used for blind source separation in many fields, such as brain imaging analysis, signal processing and telecommunication. Many statistical techniques based on M-estimates have been proposed for estimating the mixing matrix. Recently, several nonparametric methods have been developed, but in-depth analysis of asymptotic efficiency has not been available. We analyze ICA using semiparametric theories and propose a straightforward estimate based on the efficient score function by using B-spline approximations. The estimate is asymptotically efficient under moderate conditions and exhibits better performance than standard ICA methods in a variety of simulations. View details
    Design and Evaluation of a Fast and Robust Worm Detection Algorithm
    Tian Bu
    Scott Vander Wiel
    Thomas Woo
    INFOCOM 2006. 25th IEEE International Conference on Computer Communications. Proceedings
    Fast Kernel Density Independent Component Analysis
    Independent Component Analysis and Blind Signal Separation (Lecture Notes in Computer Science), Springer Berlin / Heidelberg(2006), pp. 24-31
    Preview abstract We develop a super-fast kernel density estimation algorithm (FastKDE) and based on this a fast kernel independent component analysis algorithm (KDICA). FastKDE calculates the kernel density estimator exactly and its computation only requires sorting n numbers plus roughly 2n evaluations of the exponential function, where n is the sample size. KDICA converges as quickly as parametric ICA algorithms such as FastICA. By comparing with state-of-the-art ICA algorithms, simulation studies show that KDICA is promising for practical usages due to its computational efficiency as well as statistical efficiency. Some statistical properties of KDICA are analyzed. View details
    Consistent independent component analysis and prewhitening
    Peter J Bickel
    IEEE Transactions on Signal Processing, 53 (10)(2005), pp. 3625 - 3632
    Preview abstract We study the statistical merits of two techniques used in the literature of independent component analysis (ICA). First, we analyze the characteristic-function based ICA method (CHFICA) and study its statistical properties such as consistency, √n-consistency, and robustness against small additive noise. Second, we study the validity of prewhitening: a preprocessing technique used by many ICA algorithms, as applied to the CHFICA method. In particular, we establish the surprising effectiveness of this technique even when some components have heavy tails and others do not. A fast new algorithm implementing the prewhitened CHFICA method is also provided. View details