# Aiyou Chen

Aiyou Chen received a B.Sc. degree in mathematics from Wuhan University in 1997, an M.Sc. degree from Peking University in 2000, and a Ph.D. degree in statistics from UC Berkeley in 2004. His Ph.D. dissertation was on semiparametric inference for independent component analysis. From August 2004 to August 2010 he was a Member of Technical Staff at Bell Labs, Murray Hill, NJ, where his research focused on statistical methodologies and applications, mostly involving network data. Since then, he has worked as a Statistician at Google.

## Authored Publications


Bias Correction For Paid Search In Media Mix Modeling

Jim Koehler

Mike Perry

research.google.com, Google Inc. (2018)

Evaluating the return on ad spend (ROAS), the causal effect of advertising on sales, is critical to advertisers for understanding the performance of their existing marketing strategy as well as how to improve and optimize it. Media Mix Modeling (MMM) has been used as a convenient analytical tool to address the problem using observational data. However, it is well recognized that MMM suffers from various fundamental challenges: data collection, model specification, and selection bias due to ad targeting, among others (Chan & Perry 2017; Wolfe 2016). In this paper, we study the challenge associated with measuring the impact of search ads in MMM, namely the selection bias due to ad targeting. Using causal diagrams of the search ad environment, we derive a statistically principled method for bias correction based on the back-door criterion (Pearl 2013). We use case studies to show that the method provides promising results by comparison with results from randomized experiments. We also report a more complex case study where the advertiser had spent on more than a dozen media channels but results from a randomized experiment are not available. Both our theory and empirical studies suggest that in some common, practical scenarios, one may be able to obtain an approximately unbiased estimate of search ad ROAS.
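The back-door adjustment the abstract refers to can be illustrated with a toy simulation. The variable names and data-generating process below are hypothetical, not the paper's: a targeting signal (query volume) drives both ad spend and sales, and conditioning on it removes the selection bias from a regression-based ROAS estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_roas = 2.0

# Hypothetical data-generating process: query volume q drives both
# ad spend (targeting) and sales (organic demand), confounding ROAS.
q = rng.normal(size=n)                        # targeting signal (confounder)
spend = 1.5 * q + rng.normal(size=n)          # targeted ad spend
sales = true_roas * spend + 3.0 * q + rng.normal(size=n)

# Naive regression of sales on spend: inflated by targeting bias.
X_naive = np.column_stack([np.ones(n), spend])
naive = np.linalg.lstsq(X_naive, sales, rcond=None)[0][1]

# Back-door adjustment: condition on the confounder as well.
X_adj = np.column_stack([np.ones(n), spend, q])
adjusted = np.linalg.lstsq(X_adj, sales, rcond=None)[0][1]
```

Here `naive` overstates the true ROAS, while `adjusted` recovers it, which is the essence of blocking the back-door path through the targeting signal.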

Data Enriched Linear Regression

Art Owen

Electronic Journal of Statistics, 9 (2015), pp. 1078-1112

We present a linear regression method for predictions on a small data set making use of a second possibly biased data set that may be much larger. Our method fits linear regressions to the two data sets while penalizing the difference between predictions made by those two models. The resulting algorithm is a shrinkage method similar to those used in small area estimation. We find a Stein-type result for Gaussian responses: when the model has 5 or more coefficients and 10 or more error degrees of freedom, it becomes inadmissible to use only the small data set, no matter how large the bias is. We also present both plug-in and AICc-based methods to tune our penalty parameter. Most of our results use an L2 penalty, but we obtain formulas for L1 penalized estimates when the model is specialized to the location setting. Ordinary Stein shrinkage provides an inadmissibility result for only 3 or more coefficients, but we find that our shrinkage method typically produces much lower squared errors in as few as 5 or 10 dimensions when the bias is small and essentially equivalent squared errors when the bias is large.
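A minimal sketch of the penalized joint fit described in the abstract, assuming an L2 penalty on the parameter difference rather than on the prediction difference, with a fixed penalty weight (the paper's plug-in and AICc tuning are not implemented here):

```python
import numpy as np

def data_enriched_lm(Xs, ys, Xb, yb, lam):
    """Joint fit: small-data model beta, big-data model beta + gamma,
    with penalty lam * ||gamma||^2 shrinking the difference to zero.
    A toy variant of the paper's criterion (which penalizes prediction
    differences and tunes lam by plug-in or AICc)."""
    n_s, p = Xs.shape
    top = np.hstack([Xs, np.zeros((n_s, p))])   # small data: beta only
    mid = np.hstack([Xb, Xb])                   # big data: beta + gamma
    pen = np.hstack([np.zeros((p, p)), np.sqrt(lam) * np.eye(p)])
    A = np.vstack([top, mid, pen])
    b = np.concatenate([ys, yb, np.zeros(p)])
    theta = np.linalg.lstsq(A, b, rcond=None)[0]
    return theta[:p], theta[p:]                 # beta_hat, gamma_hat

rng = np.random.default_rng(1)
p, n_s, n_b = 3, 30, 3000
beta = np.array([1.0, -2.0, 0.5])
bias = np.array([0.3, 0.0, -0.2])               # bias of the big data set
Xs, Xb = rng.normal(size=(n_s, p)), rng.normal(size=(n_b, p))
ys = Xs @ beta + 0.1 * rng.normal(size=n_s)
yb = Xb @ (beta + bias) + 0.1 * rng.normal(size=n_b)

beta_hat, gamma_hat = data_enriched_lm(Xs, ys, Xb, yb, lam=10.0)
```

The shrinkage borrows strength from the large biased set while keeping `beta_hat` anchored near the small unbiased data, mirroring the small-area-estimation flavor noted in the abstract.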

Cluster forests

Donghui Yan

Michael I Jordan

Computational Statistics and Data Analysis, 66 (2013), pp. 178-192

With inspiration from Random Forests (RF) in the context of classification, a new clustering ensemble method, Cluster Forests (CF), is proposed. Geometrically, CF randomly probes a high-dimensional data cloud to obtain "good local clusterings" and then aggregates via spectral clustering to obtain cluster assignments for the whole dataset. The search for good local clusterings is guided by a cluster quality measure kappa. CF progressively improves each local clustering in a fashion that resembles tree growth in RF. Empirical studies on several real-world datasets under two different performance metrics show that CF compares favorably to its competitors. Theoretical analysis reveals that the kappa measure makes it possible to grow the local clustering in a desirable way: it is "noise-resistant". A closed-form expression is obtained for the mis-clustering rate of spectral clustering under a perturbation model, which yields new insights into some aspects of spectral clustering.

Distinct counting with a self-learning bitmap

Jin Cao

Larry Shepp

Tuan Nguyen

Journal of the American Statistical Association, 106 (2011), pp. 879-890

Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest in addressing the distinct counting problem in a data stream setting, where each incoming data item can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature that require only limited computing and memory resources. However, the performance of these methods is not scale invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is not desirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this article, we develop a novel approach, called the self-learning bitmap (S-bitmap), that is scale invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale invariant. We demonstrate that to achieve a small RRMSE value of ε or less, our approach requires significantly less memory and similar or fewer operations than state-of-the-art methods for many common practical cardinality scales. Both simulation and experimental studies are reported.
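The adaptive-sampling idea can be sketched in a much-simplified form. The real S-bitmap derives its sampling-rate schedule from a target RRMSE to achieve scale invariance; the geometric schedule `p_k = q**k` below is an illustrative stand-in that keeps the estimator approximately unbiased but not scale invariant.

```python
import hashlib

def _hash01(item, salt):
    """Deterministic uniform(0,1) hash of an item."""
    d = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") / 2**64

class SBitmap:
    """Simplified self-learning bitmap sketch (toy variant)."""
    def __init__(self, m=4096, q=0.999):
        self.m, self.q = m, q
        self.bits = [0] * m
        self.ones = 0

    def add(self, item):
        # Bucket index and sampling coin are both derived from the item,
        # so duplicates make identical (idempotent) decisions.
        b = int(_hash01(item, "bucket") * self.m)
        if self.bits[b] == 0 and _hash01(item, "coin") < self.q ** self.ones:
            self.bits[b] = 1
            self.ones += 1

    def estimate(self):
        # With k bits set, a new distinct item flips a bit with
        # probability (1 - k/m) * p_k, so the expected number of distinct
        # items consumed per flip is its inverse; summing over the flips
        # observed gives a cardinality estimate.
        return sum(1.0 / ((1 - k / self.m) * self.q ** k)
                   for k in range(self.ones))

sb = SBitmap()
for i in range(10_000):
    sb.add(f"user-{i}")
    sb.add(f"user-{i}")          # duplicates do not perturb the sketch
est = sb.estimate()
```

The key property mirrored here is that the sampling rate decreases as the bitmap fills, so far fewer than 10,000 bits are consumed to count 10,000 distinct items.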

Efficient Spectral Neighborhood Blocking for Entity Resolution

Liangcai Shu

Ming Xiong

Weiyi Meng

International Conference on Data Engineering 2011 (ICDE), IEEE, pp. 1-12

In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when no unique identifier exists across the data sources to represent a real-world entity. Entity resolution aims to identify and discover objects in the data sets that refer to the same real-world entity. We investigate the entity resolution problem for large data sets, where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects of our approach: 1) we develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) we utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while remaining efficient and scalable for large data sets.

An Algorithm for Fast, Model-Free Tracking Indoors

Christina Harko

Diane Lambert

P. A. Whiting

ACM SIGMOBILE Mobile Computing and Communications Review (2007)

Network Tomography: Identifiability and Fourier Domain Estimation

The statistical problem for network tomography is to infer the distribution of X, with mutually independent components, from the measurement model Y = AX, where A is a given binary matrix representing the routing topology of the network under consideration. The challenge is that the dimension of X is much larger than that of Y, so the problem is often ill-posed. This paper studies some statistical aspects of network tomography. We first develop a unifying theory on the identifiability of the distribution of X. We then focus on an important instance of network tomography: network delay tomography, where the problem is to infer internal link delay distributions using end-to-end delay measurements. We propose a novel mixture model for link delays and develop a fast estimation algorithm based on the generalized method of moments. Through extensive model simulations and real Internet-trace-driven simulation, the proposed approach is shown to compare favorably with previous methods that use simple discretization for inferring link delays in a heterogeneous network.
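The measurement model Y = AX and its ill-posedness can be illustrated on a two-receiver tree. This is a toy example with hypothetical exponential link delays, far simpler than the paper's mixture model and moment estimator: the mean equations alone are underdetermined, yet a second-order moment of the path delays already identifies the shared link's variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Two-leaf tree: shared link 0 from the root, then links 1 and 2 down
# to the two receivers. Rows of A are the two end-to-end paths.
A = np.array([[1, 1, 0],
              [1, 0, 1]])

# Hypothetical link delays: mutually independent exponentials.
X = rng.exponential(scale=[2.0, 1.0, 3.0], size=(n, 3))
Y = X @ A.T                        # end-to-end delay measurements

# Ill-posedness: 3 unknown link distributions, only 2 path measurements.
assert np.linalg.matrix_rank(A) == 2

# Yet some moments are identifiable: the covariance of the two path
# delays equals the variance of the shared link (here 2.0**2 = 4).
shared_var = np.cov(Y[:, 0], Y[:, 1])[0, 1]
```

Higher-order cross-moments of the path delays identify more of the link delay distributions, which is the intuition behind moment-based tomography.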

Tracking Cardinality Distributions in Network Traffic

A Regression Paradox for Linear Models: Sufficient Conditions and Relation to Simpson's Paradox

An analysis of customer survey data using direct and reverse linear regression leads to inconsistent conclusions with respect to the effect of a group variable. This counterintuitive phenomenon, called the regression paradox, causes seemingly contradictory group effects when the predictor and regressand are interchanged. Using analytical developments as well as geometric arguments, we describe sufficient conditions under which the regression paradox appears in linear Gaussian models. The results show that the phenomenon depends on a distribution shift between the groups relative to the predictability of the model. As a consequence, the paradox can appear naturally in certain distributions and need not be caused by sampling error or incorrectly specified models. Simulations verify that the paradox may appear in more general, non-Gaussian settings. An interesting geometric connection to Simpson's paradox is provided.

A nonparametric view of network models and Newman-Girvan and other modularities

Peter Bickel

Proceedings of the National Academy of Sciences of the USA, 106 (2009), pp. 21068-21073

Prompted by the increasing interest in networks in many fields, we present an attempt at unifying points of view and analyses of these objects coming from the social sciences, statistics, probability and physics communities. We apply our approach to the Newman-Girvan modularity, widely used for "community" detection, among others. Our analysis is asymptotic but we show by simulation and application to real examples that the theory is a reasonable guide to practice.

Identifying High Cardinality Internet Hosts

Jin Cao

Jin Yu

Tian Bu

Zhili Zhang

Proceedings of the 28th Conference on Computer Communications, IEEE INFOCOM (2009)

Detecting subscribers using NATed devices in wireless data networks

A Quasi-Likelihood Approach for accurate traffic matrix estimation in a high speed network

Jin Cao

Proceedings of the 27th Conference on Computer Communications, IEEE INFOCOM (2008)

A Fast and Compact Method for Unveiling Significant Patterns in High Speed Networks

Tian Bu

Jin Cao

Patrick Lee

Proceedings of the 26th IEEE International Conference on Computer Communications, IEEE INFOCOM (2007)

Identification of significant patterns in network traffic, such as IPs or flows that contribute large volume (heavy hitters) or introduce large changes (heavy changers), has many applications in accounting and network anomaly detection. As network speed and the number of flows grow rapidly, tracking per-IP or per-flow statistics becomes infeasible due to both the computational overhead and memory requirements. In this paper, we propose a novel sequential hashing scheme whose memory and computational overhead are both O(H log N), close to optimal, where N is the number of all possible keys (e.g., flows, IPs) and H is the maximum number of heavy keys. Moreover, the generalized sequential hashing scheme makes it possible to trade off among memory, update cost, and detection cost over a large range, which can be exploited by different computer architectures to optimize the overall performance. In addition, we propose statistically efficient algorithms for estimating the values of heavy hitters and heavy changers. Using both theoretical analysis and experimental studies of Internet traces, we demonstrate that our approach achieves the same accuracy as existing methods while using much less memory and computational overhead.
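A much-simplified cousin of a prefix-based sequential hashing scheme can be sketched as follows: one small hashed counter array per bit-prefix level, with detection walking the prefix tree and expanding only heavy prefixes. Collision handling, multiple hash rows, and the paper's precise memory/update/detection trade-off are all omitted.

```python
import hashlib

W = 16                          # key space: N = 2**W possible keys

def _h(level, prefix, width):
    """Deterministic bucket index for a (level, prefix) pair."""
    d = hashlib.blake2b(f"{level}:{prefix}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % width

class PrefixSketch:
    """Toy sequential-hashing sketch: one counter array per prefix level.
    Detection descends the bit-prefix tree, expanding only prefixes whose
    bucket exceeds the threshold, so heavy keys are found without
    enumerating all 2**W keys."""
    def __init__(self, width=512):
        self.width = width
        self.levels = [[0] * width for _ in range(W + 1)]

    def add(self, key, count=1):
        # Update one hashed bucket per prefix length.
        for lvl in range(W + 1):
            prefix = key >> (W - lvl)
            self.levels[lvl][_h(lvl, prefix, self.width)] += count

    def heavy_keys(self, threshold):
        candidates = [0]                     # the empty prefix
        for lvl in range(1, W + 1):
            nxt = []
            for p in candidates:
                for child in (2 * p, 2 * p + 1):
                    if self.levels[lvl][_h(lvl, child, self.width)] >= threshold:
                        nxt.append(child)
            candidates = nxt
        return candidates

sk = PrefixSketch()
for k in (1234, 40000):
    sk.add(k, 1000)                          # two heavy hitters
for k in range(3000, 3200):
    sk.add(k, 1)                             # light background traffic
found = sk.heavy_keys(threshold=500)
```

Every heavy key is guaranteed to survive the descent, since each of its prefix buckets accumulates at least its own count; false positives require a chain of hash collisions across all levels.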

Network tomography based on 1-D projections

Jin Cao

Complex Datasets and Inverse Problems: Tomography, Networks and Beyond (Regina Liu, William Strawderman and Cun-Hui Zhang, eds.) (2007), pp. 45-61

A Simple and Efficient Estimation Method for Stream Expression Cardinalities

Jin Cao

Tian Bu

Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB) (2007)

Estimating the cardinality (i.e., the number of distinct elements) of an arbitrary set expression defined over multiple distributed streams is one of the most fundamental queries of interest. Earlier methods based on probabilistic sketches have focused mostly on the sketching algorithms. However, the estimators do not fully utilize the information in the sketches and thus are not statistically efficient. In this paper, we develop a novel statistical model and an efficient yet simple estimator for the cardinalities based on a continuous variant of the well-known Flajolet-Martin sketches. Specifically, we show that, for two streams, our estimator has almost the same statistical efficiency as the maximum likelihood estimator (MLE), which is known to be optimal in the sense of the Cramer-Rao lower bound under regularity conditions. Moreover, as the number of streams grows, our estimator remains computationally simple, whereas the MLE becomes intractable due to the complexity of the likelihood. Let N be the cardinality of the union of all streams, and |S| the cardinality of a set expression S to be estimated. For a given relative standard error δ, the memory requirement of our estimator is O(δ⁻² |S|⁻¹ N log log N), which is superior to state-of-the-art algorithms, especially for large N and small |S|/N, where estimation is most challenging.
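The continuous Flajolet-Martin idea can be sketched with a toy variant (stochastic averaging over m buckets and a simple method-of-moments estimator; the paper's statistically efficient estimator for general set expressions is considerably more refined):

```python
import hashlib

def _hash_pair(item, m):
    """One digest per item -> (bucket index, uniform(0,1) value)."""
    d = hashlib.blake2b(str(item).encode(), digest_size=16).digest()
    bucket = int.from_bytes(d[:8], "big") % m
    u = (int.from_bytes(d[8:], "big") + 1) / 2**64
    return bucket, u

def cfm_sketch(stream, m=1024):
    """Continuous FM sketch: each item lands in one of m buckets, and
    the bucket keeps the minimum uniform hash value seen."""
    mins = [1.0] * m
    for item in stream:
        b, u = _hash_pair(item, m)
        if u < mins[b]:
            mins[b] = u
    return mins

def estimate(mins):
    # Each bucket minimum is approximately Exp(n/m) distributed, so the
    # sum of the m minima is approximately Gamma(m, rate n/m), and
    # m*(m-1)/sum is approximately unbiased for the cardinality n
    # (toy estimator, not the paper's).
    m = len(mins)
    return m * (m - 1) / sum(mins)

# Sketches merge for unions by taking coordinate-wise minima, so the
# union cardinality of two streams is estimable from their sketches.
s1 = cfm_sketch(f"ip-{i}" for i in range(6000))
s2 = cfm_sketch(f"ip-{i}" for i in range(4000, 10000))
union = [min(a, b) for a, b in zip(s1, s2)]
est = estimate(union)            # true union cardinality: 10,000
```

The coordinate-wise-minimum merge is what makes set expressions over distributed streams possible without co-locating the raw data.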

Efficient independent component analysis

Independent component analysis (ICA) has been widely used for blind source separation in many fields, such as brain imaging analysis, signal processing and telecommunication. Many statistical techniques based on M-estimates have been proposed for estimating the mixing matrix. Recently, several nonparametric methods have been developed, but in-depth analysis of asymptotic efficiency has not been available. We analyze ICA using semiparametric theories and propose a straightforward estimate based on the efficient score function by using B-spline approximations. The estimate is asymptotically efficient under moderate conditions and exhibits better performance than standard ICA methods in a variety of simulations.

Design and Evaluation of a Fast and Robust Worm Detection Algorithm

Tian Bu

Scott Vander Wiel

Thomas Woo

Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM 2006)

Fast Kernel Density Independent Component Analysis

Independent Component Analysis and Blind Signal Separation (Lecture Notes in Computer Science), Springer Berlin / Heidelberg (2006), pp. 24-31

We develop a super-fast kernel density estimation algorithm (FastKDE) and, based on it, a fast kernel independent component analysis algorithm (KDICA). FastKDE calculates the kernel density estimator exactly, and its computation requires only sorting n numbers plus roughly 2n evaluations of the exponential function, where n is the sample size. KDICA converges as quickly as parametric ICA algorithms such as FastICA. Comparisons with state-of-the-art ICA algorithms in simulation studies show that KDICA is promising for practical use due to its computational as well as statistical efficiency. Some statistical properties of KDICA are analyzed.
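The sort-plus-2n-exponentials trick is easy to reproduce for the Laplacian (double-exponential) kernel, where the kernel factorizes across sorted points into two linear recursions. This is a sketch in the spirit of FastKDE, not the paper's exact algorithm:

```python
import math
import random

def fast_laplace_kde(xs, h):
    """Exact Laplacian-kernel KDE at the sample points in O(n log n):
    sort once, then two linear sweeps with one exp() per step.
    Returns {point: density estimate}."""
    xs = sorted(xs)
    n = len(xs)
    left = [0.0] * n      # left[i]  = sum_{j<=i} exp(-(x_i - x_j)/h)
    right = [0.0] * n     # right[i] = sum_{j>=i} exp(-(x_j - x_i)/h)
    left[0] = 1.0
    for i in range(1, n):
        left[i] = 1.0 + left[i - 1] * math.exp(-(xs[i] - xs[i - 1]) / h)
    right[n - 1] = 1.0
    for i in range(n - 2, -1, -1):
        right[i] = 1.0 + right[i + 1] * math.exp(-(xs[i + 1] - xs[i]) / h)
    # each point appears in both sums, hence the -1
    return {x: (left[i] + right[i] - 1.0) / (2 * h * n)
            for i, x in enumerate(xs)}

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
dens = fast_laplace_kde(sample, h=0.3)
```

The recursions work because exp(-(x_i - x_j)/h) = exp(-(x_i - x_{i-1})/h) * exp(-(x_{i-1} - x_j)/h) for sorted points, so each prefix sum is one multiply-add away from the previous one.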

Consistent independent component analysis and prewhitening

We study the statistical merits of two techniques used in the literature of independent component analysis (ICA). First, we analyze the characteristic-function-based ICA method (CHFICA) and study its statistical properties such as consistency, √n-consistency, and robustness against small additive noise. Second, we study the validity of prewhitening, a preprocessing technique used by many ICA algorithms, as applied to the CHFICA method. In particular, we establish the surprising effectiveness of this technique even when some components have heavy tails and others do not. A fast new algorithm implementing the prewhitened CHFICA method is also provided.
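Prewhitening itself is simple to illustrate: linearly transform centered data so its sample covariance is the identity. Below is the textbook eigendecomposition version applied to a hypothetical 2x2 mixing of a Laplace (heavier-tailed) and a uniform source, not the paper's fast prewhitened-CHFICA algorithm:

```python
import numpy as np

def prewhiten(X):
    """Whiten centered data: multiply by cov^{-1/2} so that the sample
    covariance of the result is the identity (a standard ICA
    preprocessing step)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # cov^{-1/2}
    return Xc @ W

rng = np.random.default_rng(3)
# Hypothetical sources: one heavier-tailed (Laplace), one uniform.
S = np.column_stack([rng.laplace(size=5000), rng.uniform(-1, 1, size=5000)])
A = np.array([[2.0, 1.0], [1.0, 1.0]])        # hypothetical mixing matrix
X = S @ A.T
Z = prewhiten(X)
```

After whitening, the remaining unmixing step reduces to finding an orthogonal rotation, which is why so many ICA algorithms prewhiten first.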