# Andrei Z. Broder

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

Learning Effective Embeddings for Machine Generated Emails with Applications to Email Category Prediction

Yu Sun

Luis Garcia Pueyo

Proceedings of the IEEE International Conference on Big Data(2018), pp. 1846-1855

Preview abstract
Machine-generated business-to-consumer (B2C) emails such as receipts, newsletters, and promotions constitute today a large portion of users' inbox. These emails reflect the users' interests and often are sequentially correlated, e.g., users interested in relocating may receive a sequence of messages on housing, moving, job availability, etc. We aim to infer (and eventually serve) the users' future interests by predicting the categories of their future emails. There are many good methods such as recurrent neural networks that can be applied for such predictions but in all cases the key to better performance is an effective representation of emails and users. To this end, we propose a general framework for embedding learning for emails and users, using as input only the sequence of B2C templates users receive and open. (A template is a B2C email stripped of all transient information related to specific users.) These learned embeddings allow us to identify both sequentially correlated emails and users with similar sequential interests. We can also use the learned embeddings either as input features or embedding initializers for email category predictions. Extensive experiments with millions of fully anonymized B2C emails demonstrate that the learned embeddings can significantly improve the prediction accuracy for future email categories. We hope that this effective yet simple embedding learning framework will inspire new machine intelligence applications that will improve the users' email experience.
View details

Email Category Prediction

Aston Zhang

Luis Garcia Pueyo

Companion Proc. of the 26th International World Wide Web Conference(2017), pp. 495-503

Preview abstract
According to recent estimates, about 90% of consumer received emails are machine-generated. Such messages include shopping receipts, promotional campaigns, newsletters, booking confirmations, etc. Most such messages are created by populating a fixed template with a small amount of personalized information, such as name, salutation, reservation numbers, dates, etc. Web mail providers (Gmail, Hotmail, Yahoo) are leveraging the structured nature of such emails to extract salient information and use it to improve the user experience: e.g. by automatically entering reservation data into a user calendar, or by sending alerts about upcoming shipments. To facilitate these extraction tasks it is helpful to classify templates according to their category, e.g. restaurant reservations or bill reminders, since each category triggers a particular user experience.
Recent research has focused on discovering the causal thread of templates, e.g. inferring that a shopping order is usually followed by a shipping confirmation, an airline booking is followed by a confirmation and then by a “ready to check in” message, etc. Gamzu et al. took this idea one step further by implementing a method to predict the template category of future emails for a given user based on previously received templates. The motivation is that predicting future emails has a wide range of potential applications, including better user experiences (e.g. warning users of items ordered but not shipped), targeted advertising (e.g. users that recently made a flight reservation may be interested in hotel reservations), and spam classification (a message that is part of a legitimate causal thread is unlikely to be spam).
The gist of the Gamzu et al. approach is modeling the problem as a Markov chain, where the nodes are templates or temporal events (e.g. the first day of the month). This paper expands on their work by investigating the use of neural networks for predicting the category of emails that will arrive during a fixed-sized time window in the future. We consider two types of neural networks: multi-layer perceptrons (MLP), a type of feedforward neural network; and long short-term memory (LSTM), a type of recurrent neural network. For each type of neural network, we explore the effects of varying their configuration (e.g. number of layers or number of neurons) and hyper-parameters (e.g. drop-out ratio). We find that the prediction accuracy of neural networks vastly outperforms the Markov chain approach, and that LSTMs perform slightly better than MLPs. We offer some qualitative interpretation of our findings and identify some promising future directions.
View details

Scalable K-Means by ranked retrieval

Lluis Garcia-Pueyo

Vanja Josifovski

Srihari Venkatesan

Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA(2014), pp. 233-242

Preview abstract
The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance.
View details

IR paradigms in computational advertising

Preview
SIGIR(2012), pp. 1019

Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe

Preview
Shirshanka Das

Marcus Fontoura

Bhaskar Ghosh

Vanja Josifovski

Jayavel Shanmugasundaram

The 20th International World Wide Web Confererence (WWW 2011)

Information extraction meets relation databases

Preview
Davood Rafiei

Edward Chang

Patrick Pantel

CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, ACM, New York, NY, USA(2009), pp. 897-897

Indexing Shared Content in Information Retrieval Systems

Preview
Nadav Eiron

Marcus Fontoura

Michael Herscovici

Ronny Lempel

John McPherson

Runping Qi

Eugene J. Shekita

EDBT(2006), pp. 313-330

Current trends in the integration of searching and browsing

Preview
Yoelle S. Maarek

Krishna Bharat

Susan T. Dumais

Steve Papa

Jan O. Pedersen

WWW (Special interest tracks and posters)(2005), pp. 793

Information retrieval challenges in computational advertising

An introduction to online targeted advertising: principles, implementation, controversies

IUI(2011), pp. 103-104

Highly Dimensional Problems in Computational Advertising

ECML/PKDD (1)(2011), pp. 5

Bid generation for advanced match in sponsored search

Evgeniy Gabrilovich

Vanja Josifovski

George Mavromatis

Alex J. Smola

WSDM(2011), pp. 515-524

Introduction to display advertising: a half-day tutorial

Web Page Summarization for Just-in-Time Contextual Advertising

Aris Anagnostopoulos

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

ACM TIST, 3(2011), pp. 14

Efficiently evaluating graph constraints in content-based publish/subscribe

Shirshanka Das

Marcus Fontoura

Bhaskar Ghosh

Vanja Josifovski

Jayavel Shanmugasundaram

WWW(2011), pp. 497-506

The Anatomy of the Long Tail of Consumer Demand

WAW(2010), pp. 1

Automatic generation of bid phrases for online advertising

Sujith Ravi

Evgeniy Gabrilovich

Vanja Josifovski

Sandeep Pandey

Bo Pang

Proceedings of the International Conference on Web Search and Data Mining (WSDM)(2010), pp. 341-350

Preview abstract
One of the most prevalent online advertising methods is textual advertising. To produce a textual ad, an advertiser must craft a short creative (the text of the ad) linking to a landing page, which describes the product or service being promoted. Furthermore, the advertiser must associate the creative to a set of manually chosen bid phrases representing those Web search queries that should trigger the ad. For efficiency, given a landing page, the bid phrases are often chosen first, and then for each bid phrase the creative is produced using a template. Nevertheless, an ad campaign (e.g., for a large retailer) might involve thousands of landing pages and tens or hundreds of thousands of bid phrases, hence the entire process is very laborious. Our study aims towards the automatic construction of online ad campaigns: given a landing page, we propose several algorithmic methods to generate bid phrases suitable for the given input. Such phrases must be both relevant (that is, reflect the content of the page) and well-formed (that is, likely to be used as queries to a Web search engine). To this end, we use a two phase approach. First, candidate bid phrases are generated by a number of methods, including a (monolingual) translation model capable of generating phrases not contained within the text of the input as well as previously ``unseen'' phrases. Second, the candidates are ranked in a probabilistic framework using both the translation model, which favors relevant phrases, as well as a bid phrase language model, which favors well-formed phrases. Empirical evaluation based on a real-life corpus of advertisercreated landing pages and associated bid phrases confirms the value of our approach, which successfully re-generates many of the human-crafted bid phrases and performs significantly better than a pure text extraction method.
View details

Anatomy of the long tail: ordinary people with extraordinary tastes

Exploiting site-level information to improve web search

Evgeniy Gabrilovich

Vanja Josifovski

George Mavromatis

Jane Wang

CIKM(2010), pp. 1393-1396

Information retrieval challenges in computational advertising

Search is dead!: long live search

Elizabeth F. Churchill

Marti Hearst

Barney Pell

WWW(2010), pp. 1337-1338

The New Frontier of Web Search Technology: Seven Challenges

Competing for users' attention: on the interplay between organic and sponsored search results

Cristian Danescu-Niculescu-Mizil

Evgeniy Gabrilovich

Vanja Josifovski

Bo Pang

WWW(2010), pp. 291-300

Nearest-neighbor caching for content-match applications

Sandeep Pandey

Flavio Chierichetti

Vanja Josifovski

Ravi Kumar

WWW(2009), pp. 441-450

A search-based method for forecasting ad impression in contextual advertising

Web Advertising

Online expansion of rare queries for sponsored search

Peter Ciccolo

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

Jeffrey Yuan

WWW(2009), pp. 511-520

Context transfer in search advertising

The Hiring Problem and Lake Wobegon Strategies

Adam Kirsch

Ravi Kumar

Michael Mitzenmacher

Eli Upfal

SIAM J. Comput., 39(2009), pp. 1233-1255

Algorithmic Challenge in Online Advertising

AAIM(2009), pp. 1

Cross-language query classification using web search for exogenous knowledge

Classifying search queries using the Web as a source of knowledge

Evgeniy Gabrilovich

Marcus Fontoura

Amruta Joshi

Vanja Josifovski

Lance Riedel

Tong Zhang

TWEB, 3(2009)

What happens after an ad click?: quantifying the impact of landing pages in web advertising

A note on search based forecasting of ad volume in contextual advertising

Reviewing the Reviewers: Characterizing Biases and Competencies using Socially Meaningful Attributes

Sihem Amer-Yahia

Alban Galland

AAAI Spring Symposium: Social Information Processing(2008), pp. 1-6

Introduction to special issue on query log analysis: Technology and ethics

To swing or not to swing: learning when (not) to advertise

Marcus Fontoura

Evgeniy Gabrilovich

Vanja Josifovski

Vanessa Murdock

Vassilis Plachouras

CIKM(2008), pp. 1003-1012

Computational advertising

SODA(2008), pp. 992

Cross-lingual query classification: a preliminary study

Xuerui Wang

Evgeniy Gabrilovich

Vanja Josifovski

Bo Pang

CIKM-iNEWS(2008), pp. 101-104

Search advertising using web relevance feedback

Peter Ciccolo

Marcus Fontoura

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

CIKM(2008), pp. 1013-1022

Computational advertising and recommender systems

RecSys(2008), pp. 1-2

Optimizing relevance and revenue in ad search: a query substitution approach

Filip Radlinski

Peter Ciccolo

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

SIGIR(2008), pp. 403-410

Effective and efficient classification on a search-engine model

The hiring problem and Lake Wobegon strategies

Adam Kirsch

Ravi Kumar

Michael Mitzenmacher

Eli Upfal

SODA(2008), pp. 1184-1193

Just-in-time contextual advertising

Aris Anagnostopoulos

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

CIKM(2007), pp. 331-340

Estimating rates of rare events at multiple resolutions

Deepak Agarwal

Deepayan Chakrabarti

Dejan Diklic

Vanja Josifovski

Mayssam Sayyadian

KDD(2007), pp. 16-25

Margin Based Active Learning

A semantic approach to contextual advertising

Robust classification of rare queries using web knowledge

Marcus Fontoura

Evgeniy Gabrilovich

Amruta Joshi

Vanja Josifovski

Tong Zhang

SIGIR(2007), pp. 231-238

The Next Generation Web Search and the Demise of the Classic IR Model

ECIR(2007), pp. 1

The Future of Web Search: From Information Retrieval to Information Supply

NGITS(2006), pp. 362

Efficient PageRank approximation via graph aggregation

Sampling Search-Engine Results

Estimating corpus size via queries

Marcus Fontoura

Vanja Josifovski

Ravi Kumar

Rajeev Motwani

Shubha U. Nabar

Rina Panigrahy

Ying Xu 0002

CIKM(2006), pp. 594-603

Modelling and Mining of Networked Information Spaces

Workshop on Algorithms and Models for the Web Graph

Effective and efficient classification on a search-engine model

Sampling search-engine results

How search engines shape the web

Byron Dom

Krishna Bharat

Jan O. Pedersen

Yoshinobu Tonomura

WWW (Special interest tracks and posters)(2005), pp. 879

Using XML to Query XML - From Theory to Practice

Sic transit gloria telae: towards an understanding of the web's decay

Efficient pagerank approximation via graph aggregation

Ronny Lempel

Farzin Maghoul

Jan O. Pedersen

WWW (Alternate Track Papers & Posters)(2004), pp. 484-485

Invited Talk: The Many Wonders of the Web Graph

CAAN(2004), pp. 154-154

Towards the next generation of enterprise search technology

Keynote Address - exploring, modeling, and using the web graph

SIGIR(2003), pp. 1

Efficient URL caching for world wide web crawling

Efficient query evaluation using a two-level retrieval process

A derandomization using min-wise independent permutations

Survey: Network Applications of Bloom Filters: A Survey

A taxonomy of web search

SIGIR Forum, 36(2002), pp. 3-10

Completeness and robustness properties of min-wise independent permutations

A general approach to dynamic packet routing with bounded buffers

A Comparison of Techniques to Find Mirrored Hosts on the WWW

Krishna Bharat

Monika Rauch Henzinger

IEEE Data Eng. Bull., 23(2000), pp. 21-26

Identifying and Filtering Near-Duplicate Documents

CPM(2000), pp. 1-10

Introduction: The Fourth International Workshop on Randomization and Approximation Techniques in Computer Science

ICALP Satellite Workshops(2000), pp. 1-2

Min-Wise versus linear independence (extended abstract)

Min-wise Independent Permutations: Theory and Practice

ICALP(2000), pp. 808

Graph structure in the Web

Ravi Kumar

Farzin Maghoul

Sridhar Rajagopalan

Raymie Stata

Janet L. Wiener

Computer Networks, 33(2000), pp. 309-320

Min-Wise Independent Permutations

Moses Charikar

Alan M. Frieze

Michael Mitzenmacher

J. Comput. Syst. Sci., 60(2000), pp. 630-659

Static and Dynamic Path Selection on Expander Graphs: A Random Walk Approach

Completeness and Robustness Properties of Min-Wise Independent Permutations

Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content

Balanced Allocations

A Comparison of Techniques to Find Mirrored Hosts on the WWW

Min-Wise Independent Permutations (Extended Abstract)

The Connectivity Server: Fast Access to Linkage Information on the Web

Krishna Bharat

Monika Rauch Henzinger

Puneet Kumar

Suresh Venkatasubramanian

Computer Networks, 30(1998), pp. 469-477

Optimal Construction of Edge-Disjoint Paths in Random Graphs

A Derandomization Using Min-Wise Independent Permutations

Dynamic Packet Routing on Arrays with Bounded Buffers

A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines

Syntactic Clustering of the Web

Steven C. Glassman

Mark S. Manasse

Geoffrey Zweig

Computer Networks, 29(1997), pp. 1157-1166

Counting Minimum Weight Spanning Trees

Static and Dynamic Path Selection on Expander Graphs: A Random Walk Approach (Preliminary Version)

Biased Random Walks

Yossi Azar

Anna R. Karlin

Nathan Linial

Steven Phillips

Combinatorica, 16(1996), pp. 1-18

Pattern-based Compression of Text Images

Dynamic Deflection Routing on Arrays (Preliminary Version)

A General Approach to Dynamic Packet Routing with Bounded Buffers (extended abstract)

An Efficient Algorithm for the Vertex-Disjoint Paths Problem in Random Graphs

The Worst-Case Running Time of the Random Simplex Algorithm is Exponential in the Height

Martin E. Dyer

Alan M. Frieze

Eli Upfal

Inf. Process. Lett., 56(1995), pp. 79-81

Balanced Allocations for Tree-Like Inputs

Alan M. Frieze

Carsten Lund

Steven Phillips

Nick Reingold

Inf. Process. Lett., 55(1995), pp. 329-332

Balanced allocations (extended abstract)

On-Line Load Balancing

Existence and Construction of Edge-Disjoint Paths on Expander Graphs

On the Problem of Approximating the Number of Bases of a Matroid

Finding Hidden Hamiltonian Cycles

Near-perfect Token Distribution

Alan M. Frieze

Eli Shamir

Eli Upfal

Random Struct. Algorithms, 5(1994), pp. 559-572

Trading Space for Time in Undirected s-t Connectivity

Optimal Construction of Edge-Disjoint Paths in Random Graphs

On the Satisfiability and Maximum Satisfiability of Random 3-CNF Formulas

On-line Choice of On-line Algorithms

Biased Random Walks

Existence and Construction of Edge Disjoint Paths on Expander Graphs

On-line Load Balancing (Extended Abstract)

Near-perfect Token Distribution

On the Parallel Complexity of Evaluating Game Trees

Finding Hidden Hamiltonian Cycles (Extended Abstract)

Multilevel Adaptive Hashing

The Cost Distribution of Clustering in Random Probing

Trading Space for Time in Undirected s-t Connectivity

Generating Random Spanning Trees

FOCS(1989), pp. 442-447

Errata to "How hard is to marry at random? (On the approximation of the permanent)"

STOC(1988), pp. 551

On Generating Solved Instances of Computational Problems

Eric Allender

Joan Feigenbaum

Lane A. Hemachandra

CRYPTO(1988), pp. 297-310

Bounds on the Cover Time (Preliminary Version)

On the Second Eigenvalue of Random Regular Graphs (Preliminary Version)

Efficient Fault-Tolerant Routings in Networks

How hard is to marry at random? (On the approximation of the permanent)

STOC(1986), pp. 50-58

Placing Tiles in the Plane

A Provably Secure Polynomial Approximation Scheme for the Distributed Lottery Problem (Extended Abstract)

PODC(1985), pp. 136-148

Efficient Fault Tolerant Routings in Networks

Flipping coins in many pockets (Byzantine agreement on uniformly random values)

The r-Stirling numbers

Discrete Mathematics, 49(1984), pp. 241-259