# Andrei Z. Broder

### Research Areas

Authored Publications

Google Publications

Other Publications

Sort By

Learning Effective Embeddings for Machine Generated Emails with Applications to Email Category Prediction

Yu Sun

Luis Garcia Pueyo

Proceedings of the IEEE International Conference on Big Data (2018), pp. 1846-1855

Preview abstract
Machine-generated business-to-consumer (B2C) emails such as receipts, newsletters, and promotions constitute today a large portion of users' inbox. These emails reflect the users' interests and often are sequentially correlated, e.g., users interested in relocating may receive a sequence of messages on housing, moving, job availability, etc. We aim to infer (and eventually serve) the users' future interests by predicting the categories of their future emails. There are many good methods such as recurrent neural networks that can be applied for such predictions but in all cases the key to better performance is an effective representation of emails and users. To this end, we propose a general framework for embedding learning for emails and users, using as input only the sequence of B2C templates users receive and open. (A template is a B2C email stripped of all transient information related to specific users.) These learned embeddings allow us to identify both sequentially correlated emails and users with similar sequential interests. We can also use the learned embeddings either as input features or embedding initializers for email category predictions. Extensive experiments with millions of fully anonymized B2C emails demonstrate that the learned embeddings can significantly improve the prediction accuracy for future email categories. We hope that this effective yet simple embedding learning framework will inspire new machine intelligence applications that will improve the users' email experience.
View details

Email Category Prediction

Aston Zhang

Luis Garcia Pueyo

Companion Proc. of the 26th International World Wide Web Conference (2017), pp. 495-503

Preview abstract
According to recent estimates, about 90% of consumer received emails are machine-generated. Such messages include shopping receipts, promotional campaigns, newsletters, booking confirmations, etc. Most such messages are created by populating a fixed template with a small amount of personalized information, such as name, salutation, reservation numbers, dates, etc. Web mail providers (Gmail, Hotmail, Yahoo) are leveraging the structured nature of such emails to extract salient information and use it to improve the user experience: e.g. by automatically entering reservation data into a user calendar, or by sending alerts about upcoming shipments. To facilitate these extraction tasks it is helpful to classify templates according to their category, e.g. restaurant reservations or bill reminders, since each category triggers a particular user experience.
Recent research has focused on discovering the causal thread of templates, e.g. inferring that a shopping order is usually followed by a shipping confirmation, an airline booking is followed by a confirmation and then by a “ready to check in” message, etc. Gamzu et al. took this idea one step further by implementing a method to predict the template category of future emails for a given user based on previously received templates. The motivation is that predicting future emails has a wide range of potential applications, including better user experiences (e.g. warning users of items ordered but not shipped), targeted advertising (e.g. users that recently made a flight reservation may be interested in hotel reservations), and spam classification (a message that is part of a legitimate causal thread is unlikely to be spam).
The gist of the Gamzu et al. approach is modeling the problem as a Markov chain, where the nodes are templates or temporal events (e.g. the first day of the month). This paper expands on their work by investigating the use of neural networks for predicting the category of emails that will arrive during a fixed-sized time window in the future. We consider two types of neural networks: multi-layer perceptrons (MLP), a type of feedforward neural network; and long short-term memory (LSTM), a type of recurrent neural network. For each type of neural network, we explore the effects of varying their configuration (e.g. number of layers or number of neurons) and hyper-parameters (e.g. drop-out ratio). We find that the prediction accuracy of neural networks vastly outperforms the Markov chain approach, and that LSTMs perform slightly better than MLPs. We offer some qualitative interpretation of our findings and identify some promising future directions.
View details

Scalable K-Means by ranked retrieval

Lluis Garcia-Pueyo

Vanja Josifovski

Srihari Venkatesan

Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242

Preview abstract
The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance.
View details

IR paradigms in computational advertising

Preview
SIGIR (2012), pp. 1019

Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe

Preview
Shirshanka Das

Marcus Fontoura

Bhaskar Ghosh

Vanja Josifovski

Jayavel Shanmugasundaram

The 20th International World Wide Web Confererence (WWW 2011)

Information extraction meets relation databases

Preview
Davood Rafiei

Edward Chang

Patrick Pantel

CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, ACM, New York, NY, USA (2009), pp. 897-897

Indexing Shared Content in Information Retrieval Systems

Preview
Nadav Eiron

Marcus Fontoura

Michael Herscovici

Ronny Lempel

John McPherson

Runping Qi

Eugene J. Shekita

EDBT (2006), pp. 313-330

Current trends in the integration of searching and browsing

Preview
Yoelle S. Maarek

Krishna Bharat

Susan T. Dumais

Steve Papa

Jan O. Pedersen

WWW (Special interest tracks and posters) (2005), pp. 793

A Comparison of Techniques to Find Mirrored Hosts on the WWW

Preview
Krishna Bharat

Monika Rauch Henzinger

JASIS, vol. 51 (2000), pp. 1114-1122

Bid generation for advanced match in sponsored search

Evgeniy Gabrilovich

Vanja Josifovski

George Mavromatis

Alex J. Smola

WSDM (2011), pp. 515-524

Web Page Summarization for Just-in-Time Contextual Advertising

Aris Anagnostopoulos

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

ACM TIST, vol. 3 (2011), pp. 14

Introduction to display advertising: a half-day tutorial

Efficiently evaluating graph constraints in content-based publish/subscribe

Shirshanka Das

Marcus Fontoura

Bhaskar Ghosh

Vanja Josifovski

Jayavel Shanmugasundaram

WWW (2011), pp. 497-506

An introduction to online targeted advertising: principles, implementation, controversies

IUI (2011), pp. 103-104

Information retrieval challenges in computational advertising

Highly Dimensional Problems in Computational Advertising

ECML/PKDD (1) (2011), pp. 5

Information retrieval challenges in computational advertising

Competing for users' attention: on the interplay between organic and sponsored search results

Cristian Danescu-Niculescu-Mizil

Evgeniy Gabrilovich

Vanja Josifovski

Bo Pang

WWW (2010), pp. 291-300

Exploiting site-level information to improve web search

Evgeniy Gabrilovich

Vanja Josifovski

George Mavromatis

Jane Wang

CIKM (2010), pp. 1393-1396

Search is dead!: long live search

Elizabeth F. Churchill

Marti Hearst

Barney Pell

WWW (2010), pp. 1337-1338

Automatic generation of bid phrases for online advertising

Sujith Ravi

Evgeniy Gabrilovich

Vanja Josifovski

Sandeep Pandey

Bo Pang

Proceedings of the International Conference on Web Search and Data Mining (WSDM) (2010), pp. 341-350

Anatomy of the long tail: ordinary people with extraordinary tastes

The Anatomy of the Long Tail of Consumer Demand

WAW (2010), pp. 1

The New Frontier of Web Search Technology: Seven Challenges

A search-based method for forecasting ad impression in contextual advertising

Web Advertising

Context transfer in search advertising

Online expansion of rare queries for sponsored search

Peter Ciccolo

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

Jeffrey Yuan

WWW (2009), pp. 511-520

The Hiring Problem and Lake Wobegon Strategies

Adam Kirsch

Ravi Kumar

Michael Mitzenmacher

Eli Upfal

SIAM J. Comput., vol. 39 (2009), pp. 1233-1255

Cross-language query classification using web search for exogenous knowledge

Classifying search queries using the Web as a source of knowledge

Evgeniy Gabrilovich

Marcus Fontoura

Amruta Joshi

Vanja Josifovski

Lance Riedel

Tong Zhang

TWEB, vol. 3 (2009)

Algorithmic Challenge in Online Advertising

AAIM (2009), pp. 1

What happens after an ad click?: quantifying the impact of landing pages in web advertising

Nearest-neighbor caching for content-match applications

Sandeep Pandey

Flavio Chierichetti

Vanja Josifovski

Ravi Kumar

WWW (2009), pp. 441-450

Computational advertising

SODA (2008), pp. 992

A note on search based forecasting of ad volume in contextual advertising

Introduction to special issue on query log analysis: Technology and ethics

Effective and efficient classification on a search-engine model

Optimizing relevance and revenue in ad search: a query substitution approach

Filip Radlinski

Peter Ciccolo

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

SIGIR (2008), pp. 403-410

Computational advertising and recommender systems

RecSys (2008), pp. 1-2

Reviewing the Reviewers: Characterizing Biases and Competencies using Socially Meaningful Attributes

Sihem Amer-Yahia

Alban Galland

AAAI Spring Symposium: Social Information Processing (2008), pp. 1-6

To swing or not to swing: learning when (not) to advertise

Marcus Fontoura

Evgeniy Gabrilovich

Vanja Josifovski

Vanessa Murdock

Vassilis Plachouras

CIKM (2008), pp. 1003-1012

Cross-lingual query classification: a preliminary study

Xuerui Wang

Evgeniy Gabrilovich

Vanja Josifovski

Bo Pang

CIKM-iNEWS (2008), pp. 101-104

The hiring problem and Lake Wobegon strategies

Adam Kirsch

Ravi Kumar

Michael Mitzenmacher

Eli Upfal

SODA (2008), pp. 1184-1193

Search advertising using web relevance feedback

Peter Ciccolo

Marcus Fontoura

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

CIKM (2008), pp. 1013-1022

A semantic approach to contextual advertising

Margin Based Active Learning

The Next Generation Web Search and the Demise of the Classic IR Model

ECIR (2007), pp. 1

Estimating rates of rare events at multiple resolutions

Deepak Agarwal

Deepayan Chakrabarti

Dejan Diklic

Vanja Josifovski

Mayssam Sayyadian

KDD (2007), pp. 16-25

Robust classification of rare queries using web knowledge

Marcus Fontoura

Evgeniy Gabrilovich

Amruta Joshi

Vanja Josifovski

Tong Zhang

SIGIR (2007), pp. 231-238

Just-in-time contextual advertising

Aris Anagnostopoulos

Evgeniy Gabrilovich

Vanja Josifovski

Lance Riedel

CIKM (2007), pp. 331-340

Efficient PageRank approximation via graph aggregation

The Future of Web Search: From Information Retrieval to Information Supply

NGITS (2006), pp. 362

Estimating corpus size via queries

Marcus Fontoura

Vanja Josifovski

Ravi Kumar

Rajeev Motwani

Shubha U. Nabar

Rina Panigrahy

Ying Xu 0002

CIKM (2006), pp. 594-603

Sampling Search-Engine Results

Modelling and Mining of Networked Information Spaces

Workshop on Algorithms and Models for the Web Graph

Effective and efficient classification on a search-engine model

How search engines shape the web

Byron Dom

Krishna Bharat

Jan O. Pedersen

Yoshinobu Tonomura

WWW (Special interest tracks and posters) (2005), pp. 879

Sampling search-engine results

Sic transit gloria telae: towards an understanding of the web's decay

Using XML to Query XML - From Theory to Practice

Invited Talk: The Many Wonders of the Web Graph

CAAN (2004), pp. 154-154

Efficient pagerank approximation via graph aggregation

Ronny Lempel

Farzin Maghoul

Jan O. Pedersen

WWW (Alternate Track Papers & Posters) (2004), pp. 484-485

Towards the next generation of enterprise search technology

A derandomization using min-wise independent permutations

Moses Charikar

Michael Mitzenmacher

J. Discrete Algorithms, vol. 1 (2003), pp. 11-20

Survey: Network Applications of Bloom Filters: A Survey

Efficient query evaluation using a two-level retrieval process

Efficient URL caching for world wide web crawling

Keynote Address - exploring, modeling, and using the web graph

SIGIR (2003), pp. 1

A taxonomy of web search

SIGIR Forum, vol. 36 (2002), pp. 3-10

Completeness and robustness properties of min-wise independent permutations

A general approach to dynamic packet routing with bounded buffers

Min-wise Independent Permutations: Theory and Practice

ICALP (2000), pp. 808

Introduction: The Fourth International Workshop on Randomization and Approximation Techniques in Computer Science

ICALP Satellite Workshops (2000), pp. 1-2

A Comparison of Techniques to Find Mirrored Hosts on the WWW

Krishna Bharat

Monika Rauch Henzinger

IEEE Data Eng. Bull., vol. 23 (2000), pp. 21-26

Min-Wise Independent Permutations

Moses Charikar

Alan M. Frieze

Michael Mitzenmacher

J. Comput. Syst. Sci., vol. 60 (2000), pp. 630-659

Min-Wise versus linear independence (extended abstract)

Graph structure in the Web

Ravi Kumar

Farzin Maghoul

Sridhar Rajagopalan

Raymie Stata

Janet L. Wiener

Computer Networks, vol. 33 (2000), pp. 309-320

Identifying and Filtering Near-Duplicate Documents

CPM (2000), pp. 1-10

A Comparison of Techniques to Find Mirrored Hosts on the WWW

Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content

Completeness and Robustness Properties of Min-Wise Independent Permutations

Balanced Allocations

Static and Dynamic Path Selection on Expander Graphs: A Random Walk Approach

Dynamic Packet Routing on Arrays with Bounded Buffers

A Derandomization Using Min-Wise Independent Permutations

A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines

Min-Wise Independent Permutations (Extended Abstract)

Optimal Construction of Edge-Disjoint Paths in Random Graphs

The Connectivity Server: Fast Access to Linkage Information on the Web

Krishna Bharat

Monika Rauch Henzinger

Puneet Kumar

Suresh Venkatasubramanian

Computer Networks, vol. 30 (1998), pp. 469-477

Syntactic Clustering of the Web

Steven C. Glassman

Mark S. Manasse

Geoffrey Zweig

Computer Networks, vol. 29 (1997), pp. 1157-1166

Static and Dynamic Path Selection on Expander Graphs: A Random Walk Approach (Preliminary Version)

Counting Minimum Weight Spanning Trees

An Efficient Algorithm for the Vertex-Disjoint Paths Problem in Random Graphs

Dynamic Deflection Routing on Arrays (Preliminary Version)

Pattern-based Compression of Text Images

Biased Random Walks

Yossi Azar

Anna R. Karlin

Nathan Linial

Steven Phillips

Combinatorica, vol. 16 (1996), pp. 1-18

A General Approach to Dynamic Packet Routing with Bounded Buffers (extended abstract)

Balanced Allocations for Tree-Like Inputs

Alan M. Frieze

Carsten Lund

Steven Phillips

Nick Reingold

Inf. Process. Lett., vol. 55 (1995), pp. 329-332

The Worst-Case Running Time of the Random Simplex Algorithm is Exponential in the Height

Martin E. Dyer

Alan M. Frieze

Eli Upfal

Inf. Process. Lett., vol. 56 (1995), pp. 79-81

On the Problem of Approximating the Number of Bases of a Matroid

Finding Hidden Hamiltonian Cycles

Balanced allocations (extended abstract)

Optimal Construction of Edge-Disjoint Paths in Random Graphs

On-Line Load Balancing

Existence and Construction of Edge-Disjoint Paths on Expander Graphs

Near-perfect Token Distribution

Alan M. Frieze

Eli Shamir

Eli Upfal

Random Struct. Algorithms, vol. 5 (1994), pp. 559-572

Trading Space for Time in Undirected s-t Connectivity

Anna R. Karlin

Eli Upfal

SIAM J. Comput., vol. 23 (1994), pp. 324-334

On-line Choice of On-line Algorithms

On the Satisfiability and Maximum Satisfiability of Random 3-CNF Formulas

Near-perfect Token Distribution

Existence and Construction of Edge Disjoint Paths on Expander Graphs

On-line Load Balancing (Extended Abstract)

Biased Random Walks

On the Parallel Complexity of Evaluating Game Trees

Finding Hidden Hamiltonian Cycles (Extended Abstract)

Multilevel Adaptive Hashing

The Cost Distribution of Clustering in Random Probing

Trading Space for Time in Undirected s-t Connectivity

Generating Random Spanning Trees

FOCS (1989), pp. 442-447

Bounds on the Cover Time (Preliminary Version)

On Generating Solved Instances of Computational Problems

Eric Allender

Joan Feigenbaum

Lane A. Hemachandra

CRYPTO (1988), pp. 297-310

Errata to "How hard is to marry at random? (On the approximation of the permanent)"

STOC (1988), pp. 551

On the Second Eigenvalue of Random Regular Graphs (Preliminary Version)

Efficient Fault-Tolerant Routings in Networks

Danny Dolev

Michael J. Fischer

Barbara Simons

Inf. Comput., vol. 75 (1987), pp. 52-64

How hard is to marry at random? (On the approximation of the permanent)

STOC (1986), pp. 50-58

A Provably Secure Polynomial Approximation Scheme for the Distributed Lottery Problem (Extended Abstract)

PODC (1985), pp. 136-148

Placing Tiles in the Plane

Flipping coins in many pockets (Byzantine agreement on uniformly random values)

The r-Stirling numbers

Discrete Mathematics, vol. 49 (1984), pp. 241-259

Efficient Fault Tolerant Routings in Networks