Andrei Z. Broder
Research Areas
Authored Publications
Google Publications
Other Publications
Sort By
Learning Effective Embeddings for Machine Generated Emails with Applications to Email Category Prediction
Yu Sun
Luis Garcia Pueyo
Proceedings of the IEEE International Conference on Big Data (2018), pp. 1846-1855
Preview abstract
Machine-generated business-to-consumer (B2C) emails such as receipts, newsletters, and promotions constitute today a large portion of users' inbox. These emails reflect the users' interests and often are sequentially correlated, e.g., users interested in relocating may receive a sequence of messages on housing, moving, job availability, etc. We aim to infer (and eventually serve) the users' future interests by predicting the categories of their future emails. There are many good methods such as recurrent neural networks that can be applied for such predictions but in all cases the key to better performance is an effective representation of emails and users. To this end, we propose a general framework for embedding learning for emails and users, using as input only the sequence of B2C templates users receive and open. (A template is a B2C email stripped of all transient information related to specific users.) These learned embeddings allow us to identify both sequentially correlated emails and users with similar sequential interests. We can also use the learned embeddings either as input features or embedding initializers for email category predictions. Extensive experiments with millions of fully anonymized B2C emails demonstrate that the learned embeddings can significantly improve the prediction accuracy for future email categories. We hope that this effective yet simple embedding learning framework will inspire new machine intelligence applications that will improve the users' email experience.
View details
Email Category Prediction
Aston Zhang
Luis Garcia Pueyo
Companion Proc. of the 26th International World Wide Web Conference (2017), pp. 495-503
Preview abstract
According to recent estimates, about 90% of consumer received emails are machine-generated. Such messages include shopping receipts, promotional campaigns, newsletters, booking confirmations, etc. Most such messages are created by populating a fixed template with a small amount of personalized information, such as name, salutation, reservation numbers, dates, etc. Web mail providers (Gmail, Hotmail, Yahoo) are leveraging the structured nature of such emails to extract salient information and use it to improve the user experience: e.g. by automatically entering reservation data into a user calendar, or by sending alerts about upcoming shipments. To facilitate these extraction tasks it is helpful to classify templates according to their category, e.g. restaurant reservations or bill reminders, since each category triggers a particular user experience.
Recent research has focused on discovering the causal thread of templates, e.g. inferring that a shopping order is usually followed by a shipping confirmation, an airline booking is followed by a confirmation and then by a “ready to check in” message, etc. Gamzu et al. took this idea one step further by implementing a method to predict the template category of future emails for a given user based on previously received templates. The motivation is that predicting future emails has a wide range of potential applications, including better user experiences (e.g. warning users of items ordered but not shipped), targeted advertising (e.g. users that recently made a flight reservation may be interested in hotel reservations), and spam classification (a message that is part of a legitimate causal thread is unlikely to be spam).
The gist of the Gamzu et al. approach is modeling the problem as a Markov chain, where the nodes are templates or temporal events (e.g. the first day of the month). This paper expands on their work by investigating the use of neural networks for predicting the category of emails that will arrive during a fixed-sized time window in the future. We consider two types of neural networks: multi-layer perceptrons (MLP), a type of feedforward neural network; and long short-term memory (LSTM), a type of recurrent neural network. For each type of neural network, we explore the effects of varying their configuration (e.g. number of layers or number of neurons) and hyper-parameters (e.g. drop-out ratio). We find that the prediction accuracy of neural networks vastly outperforms the Markov chain approach, and that LSTMs perform slightly better than MLPs. We offer some qualitative interpretation of our findings and identify some promising future directions.
View details
Scalable K-Means by ranked retrieval
Lluis Garcia-Pueyo
Vanja Josifovski
Srihari Venkatesan
Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242
Preview abstract
The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance.
View details
IR paradigms in computational advertising
Preview
SIGIR (2012), pp. 1019
Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe
Preview
Shirshanka Das
Marcus Fontoura
Bhaskar Ghosh
Vanja Josifovski
Jayavel Shanmugasundaram
The 20th International World Wide Web Confererence (WWW 2011)
Information extraction meets relation databases
Preview
Davood Rafiei
Edward Chang
Patrick Pantel
CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, ACM, New York, NY, USA (2009), pp. 897-897
Indexing Shared Content in Information Retrieval Systems
Preview
Nadav Eiron
Marcus Fontoura
Michael Herscovici
Ronny Lempel
John McPherson
Runping Qi
Eugene J. Shekita
EDBT (2006), pp. 313-330
Current trends in the integration of searching and browsing
Preview
Yoelle S. Maarek
Krishna Bharat
Susan T. Dumais
Steve Papa
Jan O. Pedersen
WWW (Special interest tracks and posters) (2005), pp. 793
A Comparison of Techniques to Find Mirrored Hosts on the WWW
Preview
Krishna Bharat
Monika Rauch Henzinger
JASIS, vol. 51 (2000), pp. 1114-1122
Highly Dimensional Problems in Computational Advertising
ECML/PKDD (1) (2011), pp. 5
Introduction to display advertising: a half-day tutorial
Bid generation for advanced match in sponsored search
Evgeniy Gabrilovich
Vanja Josifovski
George Mavromatis
Alex J. Smola
WSDM (2011), pp. 515-524
An introduction to online targeted advertising: principles, implementation, controversies
IUI (2011), pp. 103-104
Efficiently evaluating graph constraints in content-based publish/subscribe
Shirshanka Das
Marcus Fontoura
Bhaskar Ghosh
Vanja Josifovski
Jayavel Shanmugasundaram
WWW (2011), pp. 497-506
Web Page Summarization for Just-in-Time Contextual Advertising
Aris Anagnostopoulos
Evgeniy Gabrilovich
Vanja Josifovski
Lance Riedel
ACM TIST, vol. 3 (2011), pp. 14
Information retrieval challenges in computational advertising
Automatic generation of bid phrases for online advertising
Sujith Ravi
Evgeniy Gabrilovich
Vanja Josifovski
Sandeep Pandey
Bo Pang
Proceedings of the International Conference on Web Search and Data Mining (WSDM) (2010), pp. 341-350
The New Frontier of Web Search Technology: Seven Challenges
Search is dead!: long live search
Elizabeth F. Churchill
Marti Hearst
Barney Pell
WWW (2010), pp. 1337-1338
Competing for users' attention: on the interplay between organic and sponsored search results
Cristian Danescu-Niculescu-Mizil
Evgeniy Gabrilovich
Vanja Josifovski
Bo Pang
WWW (2010), pp. 291-300
Exploiting site-level information to improve web search
Evgeniy Gabrilovich
Vanja Josifovski
George Mavromatis
Jane Wang
CIKM (2010), pp. 1393-1396
Anatomy of the long tail: ordinary people with extraordinary tastes
The Anatomy of the Long Tail of Consumer Demand
WAW (2010), pp. 1
Information retrieval challenges in computational advertising
Online expansion of rare queries for sponsored search
Peter Ciccolo
Evgeniy Gabrilovich
Vanja Josifovski
Lance Riedel
Jeffrey Yuan
WWW (2009), pp. 511-520
The Hiring Problem and Lake Wobegon Strategies
Adam Kirsch
Ravi Kumar
Michael Mitzenmacher
Eli Upfal
SIAM J. Comput., vol. 39 (2009), pp. 1233-1255
Nearest-neighbor caching for content-match applications
Sandeep Pandey
Flavio Chierichetti
Vanja Josifovski
Ravi Kumar
WWW (2009), pp. 441-450
A search-based method for forecasting ad impression in contextual advertising
Algorithmic Challenge in Online Advertising
AAIM (2009), pp. 1
Classifying search queries using the Web as a source of knowledge
Evgeniy Gabrilovich
Marcus Fontoura
Amruta Joshi
Vanja Josifovski
Lance Riedel
Tong Zhang
TWEB, vol. 3 (2009)
What happens after an ad click?: quantifying the impact of landing pages in web advertising
Web Advertising
Context transfer in search advertising
Cross-language query classification using web search for exogenous knowledge
Reviewing the Reviewers: Characterizing Biases and Competencies using Socially Meaningful Attributes
Sihem Amer-Yahia
Alban Galland
AAAI Spring Symposium: Social Information Processing (2008), pp. 1-6
A note on search based forecasting of ad volume in contextual advertising
Introduction to special issue on query log analysis: Technology and ethics
Optimizing relevance and revenue in ad search: a query substitution approach
Filip Radlinski
Peter Ciccolo
Evgeniy Gabrilovich
Vanja Josifovski
Lance Riedel
SIGIR (2008), pp. 403-410
Computational advertising and recommender systems
RecSys (2008), pp. 1-2
Cross-lingual query classification: a preliminary study
Xuerui Wang
Evgeniy Gabrilovich
Vanja Josifovski
Bo Pang
CIKM-iNEWS (2008), pp. 101-104
Search advertising using web relevance feedback
Peter Ciccolo
Marcus Fontoura
Evgeniy Gabrilovich
Vanja Josifovski
Lance Riedel
CIKM (2008), pp. 1013-1022
To swing or not to swing: learning when (not) to advertise
Marcus Fontoura
Evgeniy Gabrilovich
Vanja Josifovski
Vanessa Murdock
Vassilis Plachouras
CIKM (2008), pp. 1003-1012
The hiring problem and Lake Wobegon strategies
Adam Kirsch
Ravi Kumar
Michael Mitzenmacher
Eli Upfal
SODA (2008), pp. 1184-1193
Computational advertising
SODA (2008), pp. 992
Effective and efficient classification on a search-engine model
Just-in-time contextual advertising
Aris Anagnostopoulos
Evgeniy Gabrilovich
Vanja Josifovski
Lance Riedel
CIKM (2007), pp. 331-340
Robust classification of rare queries using web knowledge
Marcus Fontoura
Evgeniy Gabrilovich
Amruta Joshi
Vanja Josifovski
Tong Zhang
SIGIR (2007), pp. 231-238
Estimating rates of rare events at multiple resolutions
Deepak Agarwal
Deepayan Chakrabarti
Dejan Diklic
Vanja Josifovski
Mayssam Sayyadian
KDD (2007), pp. 16-25
Margin Based Active Learning
A semantic approach to contextual advertising
The Next Generation Web Search and the Demise of the Classic IR Model
ECIR (2007), pp. 1
Effective and efficient classification on a search-engine model
The Future of Web Search: From Information Retrieval to Information Supply
NGITS (2006), pp. 362
Workshop on Algorithms and Models for the Web Graph
Estimating corpus size via queries
Marcus Fontoura
Vanja Josifovski
Ravi Kumar
Rajeev Motwani
Shubha U. Nabar
Rina Panigrahy
Ying Xu 0002
CIKM (2006), pp. 594-603
Modelling and Mining of Networked Information Spaces
Efficient PageRank approximation via graph aggregation
Sampling Search-Engine Results
How search engines shape the web
Byron Dom
Krishna Bharat
Jan O. Pedersen
Yoshinobu Tonomura
WWW (Special interest tracks and posters) (2005), pp. 879
Sampling search-engine results
Using XML to Query XML - From Theory to Practice
Towards the next generation of enterprise search technology
Efficient pagerank approximation via graph aggregation
Ronny Lempel
Farzin Maghoul
Jan O. Pedersen
WWW (Alternate Track Papers & Posters) (2004), pp. 484-485
Sic transit gloria telae: towards an understanding of the web's decay
Invited Talk: The Many Wonders of the Web Graph
CAAN (2004), pp. 154-154
Survey: Network Applications of Bloom Filters: A Survey
Efficient query evaluation using a two-level retrieval process
Keynote Address - exploring, modeling, and using the web graph
SIGIR (2003), pp. 1
Efficient URL caching for world wide web crawling
A derandomization using min-wise independent permutations
Moses Charikar
Michael Mitzenmacher
J. Discrete Algorithms, vol. 1 (2003), pp. 11-20
A taxonomy of web search
SIGIR Forum, vol. 36 (2002), pp. 3-10
A general approach to dynamic packet routing with bounded buffers
Completeness and robustness properties of min-wise independent permutations
Min-wise Independent Permutations: Theory and Practice
ICALP (2000), pp. 808
Identifying and Filtering Near-Duplicate Documents
CPM (2000), pp. 1-10
Introduction: The Fourth International Workshop on Randomization and Approximation Techniques in Computer Science
ICALP Satellite Workshops (2000), pp. 1-2
Min-Wise versus linear independence (extended abstract)
A Comparison of Techniques to Find Mirrored Hosts on the WWW
Krishna Bharat
Monika Rauch Henzinger
IEEE Data Eng. Bull., vol. 23 (2000), pp. 21-26
Graph structure in the Web
Ravi Kumar
Farzin Maghoul
Sridhar Rajagopalan
Raymie Stata
Janet L. Wiener
Computer Networks, vol. 33 (2000), pp. 309-320
Min-Wise Independent Permutations
Moses Charikar
Alan M. Frieze
Michael Mitzenmacher
J. Comput. Syst. Sci., vol. 60 (2000), pp. 630-659
A Comparison of Techniques to Find Mirrored Hosts on the WWW
Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content
Completeness and Robustness Properties of Min-Wise Independent Permutations
Balanced Allocations
Static and Dynamic Path Selection on Expander Graphs: A Random Walk Approach
Dynamic Packet Routing on Arrays with Bounded Buffers
A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines
A Derandomization Using Min-Wise Independent Permutations
The Connectivity Server: Fast Access to Linkage Information on the Web
Krishna Bharat
Monika Rauch Henzinger
Puneet Kumar
Suresh Venkatasubramanian
Computer Networks, vol. 30 (1998), pp. 469-477
Min-Wise Independent Permutations (Extended Abstract)
Optimal Construction of Edge-Disjoint Paths in Random Graphs
Syntactic Clustering of the Web
Steven C. Glassman
Mark S. Manasse
Geoffrey Zweig
Computer Networks, vol. 29 (1997), pp. 1157-1166
Counting Minimum Weight Spanning Trees
Static and Dynamic Path Selection on Expander Graphs: A Random Walk Approach (Preliminary Version)
Dynamic Deflection Routing on Arrays (Preliminary Version)
A General Approach to Dynamic Packet Routing with Bounded Buffers (extended abstract)
An Efficient Algorithm for the Vertex-Disjoint Paths Problem in Random Graphs
Pattern-based Compression of Text Images
Biased Random Walks
Yossi Azar
Anna R. Karlin
Nathan Linial
Steven Phillips
Combinatorica, vol. 16 (1996), pp. 1-18
Balanced Allocations for Tree-Like Inputs
Alan M. Frieze
Carsten Lund
Steven Phillips
Nick Reingold
Inf. Process. Lett., vol. 55 (1995), pp. 329-332
The Worst-Case Running Time of the Random Simplex Algorithm is Exponential in the Height
Martin E. Dyer
Alan M. Frieze
Eli Upfal
Inf. Process. Lett., vol. 56 (1995), pp. 79-81
Finding Hidden Hamiltonian Cycles
On-Line Load Balancing
Balanced allocations (extended abstract)
Optimal Construction of Edge-Disjoint Paths in Random Graphs
On the Problem of Approximating the Number of Bases of a Matroid
Near-perfect Token Distribution
Alan M. Frieze
Eli Shamir
Eli Upfal
Random Struct. Algorithms, vol. 5 (1994), pp. 559-572
Existence and Construction of Edge-Disjoint Paths on Expander Graphs
Trading Space for Time in Undirected s-t Connectivity
Anna R. Karlin
Eli Upfal
SIAM J. Comput., vol. 23 (1994), pp. 324-334
On the Satisfiability and Maximum Satisfiability of Random 3-CNF Formulas
On-line Choice of On-line Algorithms
Existence and Construction of Edge Disjoint Paths on Expander Graphs
On-line Load Balancing (Extended Abstract)
Biased Random Walks
Near-perfect Token Distribution
On the Parallel Complexity of Evaluating Game Trees
Finding Hidden Hamiltonian Cycles (Extended Abstract)
The Cost Distribution of Clustering in Random Probing
Multilevel Adaptive Hashing
Trading Space for Time in Undirected s-t Connectivity
Generating Random Spanning Trees
FOCS (1989), pp. 442-447
On Generating Solved Instances of Computational Problems
Eric Allender
Joan Feigenbaum
Lane A. Hemachandra
CRYPTO (1988), pp. 297-310
Bounds on the Cover Time (Preliminary Version)
Errata to "How hard is to marry at random? (On the approximation of the permanent)"
STOC (1988), pp. 551
Efficient Fault-Tolerant Routings in Networks
Danny Dolev
Michael J. Fischer
Barbara Simons
Inf. Comput., vol. 75 (1987), pp. 52-64
On the Second Eigenvalue of Random Regular Graphs (Preliminary Version)
How hard is to marry at random? (On the approximation of the permanent)
STOC (1986), pp. 50-58
A Provably Secure Polynomial Approximation Scheme for the Distributed Lottery Problem (Extended Abstract)
PODC (1985), pp. 136-148
Placing Tiles in the Plane
Flipping coins in many pockets (Byzantine agreement on uniformly random values)
The r-Stirling numbers
Discrete Mathematics, vol. 49 (1984), pp. 241-259
Efficient Fault Tolerant Routings in Networks