Andrei Z. Broder
Research Areas
Authored Publications
Sort By
Learning Effective Embeddings for Machine Generated Emails with Applications to Email Category Prediction
Yu Sun
Luis Garcia Pueyo
Proceedings of the IEEE International Conference on Big Data (2018), pp. 1846-1855
Preview abstract
Machine-generated business-to-consumer (B2C) emails such as receipts, newsletters, and promotions constitute today a large portion of users' inbox. These emails reflect the users' interests and often are sequentially correlated, e.g., users interested in relocating may receive a sequence of messages on housing, moving, job availability, etc. We aim to infer (and eventually serve) the users' future interests by predicting the categories of their future emails. There are many good methods such as recurrent neural networks that can be applied for such predictions but in all cases the key to better performance is an effective representation of emails and users. To this end, we propose a general framework for embedding learning for emails and users, using as input only the sequence of B2C templates users receive and open. (A template is a B2C email stripped of all transient information related to specific users.) These learned embeddings allow us to identify both sequentially correlated emails and users with similar sequential interests. We can also use the learned embeddings either as input features or embedding initializers for email category predictions. Extensive experiments with millions of fully anonymized B2C emails demonstrate that the learned embeddings can significantly improve the prediction accuracy for future email categories. We hope that this effective yet simple embedding learning framework will inspire new machine intelligence applications that will improve the users' email experience.
View details
Email Category Prediction
Aston Zhang
Luis Garcia Pueyo
Companion Proc. of the 26th International World Wide Web Conference (2017), pp. 495-503
Preview abstract
According to recent estimates, about 90% of consumer received emails are machine-generated. Such messages include shopping receipts, promotional campaigns, newsletters, booking confirmations, etc. Most such messages are created by populating a fixed template with a small amount of personalized information, such as name, salutation, reservation numbers, dates, etc. Web mail providers (Gmail, Hotmail, Yahoo) are leveraging the structured nature of such emails to extract salient information and use it to improve the user experience: e.g. by automatically entering reservation data into a user calendar, or by sending alerts about upcoming shipments. To facilitate these extraction tasks it is helpful to classify templates according to their category, e.g. restaurant reservations or bill reminders, since each category triggers a particular user experience.
Recent research has focused on discovering the causal thread of templates, e.g. inferring that a shopping order is usually followed by a shipping confirmation, an airline booking is followed by a confirmation and then by a “ready to check in” message, etc. Gamzu et al. took this idea one step further by implementing a method to predict the template category of future emails for a given user based on previously received templates. The motivation is that predicting future emails has a wide range of potential applications, including better user experiences (e.g. warning users of items ordered but not shipped), targeted advertising (e.g. users that recently made a flight reservation may be interested in hotel reservations), and spam classification (a message that is part of a legitimate causal thread is unlikely to be spam).
The gist of the Gamzu et al. approach is modeling the problem as a Markov chain, where the nodes are templates or temporal events (e.g. the first day of the month). This paper expands on their work by investigating the use of neural networks for predicting the category of emails that will arrive during a fixed-sized time window in the future. We consider two types of neural networks: multi-layer perceptrons (MLP), a type of feedforward neural network; and long short-term memory (LSTM), a type of recurrent neural network. For each type of neural network, we explore the effects of varying their configuration (e.g. number of layers or number of neurons) and hyper-parameters (e.g. drop-out ratio). We find that the prediction accuracy of neural networks vastly outperforms the Markov chain approach, and that LSTMs perform slightly better than MLPs. We offer some qualitative interpretation of our findings and identify some promising future directions.
View details
Scalable K-Means by ranked retrieval
Lluis Garcia-Pueyo
Vanja Josifovski
Srihari Venkatesan
Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242
Preview abstract
The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality.
The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance.
View details
IR paradigms in computational advertising
Preview
SIGIR (2012), pp. 1019
Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe
Preview
Shirshanka Das
Marcus Fontoura
Bhaskar Ghosh
Vanja Josifovski
Jayavel Shanmugasundaram
The 20th International World Wide Web Confererence (WWW 2011)
Information extraction meets relation databases
Preview
Davood Rafiei
Edward Chang
Patrick Pantel
CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, ACM, New York, NY, USA (2009), pp. 897-897
Indexing Shared Content in Information Retrieval Systems
Preview
Nadav Eiron
Marcus Fontoura
Michael Herscovici
Ronny Lempel
John McPherson
Runping Qi
Eugene J. Shekita
EDBT (2006), pp. 313-330
Current trends in the integration of searching and browsing
Preview
Yoelle S. Maarek
Krishna Bharat
Susan T. Dumais
Steve Papa
Jan O. Pedersen
WWW (Special interest tracks and posters) (2005), pp. 793