Andrei Z. Broder

Andrei Z. Broder

Authored Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Learning Effective Embeddings for Machine Generated Emails with Applications to Email Category Prediction
    Yu Sun
    Luis Garcia Pueyo
    Proceedings of the IEEE International Conference on Big Data (2018), pp. 1846-1855
    Preview abstract Machine-generated business-to-consumer (B2C) emails such as receipts, newsletters, and promotions constitute today a large portion of users' inbox. These emails reflect the users' interests and often are sequentially correlated, e.g., users interested in relocating may receive a sequence of messages on housing, moving, job availability, etc. We aim to infer (and eventually serve) the users' future interests by predicting the categories of their future emails. There are many good methods such as recurrent neural networks that can be applied for such predictions but in all cases the key to better performance is an effective representation of emails and users. To this end, we propose a general framework for embedding learning for emails and users, using as input only the sequence of B2C templates users receive and open. (A template is a B2C email stripped of all transient information related to specific users.) These learned embeddings allow us to identify both sequentially correlated emails and users with similar sequential interests. We can also use the learned embeddings either as input features or embedding initializers for email category predictions. Extensive experiments with millions of fully anonymized B2C emails demonstrate that the learned embeddings can significantly improve the prediction accuracy for future email categories. We hope that this effective yet simple embedding learning framework will inspire new machine intelligence applications that will improve the users' email experience. View details
    Email Category Prediction
    Aston Zhang
    Luis Garcia Pueyo
    Companion Proc. of the 26th International World Wide Web Conference (2017), pp. 495-503
    Preview abstract According to recent estimates, about 90% of consumer received emails are machine-generated. Such messages include shopping receipts, promotional campaigns, newsletters, booking confirmations, etc. Most such messages are created by populating a fixed template with a small amount of personalized information, such as name, salutation, reservation numbers, dates, etc. Web mail providers (Gmail, Hotmail, Yahoo) are leveraging the structured nature of such emails to extract salient information and use it to improve the user experience: e.g. by automatically entering reservation data into a user calendar, or by sending alerts about upcoming shipments. To facilitate these extraction tasks it is helpful to classify templates according to their category, e.g. restaurant reservations or bill reminders, since each category triggers a particular user experience. Recent research has focused on discovering the causal thread of templates, e.g. inferring that a shopping order is usually followed by a shipping confirmation, an airline booking is followed by a confirmation and then by a “ready to check in” message, etc. Gamzu et al. took this idea one step further by implementing a method to predict the template category of future emails for a given user based on previously received templates. The motivation is that predicting future emails has a wide range of potential applications, including better user experiences (e.g. warning users of items ordered but not shipped), targeted advertising (e.g. users that recently made a flight reservation may be interested in hotel reservations), and spam classification (a message that is part of a legitimate causal thread is unlikely to be spam). The gist of the Gamzu et al. approach is modeling the problem as a Markov chain, where the nodes are templates or temporal events (e.g. the first day of the month). This paper expands on their work by investigating the use of neural networks for predicting the category of emails that will arrive during a fixed-sized time window in the future. We consider two types of neural networks: multi-layer perceptrons (MLP), a type of feedforward neural network; and long short-term memory (LSTM), a type of recurrent neural network. For each type of neural network, we explore the effects of varying their configuration (e.g. number of layers or number of neurons) and hyper-parameters (e.g. drop-out ratio). We find that the prediction accuracy of neural networks vastly outperforms the Markov chain approach, and that LSTMs perform slightly better than MLPs. We offer some qualitative interpretation of our findings and identify some promising future directions. View details
    Scalable K-Means by ranked retrieval
    Lluis Garcia-Pueyo
    Vanja Josifovski
    Srihari Venkatesan
    Proceedings of the 7th ACM international conference on Web search and data mining, ACM, New York, NY, USA (2014), pp. 233-242
    Preview abstract The k-means clustering algorithm has a long history and a proven practical performance, however it does not scale to clustering millions of data points into thousands of clusters in high dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, aprohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality. The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low and thus they can be separately assigned once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the results of the intermediate results of nearest neighbor computations to improve performance. View details
    Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe
    Shirshanka Das
    Marcus Fontoura
    Bhaskar Ghosh
    Vanja Josifovski
    Jayavel Shanmugasundaram
    The 20th International World Wide Web Confererence (WWW 2011)
    Preview
    Information extraction meets relation databases
    Davood Rafiei
    Edward Chang
    Patrick Pantel
    CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management, ACM, New York, NY, USA (2009), pp. 897-897
    Preview
    Indexing Shared Content in Information Retrieval Systems
    Nadav Eiron
    Marcus Fontoura
    Michael Herscovici
    Ronny Lempel
    John McPherson
    Runping Qi
    Eugene J. Shekita
    EDBT (2006), pp. 313-330
    Preview
    Current trends in the integration of searching and browsing
    Yoelle S. Maarek
    Krishna Bharat
    Susan T. Dumais
    Steve Papa
    Jan O. Pedersen
    WWW (Special interest tracks and posters) (2005), pp. 793
    Preview