Google Research

Annotating needles in the haystack without looking: Product information extraction from emails

KDD 2015

Abstract

Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g. purchase, event) into templates. Extracting structured data from B2C emails allows users to track important information on various devices.

However, it also poses several challenges, due to the requirement of short response time for massive data volume, the diversity and complexity of templates, and the privacy and legal constraints. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or derived information.

In this paper we first introduce a system which can extract structured information automatically without requiring human review of any personal content. Then we focus on how to annotate product names from the extracted texts, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structure learning methods, such as Conditional Random Field (CRF), can solve this problem well.

To accomplish this task, we propose a hybrid approach, which trains a CRF model using the labels predicted by binary classifiers (weak learners). However, the performance of the weak learners can be low, so we apply the Expectation Maximization (EM) algorithm to the CRF to remove the noise and improve the accuracy, without the need to label and inspect specific emails. In our experiments, the EM-CRF model significantly improves the product name annotations over the weak learners and plain CRFs.
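
The general idea of refining noisy weak labels with an iterative loop can be illustrated with a toy sketch. This is not the paper's EM-CRF: as a loudly stated simplification, it swaps in hard EM (Viterbi re-labeling) over a small HMM-style tagger, because a full CRF does not fit in a short example. The names `weak_label`, `fit`, `viterbi`, and `hard_em`, the keyword-based weak learner, and the sample sentences are all hypothetical.

```python
import math
from collections import defaultdict

LABELS = (0, 1)  # 0 = other token, 1 = product-name token


def weak_label(tokens, hints):
    """Hypothetical weak learner: tag a token as product if it is a known hint."""
    return [1 if t.lower() in hints else 0 for t in tokens]


def fit(sequences, labelings, alpha=1.0):
    """M-step: estimate add-alpha smoothed log-probabilities from current labels."""
    emit = {y: defaultdict(float) for y in LABELS}
    trans = {y: {z: alpha for z in LABELS} for y in LABELS}
    start = {y: alpha for y in LABELS}
    vocab = {t for seq in sequences for t in seq}
    for seq, labs in zip(sequences, labelings):
        if not seq:
            continue
        start[labs[0]] += 1
        for i, (tok, y) in enumerate(zip(seq, labs)):
            emit[y][tok] += 1
            if i > 0:
                trans[labs[i - 1]][y] += 1
    start_total = sum(start.values())
    log_start = {y: math.log(start[y] / start_total) for y in LABELS}
    log_trans = {
        y: {z: math.log(trans[y][z] / sum(trans[y].values())) for z in LABELS}
        for y in LABELS
    }
    totals = {y: sum(emit[y].values()) + alpha * (len(vocab) + 1) for y in LABELS}

    def log_emit(y, tok):
        return math.log((emit[y].get(tok, 0.0) + alpha) / totals[y])

    return log_start, log_trans, log_emit


def viterbi(seq, model):
    """E-step (hard): most likely label sequence under the current model."""
    if not seq:
        return []
    log_start, log_trans, log_emit = model
    score = {y: log_start[y] + log_emit(y, seq[0]) for y in LABELS}
    back = []
    for tok in seq[1:]:
        new_score, ptr = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: score[p] + log_trans[p][y])
            new_score[y] = score[prev] + log_trans[prev][y] + log_emit(y, tok)
            ptr[y] = prev
        score = new_score
        back.append(ptr)
    y = max(LABELS, key=lambda l: score[l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


def hard_em(sequences, labelings, iterations=5):
    """Alternate fitting and re-labeling until labels stop changing."""
    for _ in range(iterations):
        model = fit(sequences, labelings)
        relabeled = [viterbi(seq, model) for seq in sequences]
        if relabeled == labelings:
            break
        labelings = relabeled
    return labelings


# Made-up example data; no real email content is involved.
sequences = [
    ["your", "order", "acme", "blender", "shipped"],
    ["acme", "toaster", "arrives", "monday"],
]
noisy = [weak_label(seq, {"blender"}) for seq in sequences]
refined = hard_em(sequences, noisy)
```

On a corpus this small the refined labels are not meaningful; the sketch only shows the control flow of starting from weak-learner output and alternating between model fitting and re-labeling, with no human ever inspecting the sequences.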
