Annotating needles in the haystack without looking: Product information extraction from emails

Alex J. Smola

Amr Ahmed

Jie Yang

Vanja Josifovski

Weinan Zhang

KDD 2015

Abstract

Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g. purchase, event) into templates. Extracting structured data from B2C emails allows users to track important information on various devices. However, it also poses several challenges, due to the requirement of short response times over massive data volumes, the diversity and complexity of templates, and privacy and legal constraints. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or derived information. In this paper we first introduce a system that extracts structured information automatically without requiring human review of any personal content. We then focus on how to annotate product names in the extracted text, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structure learning methods, such as Conditional Random Fields (CRFs), solve this problem well. To accomplish this task, we propose a hybrid approach that trains a CRF model using the labels predicted by binary classifiers (weak learners). However, the performance of the weak learners can be low, so we apply the Expectation Maximization (EM) algorithm to the CRF to remove the noise and improve accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves product name annotation over the weak learners and plain CRFs.
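The loop described in the abstract (noisy labels from weak learners, a sequence model trained on them, and iterative re-estimation to reduce label noise) can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything here is an assumption: the capitalization-based `weak_label` heuristic is hypothetical, a simple generative chain model with smoothed counts stands in for the CRF, and hard EM (Viterbi relabeling followed by re-estimation) stands in for the EM procedure. Sequences are assumed to be non-empty.

```python
import math
from collections import defaultdict

LABELS = (0, 1)  # 0 = other token, 1 = product-name token


def weak_label(token):
    """Hypothetical weak learner: guess that capitalized tokens are product tokens."""
    return 1 if token[:1].isupper() else 0


def fit(sequences, labelings, smooth=0.1):
    """M-step: estimate smoothed log emission and transition tables from current labels."""
    emit = {y: defaultdict(float) for y in LABELS}
    trans = {y: defaultdict(float) for y in LABELS}
    vocab = {tok.lower() for seq in sequences for tok in seq}
    for seq, labs in zip(sequences, labelings):
        prev = None
        for tok, y in zip(seq, labs):
            emit[y][tok.lower()] += 1
            if prev is not None:
                trans[prev][y] += 1
            prev = y

    def normalize(table, keys):
        out = {}
        for y in LABELS:
            total = sum(table[y][k] for k in keys) + smooth * len(keys)
            out[y] = {k: math.log((table[y][k] + smooth) / total) for k in keys}
        return out

    return normalize(emit, vocab), normalize(trans, LABELS)


def viterbi(seq, emit, trans):
    """Hard E-step: most likely label sequence under the current tables."""
    scores = [{y: emit[y][seq[0].lower()] for y in LABELS}]
    back = []
    for tok in seq[1:]:
        row, ptr = {}, {}
        for y in LABELS:
            best = max(LABELS, key=lambda p: scores[-1][p] + trans[p][y])
            row[y] = scores[-1][best] + trans[best][y] + emit[y][tok.lower()]
            ptr[y] = best
        scores.append(row)
        back.append(ptr)
    y = max(LABELS, key=lambda label: scores[-1][label])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


def em_denoise(sequences, iterations=5):
    """Start from weak labels, then alternate re-estimation and relabeling."""
    labelings = [[weak_label(tok) for tok in seq] for seq in sequences]
    for _ in range(iterations):
        emit, trans = fit(sequences, labelings)
        labelings = [viterbi(seq, emit, trans) for seq in sequences]
    return labelings
```

Calling `em_denoise` on a list of token lists returns one 0/1 label list per sequence. No raw text is inspected by a person at any point in this loop, which is the property the abstract emphasizes; whether relabeling actually improves on the weak labels depends on the data and the model, and this toy version makes no claim about that.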
