Annotating needles in the haystack without looking: Product information extraction from emails

Alex J. Smola

Amr Ahmed

Jie Yang

Vanja Josifovski

Weinan Zhang

KDD 2015

Abstract

Business-to-consumer (B2C) emails are usually generated by
filling structured user data (e.g., purchases, events) into
templates. Extracting structured data from B2C emails allows
users to track important information on various devices.

However, this extraction also poses several challenges: the
requirement of short response times for massive data volumes,
the diversity and complexity of templates, and privacy and
legal constraints. Most notably, email data is legally
protected content, which means that no one except the recipient
can review the messages or derived information.

In this paper, we first introduce a system that extracts
structured information automatically without requiring human
review of any personal content. We then focus on how to
annotate product names in the extracted text, which is one of
the most difficult problems in the system. Neither general
learning methods, such as binary classifiers, nor more specific
structured learning methods, such as Conditional Random Fields
(CRFs), solve this problem well.

To accomplish this task, we propose a hybrid approach that
trains a CRF model on the labels predicted by binary
classifiers (weak learners). Because the performance of the
weak learners can be low, we apply the Expectation Maximization
(EM) algorithm to the CRF to remove noise and improve accuracy,
without the need to label or inspect specific emails. In our
experiments, the EM-CRF model significantly improves product
name annotations over both the weak learners and plain CRFs.
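
The hybrid idea above (start from noisy weak-learner labels, fit a sequence model, relabel, and repeat) can be illustrated with a toy loop. This is only a sketch under strong assumptions and not the paper's implementation: a count-based chain tagger stands in for the CRF, the loop uses hard (Viterbi) relabeling rather than a full EM over a CRF, and the names `weak_label`, `fit`, `score`, `viterbi`, `self_train`, and the seed lexicon are all hypothetical.

```python
import math
from collections import Counter

TAGS = ("O", "P")  # O = other token, P = product-name token


def weak_label(tokens, lexicon):
    # Hypothetical weak learner: a token is tagged P if it is in a seed lexicon.
    return ["P" if tok.lower() in lexicon else "O" for tok in tokens]


def fit(sequences, labelings):
    # Count emissions (tag, word) and transitions (previous tag, tag).
    emit, trans = Counter(), Counter()
    for tokens, tags in zip(sequences, labelings):
        prev = "<s>"
        for tok, tag in zip(tokens, tags):
            emit[(tag, tok.lower())] += 1
            trans[(prev, tag)] += 1
            prev = tag
    return emit, trans


def score(counts, context, outcome, n_outcomes, alpha=0.1):
    # Add-alpha smoothed log-probability of `outcome` given `context`.
    total = sum(c for (ctx, _), c in counts.items() if ctx == context)
    return math.log((counts[(context, outcome)] + alpha) / (total + alpha * n_outcomes))


def viterbi(tokens, emit, trans, vocab_size):
    # Highest-scoring tag sequence under the count-based chain model.
    if not tokens:
        return []
    n_words = vocab_size + 1  # one extra slot for unseen words
    best = {
        t: (score(trans, "<s>", t, len(TAGS))
            + score(emit, t, tokens[0].lower(), n_words), [t])
        for t in TAGS
    }
    for tok in tokens[1:]:
        nxt = {}
        for t in TAGS:
            e = score(emit, t, tok.lower(), n_words)
            cands = [(best[p][0] + score(trans, p, t, len(TAGS)) + e, best[p][1] + [t])
                     for p in TAGS]
            nxt[t] = max(cands, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]


def self_train(sequences, lexicon, rounds=3):
    # Start from weak labels, then alternate refitting and hard relabeling.
    labelings = [weak_label(s, lexicon) for s in sequences]
    vocab_size = len({tok.lower() for s in sequences for tok in s})
    for _ in range(rounds):
        emit, trans = fit(sequences, labelings)  # refit on current labels
        labelings = [viterbi(s, emit, trans, vocab_size) for s in sequences]  # relabel
    return labelings


# Made-up example data, for illustration only.
emails = [
    ["Your", "order", "of", "Acme", "Blender", "shipped"],
    ["Acme", "Toaster", "delivered"],
]
for tokens, tags in zip(emails, self_train(emails, {"acme"})):
    print(list(zip(tokens, tags)))
```

No step in the loop requires a person to read an email: the only supervision is the weak labeler's output, which mirrors the privacy constraint described above. A real system would replace the count-based tagger with a CRF and the hard relabeling with proper posterior estimates.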
