Annotating needles in the haystack without looking: Product information extraction from emails
Abstract
Business-to-consumer (B2C) emails are usually generated by
filling structured user data (e.g., purchases, events) into
templates. Extracting structured data from B2C emails allows
users to track important information across devices.
However, extraction poses several challenges: response times
must be short despite massive data volumes, templates are
diverse and complex, and privacy and legal constraints apply.
Most notably, email data is legally protected content, which
means that no one except the recipient can review the
messages or any information derived from them.
In this paper we first introduce a system that extracts
structured information automatically without requiring human
review of any personal content. We then focus on how to
annotate product names in the extracted text, which is one of
the most difficult problems in the system. Neither general
learning methods, such as binary classifiers, nor more
specific structured learning methods, such as Conditional
Random Fields (CRFs), solve this problem well.
To accomplish this task, we propose a hybrid approach that
trains a CRF model on the labels predicted by binary
classifiers (weak learners). Because the weak learners can
perform poorly, we apply the Expectation Maximization (EM)
algorithm to the CRF to remove label noise and improve
accuracy, without the need to label or inspect specific
emails. In our experiments, the EM-CRF model significantly
improves product name annotation over both the weak learners
and plain CRFs.
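
The hybrid idea above can be illustrated with a minimal sketch. This is not the paper's EM-CRF: it assumes a small HMM-style tagger as a stand-in for the CRF and uses hard EM (Viterbi relabeling) rather than a full expectation step. All names (`fit`, `viterbi`, `hard_em`) and the toy data are hypothetical. Labels are 1 for product-name tokens and 0 otherwise; the initial labels play the role of the weak learners' noisy predictions.

```python
import math
from collections import defaultdict

def fit(sequences, labels, smoothing=1.0):
    """M-step: re-estimate smoothed counts from the current labels."""
    emit = {0: defaultdict(float), 1: defaultdict(float)}
    trans = [[smoothing, smoothing], [smoothing, smoothing]]
    start = [smoothing, smoothing]
    vocab = set()
    for tokens, tags in zip(sequences, labels):
        prev = None
        for tok, tag in zip(tokens, tags):
            emit[tag][tok] += 1.0
            vocab.add(tok)
            if prev is None:
                start[tag] += 1.0
            else:
                trans[prev][tag] += 1.0
            prev = tag
    return emit, trans, start, vocab, smoothing

def viterbi(tokens, model):
    """Hard E-step: most likely tag sequence under the current model."""
    emit, trans, start, vocab, s = model
    if not tokens:
        return []

    def log_emit(tag, tok):
        total = sum(emit[tag].values()) + s * (len(vocab) + 1)
        return math.log((emit[tag].get(tok, 0.0) + s) / total)

    def log_norm(row, j):
        return math.log(row[j] / sum(row))

    score = [log_norm(start, t) + log_emit(t, tokens[0]) for t in (0, 1)]
    back = []
    for tok in tokens[1:]:
        new_score, ptr = [], []
        for t in (0, 1):
            cands = [score[p] + log_norm(trans[p], t) for p in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + log_emit(t, tok))
        score = new_score
        back.append(ptr)
    tag = 0 if score[0] >= score[1] else 1
    path = [tag]
    for ptr in reversed(back):  # follow back-pointers to the first token
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]

def hard_em(sequences, weak_labels, iterations=5):
    """Alternate fitting and relabeling, starting from weak-learner labels."""
    labels = [list(tags) for tags in weak_labels]
    for _ in range(iterations):
        model = fit(sequences, labels)
        new_labels = [viterbi(tokens, model) for tokens in sequences]
        if new_labels == labels:  # labels stopped changing
            break
        labels = new_labels
    return labels

# Toy usage: the third sequence has one noisy weak label on "blue".
sequences = [["qty", "1", "blue", "widget"],
             ["qty", "2", "blue", "widget"],
             ["qty", "3", "blue", "widget"]]
weak_labels = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
refined = hard_em(sequences, weak_labels)
```

No human inspects any sequence in this loop: the only supervision is the weak labels, which matches the privacy constraint described above. Hard EM can also reinforce systematic weak-learner errors, so the sketch shows the control flow only and says nothing about the accuracy gains reported in the paper.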