Google Research

Google Patent Phrase Similarity Dataset


A human-rated contextual phrase-to-phrase matching dataset focused on technical terms from patents. In addition to the similarity scores typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain-related. The dataset was used in the U.S. Patent Phrase to Phrase Matching competition.

To better train the next generation of state-of-the-art models, we created the Patent Phrase Similarity dataset that focuses on addressing the following problems:

  • Phrase disambiguation: some keywords and phrases can have multiple meanings (e.g., the phrase "mouse" may refer to an animal or a computer input device), so we disambiguate the phrases by including Cooperative Patent Classification (CPC) classes with each pair of phrases.

  • Adversarial keyword match: many NLP models (e.g., bag-of-words models) perform poorly on phrases that share keywords but are otherwise unrelated (e.g., “container section” → “kitchen container”, “offset table” → “table fan”). The Patent Phrase Similarity dataset is designed to include many such adversarial keyword matches, so that NLP models trained on it can improve their performance on these cases.

  • Hard negative keywords: keywords that are unrelated but received a high score for similarity from other models.
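To make the adversarial keyword match problem concrete, the minimal sketch below scores the example pairs above with a token-set Jaccard overlap, used here as a stand-in for a simple bag-of-words model. The function name `jaccard_overlap` and the whitespace tokenizer are illustrative assumptions, not part of the dataset or its tooling.

```python
def jaccard_overlap(phrase_a: str, phrase_b: str) -> float:
    """Token-set Jaccard similarity (hypothetical stand-in for a bag-of-words model)."""
    # Assumption: lowercase whitespace tokenization is enough for this illustration.
    tokens_a = set(phrase_a.lower().split())
    tokens_b = set(phrase_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Example pairs taken from the text: they share a keyword but are otherwise unrelated.
pairs = [
    ("container section", "kitchen container"),
    ("offset table", "table fan"),
]
for anchor, target in pairs:
    print(f"{anchor!r} vs {target!r}: {jaccard_overlap(anchor, target):.2f}")
```

Each pair shares one of three distinct tokens, so a purely lexical measure assigns a non-zero similarity even though the phrases are unrelated in meaning.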

Each entry of the dataset contains two phrases (an anchor and a target), a context CPC class, a rating class, and a similarity score. The rating classes have the following meanings:

  • 4 - Very high
  • 3 - High
  • 2 - Medium
  • 2a - Hyponym (broad-narrow match)
  • 2b - Hypernym (narrow-broad match)
  • 2c - Structural match
  • 1 - Low
  • 1a - Antonym
  • 1b - Meronym (a part of)
  • 1c - Holonym (a whole of)
  • 1d - Other high level domain match
  • 0 - Not related
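The rating classes above can be held in a simple lookup table. The sketch below is one possible representation; the names `RATING_CLASSES` and `rating_level` are hypothetical, and treating the leading digit as the coarse level (so that "2a" falls under 2) is an assumption drawn from how the list is numbered.

```python
# Rating codes and meanings, copied from the list above.
RATING_CLASSES = {
    "4": "Very high",
    "3": "High",
    "2": "Medium",
    "2a": "Hyponym (broad-narrow match)",
    "2b": "Hypernym (narrow-broad match)",
    "2c": "Structural match",
    "1": "Low",
    "1a": "Antonym",
    "1b": "Meronym (a part of)",
    "1c": "Holonym (a whole of)",
    "1d": "Other high level domain match",
    "0": "Not related",
}


def rating_level(rating: str) -> int:
    """Return the coarse numeric level of a rating code, e.g. '2a' -> 2.

    Assumption: the leading digit of a code is its coarse level.
    """
    if rating not in RATING_CLASSES:
        raise ValueError(f"unknown rating class: {rating!r}")
    return int(rating[0])


for code in ("4", "2a", "1c", "0"):
    print(code, rating_level(code), RATING_CLASSES[code])
```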

The dataset contains 48,548 entries, split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all entries with the same anchor are kept together in the same set; there are 973 unique anchors in total. There are 106 different context CPC classes, and all of them are represented in the training set.
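One way to keep every entry with the same anchor in the same set is a grouped split. The following is a minimal sketch under stated assumptions, not the procedure actually used to build the dataset: the function name `split_by_anchor`, the `"anchor"` dictionary key, the greedy fill order, and the seed are all hypothetical. It also does not enforce that every CPC class appears in the training set.

```python
import random
from collections import defaultdict


def split_by_anchor(entries, fractions=(0.75, 0.05, 0.20), seed=0):
    """Assign whole anchor groups to train/validation/test (illustrative only)."""
    # Group entries so that an anchor can never straddle two splits.
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["anchor"]].append(entry)

    # Shuffle anchors deterministically.
    anchors = sorted(groups)
    random.Random(seed).shuffle(anchors)

    names = ["train", "validation", "test"]
    targets = [fraction * len(entries) for fraction in fractions]
    splits = {name: [] for name in names}

    # Greedily fill each split until it reaches its target size, then move on.
    current = 0
    for anchor in anchors:
        while current < len(names) - 1 and len(splits[names[current]]) >= targets[current]:
            current += 1
        splits[names[current]].extend(groups[anchor])
    return splits
```

Because whole groups are assigned at once, the resulting proportions only approximate 75/5/20 when anchors have different numbers of entries.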

More details about the dataset are available in the corresponding paper and blog post. Please cite the paper if you use the dataset.