Permutation Indexing: Fast Approximate Retrieval from Large Corpora

Maxim Gurevich; Tamas Sarlos

22nd International Conference on Information and Knowledge Management (CIKM), ACM (2013)

Abstract

Inverted indexing is a ubiquitous technique used in retrieval systems, including web search. Despite its popularity, it has a drawback: query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques:

partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and
a data structure for compactly encoding results of all queries composed of terms in a cluster as continuous sequences of document ids.

Query results are then retrieved by fetching a few small chunks of these sequences. There is a price, though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, the recall, is controlled by the level of redundancy: the more space allocated to the permutation index, the higher the recall. We analyze permutation indexing both theoretically, under simplified document and query models, and empirically, on realistic document and query collections. We show that although permutation indexing cannot replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
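To make the idea of a lossy, contiguous encoding concrete, here is a toy sketch in Python. It is not the construction from the paper, whose details the abstract does not give. It assumes a single term cluster, uses random document orderings as a stand-in for whatever ordering the real index builds, and stores for each query one (ordering, start, length) chunk covering the longest contiguous run of matching documents. All function names and the data layout are hypothetical.

```python
import random


def true_matches(docs, query):
    """Ids of documents that contain every term of the query."""
    return {doc_id for doc_id, terms in docs.items() if query <= terms}


def longest_matching_run(order, matches):
    """Return (start, length) of the longest contiguous run of matches in `order`."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for pos, doc_id in enumerate(order):
        if doc_id in matches:
            if run_len == 0:
                run_start = pos
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


def build_index(docs, queries, num_orderings, seed=0):
    """Toy index: `num_orderings` plays the role of the redundancy level.

    Assumption: orderings are random shuffles; a real index would choose
    them far more carefully.
    """
    rng = random.Random(seed)
    doc_ids = sorted(docs)
    orderings = []
    for _ in range(num_orderings):
        order = doc_ids[:]
        rng.shuffle(order)
        orderings.append(order)

    table = {}
    for query in queries:
        matches = true_matches(docs, query)
        best = (0, 0, 0)  # (ordering index, start, length)
        for i, order in enumerate(orderings):
            start, length = longest_matching_run(order, matches)
            if length > best[2]:
                best = (i, start, length)
        table[query] = best
    return orderings, table


def retrieve(orderings, table, query):
    """Fetch one contiguous chunk of document ids; cost does not depend on corpus scans."""
    i, start, length = table[query]
    return orderings[i][start:start + length]


def recall(docs, orderings, table, query):
    """Fraction of the true results that the stored chunk returns."""
    matches = true_matches(docs, query)
    if not matches:
        return 1.0
    return len(set(retrieve(orderings, table, query)) & matches) / len(matches)
```

In this sketch every returned document is a true match, but some matches may be missing, which is the sense in which the encoding is lossy. Adding orderings uses more space and can only keep or raise the recall of each query, mirroring the space versus recall trade-off described above. Queries are passed as frozensets of terms so they can serve as dictionary keys.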

Research Areas

Algorithms and theory
