Cuckoo Index: A Lightweight Secondary Index Structure
Abstract
In modern data warehousing, data skipping is essential for high query performance.
While index structures such as B-trees or hash tables allow for precise pruning, their large storage requirements make them impractical for indexing secondary columns.
Therefore, many systems rely on approximate indexes such as min-max sketches (ZoneMaps) or Bloom filters for cost-effective data pruning.
In this paper, we introduce Cuckoo Index (CI), an approximate secondary index structure that represents the many-to-many relationship between keys and data partitions in a highly space-efficient way.
At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions.
With our approach, we target point queries in a read-only setting and optimize for minimal space consumption.
In contrast to per-partition (Bloom) filters, CI produces correct results for lookups with keys that occur in at least one partition.
CI allows to configure the expected ratio of false positive partitions for lookups with non-occurring keys.
Our experiments with real-world data show that CI consumes significantly less space than per-partition filters for the same pruning power for low-to-medium cardinality columns.
While index structures such as B-trees or hash tables allow for precise pruning, their large storage requirements make them impractical for indexing secondary columns.
Therefore, many systems rely on approximate indexes such as min-max sketches (ZoneMaps) or Bloom filters for cost-effective data pruning.
In this paper, we introduce Cuckoo Index (CI), an approximate secondary index structure that represents the many-to-many relationship between keys and data partitions in a highly space-efficient way.
At its core, CI associates variable-sized fingerprints in a Cuckoo filter with compressed bitmaps indicating qualifying partitions.
With our approach, we target point queries in a read-only setting and optimize for minimal space consumption.
In contrast to per-partition (Bloom) filters, CI produces correct results for lookups with keys that occur in at least one partition.
CI allows to configure the expected ratio of false positive partitions for lookups with non-occurring keys.
Our experiments with real-world data show that CI consumes significantly less space than per-partition filters for the same pruning power for low-to-medium cardinality columns.