Improved Consistent Sampling, Weighted Minhash and L1 Sketching
Abstract
We propose a new Consistent Weighted Sampling method, where the probability of
drawing identical samples for a pair of inputs is equal to their Jaccard
similarity. Our method takes deterministic constant time per non-zero weight,
improving on the best previous approach which takes expected constant time. The
samples can be used as Weighted Minhash for efficient retrieval and compression
(sketching) under Jaccard or L1 metric. A method is presented for using simple
data statistics to reduce the running time of hash computation by two orders of
magnitude. We compare our method with the random projection method and show
that it has better characteristics for retrieval under L1. We present a novel
method of mapping hashes to short bit-strings, apply it to Weighted Minhash, and
achieve more accurate distance estimates from sketches than existing methods, as
long as the inputs are sufficiently distinct. We show how to choose the optimal
number of bits per hash for sketching, and demonstrate experimental results
which agree with the theoretical analysis.
drawing identical samples for a pair of inputs is equal to their Jaccard
similarity. Our method takes deterministic constant time per non-zero weight,
improving on the best previous approach which takes expected constant time. The
samples can be used as Weighted Minhash for efficient retrieval and compression
(sketching) under Jaccard or L1 metric. A method is presented for using simple
data statistics to reduce the running time of hash computation by two orders of
magnitude. We compare our method with the random projection method and show
that it has better characteristics for retrieval under L1. We present a novel
method of mapping hashes to short bit-strings, apply it to Weighted Minhash, and
achieve more accurate distance estimates from sketches than existing methods, as
long as the inputs are sufficiently distinct. We show how to choose the optimal
number of bits per hash for sketching, and demonstrate experimental results
which agree with the theoretical analysis.