Re^3: What is the best way to store and look-up a "similarity vector"?

Right, traditional hashing prefers to "avalanche" the bits, so that any small change in the input yields a very different hash. The kind of hashing you need is locality-sensitive hashing (LSH), exactly as educated_foo pointed out. LSH does the opposite of normal hashing: it tries to make similar inputs collide.
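
To make the "avalanche" point concrete, here is a minimal sketch in Python (my choice of language, nothing in this thread depends on it). SHA-256 and the two sample strings are just assumptions for illustration. It counts how many output bits change when the input differs by a single character.

```python
import hashlib


def sha256_digest(text: str) -> bytes:
    """Return the raw 32-byte SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def hamming_distance_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two inputs that differ by one trailing character (illustrative only).
d1 = sha256_digest("similarity vector")
d2 = sha256_digest("similarity vectors")

differing = hamming_distance_bits(d1, d2)
total_bits = len(d1) * 8
print(f"{differing} of {total_bits} output bits differ")
```

With a well-behaved cryptographic hash you should see roughly half of the 256 bits flip, so the digests tell you nothing about how close the inputs were. That is exactly the property that makes ordinary hashes useless for similarity look-ups.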

Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc.
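
For anyone following those keywords, here is a rough sketch of how shingling, Jaccard similarity and MinHash signatures fit together. It is not taken from the slides. The function names, the shingle length k=3, the 128 hash functions and the use of seeded BLAKE2b as the hash family are all my own assumptions for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping character k-grams (shingles) of text.

    Texts shorter than k yield an empty set.
    """
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def seeded_hash(seed, item):
    """Hypothetical hash family: one 64-bit hash function per integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """MinHash signature: the minimum hash value under each seeded function.

    Assumes items is a non-empty set.
    """
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over the lazy dog")
c = shingles("locality-sensitive hashing collides similar inputs")

sig_a, sig_b, sig_c = (minhash_signature(s) for s in (a, b, c))

print("exact a~b:", jaccard(a, b), "estimated:", estimate_similarity(sig_a, sig_b))
print("exact a~c:", jaccard(a, c), "estimated:", estimate_similarity(sig_a, sig_c))
```

The point of the signature is that it has a fixed length no matter how large the shingle set is, and the fraction of matching positions approximates the Jaccard similarity. An actual LSH index would then split each signature into bands and bucket on them so that similar items tend to land in the same bucket. That banding step is left out here.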

Anyway, this is a spooky topic: the same techniques are useful for de-anonymizing large datasets.

Re^4: What is the best way to store and look-up a "similarity vector"? by isync (Hermit) on Nov 15, 2013 at 12:47 UTC
The LSH hint brought me to even more related keywords (noted here for the accidental passer-by):

On Wikipedia: Nearest_neighbor_search, Locality-sensitive_hashing, Hierarchical_clustering
On CPAN: Algorithm::Cluster, Algorithm::KMeans, Text::Bayon, Algorithm::LSH, Jubatus

And yes, it is a spooky topic ;)