in reply to Re^2: What is the best way to store and look-up a "similarity vector"?
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

Right, traditional hashing prefers to "avalanche" the bits, so that any small change in the input yields a completely different hash. The kind of hashing you need is locality-sensitive hashing (LSH), exactly as educated_foo pointed out. LSH does the opposite of normal hashing: it tries to make similar inputs collide, so that near neighbours land in the same bucket with high probability.
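To make the "similar inputs collide" idea concrete, here is a minimal sketch of one common LSH family for real-valued vectors, random-hyperplane hashing (sometimes called SimHash). It is not taken from the slides or from any CPAN module; the function names, the 64-bit signature length and the fixed seed are assumptions chosen for illustration. Each bit records which side of a random hyperplane the vector falls on, so vectors pointing in similar directions tend to share most bits.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals in dim dimensions (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming(sig_a, sig_b):
    """Number of differing bits; a small distance suggests a small angle between vectors."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

planes = make_hyperplanes(dim=5, n_bits=64)
v = [1.0, 2.0, 3.0, 4.0, 5.0]
near = [1.1, 2.0, 2.9, 4.0, 5.1]      # small perturbation of v
opposite = [-x for x in v]            # points the other way

sig_v = lsh_signature(v, planes)
sig_near = lsh_signature(near, planes)
sig_opposite = lsh_signature(opposite, planes)
```

For a lookup table you would typically cut the signature into short bands and use each band as a bucket key, so that candidates sharing any band get compared exactly. Note that this family approximates cosine similarity; it ignores vector length.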

Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc.
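For anyone who wants to see how those keywords fit together, here is a rough sketch: shingling turns a string into a set, Jaccard similarity compares two sets exactly, and a MinHash signature lets you estimate that similarity from short fixed-length lists. The helper names, the shingle width of 3 and the 64 salted BLAKE2b hashes are my own assumptions, not something from the slides.

```python
import hashlib

def shingles(text, k=3):
    """Set of all overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(seed, item):
    """Deterministic 64-bit hash of item; the seed is used as the BLAKE2b salt."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(items, n_hashes=64):
    """Minimum hash value per seed. Assumes items is a non-empty set of strings."""
    return [min(_hash(seed, it) for it in items) for seed in range(n_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over the lazy dog")
exact = jaccard(a, b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

With only 64 hashes the estimate is noisy, so expect `approx` to be near `exact` rather than equal to it. More hashes tighten the estimate at the cost of longer signatures.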

Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.


Re^4: What is the best way to store and look-up a "similarity vector"?
by isync (Hermit) on Nov 15, 2013 at 12:47 UTC
    The LSH hint led me to even more related keywords (noted here for the accidental passer-by):
    • on Wikipedia: Nearest_neighbor_search, Locality-sensitive_hashing, Hierarchical_clustering
    • on CPAN: Algorithm::Cluster, Algorithm::KMeans, Text::Bayon, Algorithm::LSH, Jubatus
    And yes, it is a spooky topic ;)