in reply to Re: What is the best way to store and look-up a "similarity vector"?
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

I think, I'm applying a human perspective on information retrieval here. I didn't even thought about something as exact as hashing.
Although I remember some digests, could've been TEA, actually behave like the data I was referring to. Hashes begin to differ more when the underlying data differs more. Like
"Foo a" would hash to "a1b2c3d4", and
"Foo b" would hash to "a1b2c3d5"
- they are similar to a certain degree. A query for "a1b2c3d*" in a similar as the proposed system would return both. But the example requires the properties to adhere to a certain order.

Back to the "human perspective": what I have in mind is something like "is this your car?", you begin to apply filters: right color, make, etc. But how sure can you be, ever? "Matching" in life is an approximatory process - question is: how do I implement that as an algorithm?
  • Comment on Re^2: What is the best way to store and look-up a "similarity vector"?

Replies are listed 'Best First'.
Re^3: What is the best way to store and look-up a "similarity vector"?
by educated_foo (Vicar) on Nov 14, 2013 at 20:15 UTC
    I've used locality-sensitive hashing before; that might be something like what you remember. As for what you actually want... if I knew how to do it, I'd probably be charging a lot of money for it.
Re^3: What is the best way to store and look-up a "similarity vector"?
by oiskuu (Hermit) on Nov 15, 2013 at 09:42 UTC
    Right, traditional hashing prefers to "avalanche" the bits, so that any small change will yield results far apart. The kind of hashing you need is locality-sensitive hashing, exactly as educated_foo pointed out. LSH does the opposite of normal hashing—it seeks to collide similar inputs.

    Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc.

    Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.

      The LSH hint brought me to even more related keywords (noted here for the accidental passer by):
      • on Wikipedia: Nearest_neighbor_search, Locality-sensitive_hashing, Hierarchical_clustering
      • on CPAN: Algorithm::Cluster, Algorithm::KMeans, Text::Bayon, Algorithm::LSH, Jubatus
      And yes, it is a spooky topic ;)