Re^2: What is the best way to store and look-up a "similarity vector"?

I think, I'm applying a human perspective on information retrieval here. I didn't even thought about something as exact as hashing.
Although I remember some digests, could've been TEA, actually behave like the data I was referring to. Hashes begin to differ more when the underlying data differs more. Like

"Foo a" would hash to "a1b2c3d4", and
"Foo b" would hash to "a1b2c3d5"

- they are similar to a certain degree. A query for "a1b2c3d*" in a similar as the proposed system would return both. But the example requires the properties to adhere to a certain order.

Back to the "human perspective": what I have in mind is something like "is this your car?", you begin to apply filters: right color, make, etc. But how sure can you be, ever? "Matching" in life is an approximatory process - question is: how do I implement that as an algorithm?

Comment on Re^2: What is the best way to store and look-up a "similarity vector"?

Replies are listed 'Best First'.
Re^3: What is the best way to store and look-up a "similarity vector"? by educated_foo (Vicar) on Nov 14, 2013 at 20:15 UTC
I've used locality-sensitive hashing before; that might be something like what you remember. As for what you actually want... if I knew how to do it, I'd probably be charging a lot of money for it.	[reply]
Re^3: What is the best way to store and look-up a "similarity vector"? by oiskuu (Hermit) on Nov 15, 2013 at 09:42 UTC
Right, traditional hashing prefers to "avalanche" the bits, so that any small change will yield results far apart. The kind of hashing you need is locality-sensitive hashing, exactly as educated_foo pointed out. LSH does the opposite of normal hashing—it seeks to collide similar inputs. Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc. Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.	[reply]
Re^4: What is the best way to store and look-up a "similarity vector"? by isync (Hermit) on Nov 15, 2013 at 12:47 UTC
The LSH hint brought me to even more related keywords (noted here for the accidental passer by): on Wikipedia: Nearest_neighbor_search, Locality-sensitive_hashing, Hierarchical_clustering on CPAN: Algorithm::Cluster, Algorithm::KMeans, Text::Bayon, Algorithm::LSH, Jubatus And yes, it is a spooky topic ;)	[reply]