Re: What is the best way to store and look-up a "similarity vector"?
by BrowserUk (Patriarch) on Nov 14, 2013 at 17:07 UTC
But is that how a project like Panopticlick calculates "similarity" across bit-vectors?
I've no idea about pana-pano, that thing, but ostensibly, it can be as simple as counting the number of matching bits (properties):
my $needle = getBits();
my $nBits  = unpack '%32b*', $needle;   # population count of the query's set bits

for my $straw ( @haystack ) {
    my $similarity = unpack '%32b*', $straw & $needle;   # bits in common
    printf "Percentage similarity %f\n", $similarity / $nBits * 100;
}
For parts/product-catalogue type applications, bit-mapped data records can be very effective and efficient, because their attributes tend to have a fixed (and limited) number of values. I.e. half a dozen colors; half a dozen sizes; 2 or 3 finishes; etc. That means each choice can be represented by a bit position in a vector. Selection can be extremely fast.
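For illustration, a minimal sketch of such a fixed layout (the attribute names and bit positions here are made up):
use strict; use warnings;

# Hypothetical fixed layout: one bit position per attribute choice.
my %bit = (
    'color:red'    => 0,  'color:grey'   => 1, 'color:blue' => 2,
    'size:S'       => 6,  'size:M'       => 7, 'size:L'     => 8,
    'finish:matte' => 12, 'finish:gloss' => 13,
);

# Build a bit string from a record's attribute choices.
sub asBits {
    my $bits = '';
    vec( $bits, $bit{ $_ }, 1 ) = 1 for @_;
    return $bits;
}

my $record = asBits( 'color:grey', 'size:M', 'finish:matte' );
my $wanted = asBits( 'color:grey', 'finish:matte' );

# Selection: every sought bit must be set in the record.
print "record matches\n" if ( $record & $wanted ) eq $wanted;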
Yes, totally agree. For example, where would file-systems be without bit-maps?!
My problem is that my data is more or less schema-less. I know about a certain number of properties that might show up but, similar to the "browser fingerprint" problem, I might receive more and more properties I haven't planned into my schema/bitmap/bit-vector layout. Having to know all varieties up front is then like an algorithm that requires the properties to be in a fixed order, so they can be aligned with each other.
I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout.
Use a two-step schema.
- Each attribute/value pair gets added to a single table with an auto-incrementing primary key when it is first seen.
Using your example above, 'color:grey', 'material:wood', & 'surface:rough' are your attribute/value pairs (A/V pairs).
The auto-incremented 'id' value becomes that A/V pair's bit position in the data records.
- The data records are variable length strings of bits.
As the number of A/V pairs increases, so does the length of (new) records, but the bit positions of existing A/V pairs retain their original meanings, so the schema does not need to change.
To select records, you build up a bit mask with bits set for the sought A/V pairs, and then AND it with each record. For similarity, count the resultant bits.
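A minimal sketch of that two-step schema, with an in-memory hash standing in for the A/V table (all the names here are illustrative):
use strict; use warnings;

my %avId;        # A/V pair -> bit position (the auto-incrementing table)
my $nextId = 0;

# Look up an A/V pair's bit position, assigning the next id on first sight.
sub avBit {
    my $av = shift;
    $avId{ $av } //= $nextId++;
    return $avId{ $av };
}

# Build a (variable-length) bit-string record from a list of A/V pairs.
sub makeRecord {
    my $bits = '';
    vec( $bits, avBit( $_ ), 1 ) = 1 for @_;
    return $bits;
}

my @records = (
    makeRecord( 'color:grey', 'material:wood', 'surface:rough' ),
    makeRecord( 'color:grey', 'material:steel' ),   # new A/V pair: schema grows
);

# Similarity: AND the query mask with each record and count the set bits.
my $mask  = makeRecord( 'color:grey', 'surface:rough' );
my $nBits = unpack '%32b*', $mask;

for my $rec ( @records ) {
    my $common = unpack '%32b*', $rec & $mask;
    printf "%.1f%% of sought A/V pairs present\n", $common / $nBits * 100;
}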
Re: What is the best way to store and look-up a "similarity vector"?
by educated_foo (Vicar) on Nov 14, 2013 at 16:31 UTC
Assuming you have a retrieval task where you need to know the "nearest neighbours"-set or drill down to find the "one exactly matching"-entry among a vast number of entries.
These are completely different problems. To find exactly matching things, you can just serialize and use hashing. To find "close" things, you need a measure of "closeness," and a much more complicated data structure.
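For the exact-match case, a sketch (Storable's canonical mode keeps the serialization stable regardless of hash-key order; the record data is made up):
use strict; use warnings;
use Storable qw( freeze );
use Digest::MD5 qw( md5_hex );

$Storable::canonical = 1;   # equal structures freeze to equal byte strings

my %index;   # digest -> record

sub addRecord {
    my $rec = shift;
    $index{ md5_hex( freeze( $rec ) ) } = $rec;
}

sub findExact {
    my $query = shift;
    return $index{ md5_hex( freeze( $query ) ) };   # O(1) exact lookup
}

addRecord( { color => 'grey', material => 'wood' } );
print "found\n" if findExact( { material => 'wood', color => 'grey' } );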
Speaking of bit-vectors: I don't know enough about the maturity of bit-vector related modules up on CPAN to judge if they are any good for production tasks.
vec and the bitwise string operators work pretty well.
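For instance, a trivial demonstration using nothing beyond core Perl:
my( $x, $y ) = ( '', '' );
vec( $x, $_, 1 ) = 1 for 0, 3, 7;       # set bits 0, 3 and 7 of $x
vec( $y, $_, 1 ) = 1 for 0, 3, 5;       # set bits 0, 3 and 5 of $y

my $common = unpack '%32b*', $x & $y;   # popcount of the string-wise AND
print "$common bits in common\n";       # prints "2 bits in common"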
I think I'm applying a human perspective to information retrieval here. I didn't even think about something as exact as hashing.
Although I remember some digests, could've been TEA, that actually behave the way I'm describing: the hashes begin to differ more as the underlying data differs more. Like
"Foo a" would hash to "a1b2c3d4", and
"Foo b" would hash to "a1b2c3d5"
- they are similar to a certain degree. A query for "a1b2c3d*" in a system similar to the proposed one would return both. But the example requires the properties to adhere to a certain order.
Back to the "human perspective": what I have in mind is something like "is this your car?", where you begin to apply filters: right color, make, etc. But how sure can you ever be? "Matching" in life is a process of approximation - the question is: how do I implement that as an algorithm?
I've used locality-sensitive hashing before; that might be something like what you remember. As for what you actually want... if I knew how to do it, I'd probably be charging a lot of money for it.
Right, traditional hashing prefers to "avalanche" the bits, so that any small change will yield results far apart. The kind of hashing you need is locality-sensitive hashing, exactly as educated_foo pointed out. LSH does the opposite of normal hashing: it seeks to collide similar inputs.
Located some dusty slides I remembered seeing (05-LSH). More keywords to research: Jaccard Similarity, MinHashing, Shingling, MinHash Signatures, etc.
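For the curious, a toy MinHash sketch along those lines; the shingle size, the number of hash functions, and the seeded-MD5 "hash family" are all arbitrary choices for the example:
use strict; use warnings;
use Digest::MD5 qw( md5 );

# k-shingles: the set of all length-k substrings of a string.
sub shingles {
    my( $str, $k ) = @_;
    my %set;
    $set{ substr $str, $_, $k } = 1 for 0 .. length( $str ) - $k;
    return keys %set;
}

# MinHash signature: for each of $n seeded hash functions, keep the
# minimum hash value over all shingles. Sets that share many shingles
# tend to share many of these minima.
sub signature {
    my( $n, @shingles ) = @_;
    return map {
        my $seed = $_;
        my $min;
        for my $sh ( @shingles ) {
            my $h = unpack 'N', md5( $seed . $sh );
            $min = $h if !defined $min or $h < $min;
        }
        $min;
    } 1 .. $n;
}

my @sigA = signature( 64, shingles( 'the quick brown fox', 4 ) );
my @sigB = signature( 64, shingles( 'the quick brown dog', 4 ) );

# The fraction of matching signature slots estimates the Jaccard similarity.
my $match = grep { $sigA[$_] == $sigB[$_] } 0 .. $#sigA;
printf "estimated Jaccard similarity: %.2f\n", $match / @sigA;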
Anyway, this is a spooky topic. These techniques are useful for de-anonymizing big data.
Re: What is the best way to store and look-up a "similarity vector"?
by isync (Hermit) on Nov 14, 2013 at 19:03 UTC
The previous sub-thread suggests that building bitmaps as record-"fingerprints" (binary patterns, based on a two-step/abstracting schema) and, on look-up, comparing them against a query-"fingerprint" might be an efficient solution, as it relies on bitwise operators and built-in (core::) functions. A low-level one, though. (Implementing this might be reinventing the wheel... search CPAN first.)
Stemming from that, and looking at the tools available: from what I know, inverted-index engines like Lucy, Plucene or Sphinx do exactly that, no? They look up postings/terms, assign an id, and store these ids with the help of bitmaps. On lookup, they do the aforementioned comparisons and apply boolean operations (intersection). So, would it be a sane setup to use a search-engine backend as an off-the-shelf bit-vector DB? The only problem I see is that applying too many filters would result in an empty set. I don't know whether any of the mentioned backends allows for something like a "cancel this property/filter and you'll get these additional results" sort of similarity option...
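For what it's worth, that "cancel a filter, get more results" behaviour falls out naturally if you rank by match count instead of demanding a full intersection. A sketch, assuming bit-string records and a query mask as in the earlier sub-thread (the toy data is made up):
use strict; use warnings;

# Toy bit-string records and a query mask, as per the earlier sub-thread.
my @records = ( "\x0B", "\x03", "\x01" );   # bits 0,1,3 / bits 0,1 / bit 0
my $mask    = "\x0B";                       # sought: bits 0, 1 and 3
my $wanted  = unpack '%32b*', $mask;        # number of filters applied

# Rank every record by how many of the sought bits it carries, so near
# misses surface instead of an empty result set.
my @ranked = sort { $b->[1] <=> $a->[1] }
             map  { [ $_, unpack '%32b*', $_ & $mask ] } @records;

printf "matches %d of %d filters\n", $_->[1], $wanted for @ranked;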