Re: What is the best way to store and look-up a "similarity vector"?

But is that how a project like Panopticlick calculates "similarity" across bit-vectors?

I've no idea about pana-pano, that thing, but ostensibly, it can be as simple as counting the number of matching bits (properties):

my $needle = getBits();
my $nBits = unpack '%32b*', $needle;

for my $straw ( @haystack ) {
    my $similarity = unpack '%32b*', $straw & needle;
    print "Percentage similarity %f\n", $similarity / $nBits * 100;
}
[download]

For parts/product catalogue type applications, bit-mapped data records can be very effective and efficient, because their attributes tend to have a fixed (and limited) number of values. Ie. Half a dozen colors; half a dozen sizes; 2 or 3 finishes; etc. That means each choice can be represented by a bit position in a vector. Selection can be extremely fast.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re: What is the best way to store and look-up a "similarity vector"? Download Code

Replies are listed 'Best First'.
Re^2: What is the best way to store and look-up a "similarity vector"? by isync (Hermit) on Nov 14, 2013 at 17:38 UTC
Yes, totally agree. For example, where would file-systems be without bit-maps?! My problem is that my data is more or less schema-less. I know about a certain amount of properties that might show up, but similar to the "Browser-fingerprint" problem, I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout. Knowing all varieties is then similar to an algorithm that requires me to have properties in a fixed order, to align them to each other.	[reply]
Re^3: What is the best way to store and look-up a "similarity vector"? by BrowserUk (Patriarch) on Nov 14, 2013 at 18:07 UTC
I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout. Use a two-step schema. Each attribute/value pair gets added to a single table with an auto-incrementing primary key when it is first seen. Using your example above, 'color:grey', 'material:wood', & 'surface:rough' are your attribute/value pairs (A/V pairs). The auto-incremented 'id' value becomes that A/V pair's bit position in the data records. The data records are variable length strings of bits. As the number of A/V pairs increase, so do the length of (new) records, but the bit positions of existing A/V pairs retain their original meanings, so the schema does not need to change. To select records, you build up a bit mask with bits set for the sought A/V pairs, and then AND it with each record. For similarity, count the resultant bits. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^4: What is the best way to store and look-up a "similarity vector"? by isync (Hermit) on Nov 14, 2013 at 18:54 UTC
Just in order to wrap my head around what you propose: Properties: id\|A/V-pairs ----------------- 0\|colour:red 1\|colour:green 2\|material:metal 3\|colour:blue 4\|material:wood 5\|surface:rough And then our thing db: things: bitmap\|thing 012345\|name ----------------- 101011\|red-metal-wood-rough-Thing 001101\|metal-blue-rough-Thing 01 \|green-Thing Right?? So far so good, now fetching records: sub getBits { # lookup: colour:red -> is id/bitposition:0 # lookup: material:wood -> is id/bitposition:4 } my $bits = getBits('red-wood'); # $bits is 10001 my $nBits = unpack '%32b', $bits; # http://docstore.mik.ua/orelly/per +l/prog/ch03_182.htm : "efficiently counts the number of set bits in a + bit vector" for my $straw ( @haystack ){ # loop over all records and compare my $similarity = unpack '%32b', $straw & needle; # compute a delt +a print "Percentage similarity %f\n", $similarity / $nBits * 100; # +delta in relation to nbits benchmark ("distance") } # then, sort by similarity [download] Questions: - Did I get that right? - Let's assume we've got millions of records, is that looping+comparing efficient? - Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons	[reply] [d/l]
Re^5: What is the best way to store and look-up a "similarity vector"? by BrowserUk (Patriarch) on Nov 14, 2013 at 19:50 UTC
Re^6: What is the best way to store and look-up a "similarity vector"? by isync (Hermit) on Nov 14, 2013 at 20:48 UTC