Just to wrap my head around what you propose:
Properties:
id|A/V-pairs
-----------------
0|colour:red
1|colour:green
2|material:metal
3|colour:blue
4|material:wood
5|surface:rough
And then our thing db:
things:
bitmap|thing
012345|name
-----------------
101011|red-metal-wood-rough-Thing
001101|metal-blue-rough-Thing
01 |green-Thing
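For what it's worth, here is a minimal sketch of how such bitmaps could be built with vec, assuming the property ids above are used directly as bit positions. The names %bit_of and make_bitmap are made up for illustration; they are not from your proposal.

```perl
use strict;
use warnings;

# Hypothetical lookup table: A/V pair -> bit position (the ids from the table above).
my %bit_of = (
    'colour:red'     => 0,
    'colour:green'   => 1,
    'material:metal' => 2,
    'colour:blue'    => 3,
    'material:wood'  => 4,
    'surface:rough'  => 5,
);

# Build a packed bit vector with one bit set per A/V pair.
sub make_bitmap {
    my @pairs  = @_;
    my $bitmap = '';
    for my $pair (@pairs) {
        die "unknown property '$pair'" unless exists $bit_of{$pair};
        vec( $bitmap, $bit_of{$pair}, 1 ) = 1;    # 1-bit elements, same order as unpack 'b*'
    }
    return $bitmap;
}

my $thing = make_bitmap( 'colour:red', 'material:metal', 'material:wood', 'surface:rough' );
print unpack( 'b6', $thing ), "\n";    # prints 101011
```

The printed string lists bit 0 first, so it lines up with the 012345 header in the table.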
Right? So far so good. Now, fetching records:
sub getBits {
    # lookup: colour:red    -> is id/bit position 0
    # lookup: material:wood -> is id/bit position 4
    # ... then return a packed bit vector with those bits set
}
my $bits  = getBits('red-wood');    # $bits is 10001 (as a packed bit vector)
# http://docstore.mik.ua/orelly/perl/prog/ch03_182.htm :
# "efficiently counts the number of set bits in a bit vector"
my $nBits = unpack '%32b*', $bits;
for my $straw ( @haystack ) {    # loop over all records and compare
    # number of needle bits that are also set in this record
    my $similarity = unpack '%32b*', $straw & $bits;
    # overlap in relation to the $nBits benchmark ("distance")
    printf "Percentage similarity %f\n", $similarity / $nBits * 100;
}
# then, sort by similarity
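To make the above concrete, here is a small self-contained sketch of the whole lookup, including the sort. It assumes the records are kept as packed bit vectors in a hash; %things, count_bits and similarity are names I made up for the example.

```perl
use strict;
use warnings;

# Records as packed bit vectors. pack 'b*' reads the string in ascending
# bit order, so the leftmost character is bit 0, as in the table above.
my %things = (
    'red-metal-wood-rough-Thing' => pack( 'b*', '101011' ),
    'metal-blue-rough-Thing'     => pack( 'b*', '001101' ),
    'green-Thing'                => pack( 'b*', '010000' ),
);

# Count the bits set in a packed bit vector.
sub count_bits { return unpack '%32b*', $_[0] }

# Percentage of the needle's bits that are also set in the straw.
sub similarity {
    my ( $needle, $straw ) = @_;
    my $n_bits = count_bits($needle);
    return 0 unless $n_bits;    # avoid dividing by zero for an empty needle
    return 100 * count_bits( $needle & $straw ) / $n_bits;
}

my $needle = pack 'b*', '10001';    # colour:red + material:wood

# Score every record, then sort by descending similarity (name breaks ties).
my %score;
$score{$_} = similarity( $needle, $things{$_} ) for keys %things;

for my $name ( sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score ) {
    printf "%5.1f%%  %s\n", $score{$name}, $name;
}
```

Note that both operands of & have to be strings here; if either one has been used as a number, Perl does a numeric AND instead of a bitwise string AND.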
Questions:
- Did I get that right?
- Let's assume we've got millions of records: is that looping and comparing efficient?
- Any suggestions for a storage backend to implement that? There might be one that offers bitmap comparisons.