in reply to Re^3: What is the best way to store and look-up a "similarity vector"?
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

Just in order to wrap my head around what you propose:
Properties:
id|A/V-pairs
-----------------
 0|colour:red
 1|colour:green
 2|material:metal
 3|colour:blue
 4|material:wood
 5|surface:rough
And then our thing db:
things:
bitmap|thing
012345|name
-----------------
101011|red-metal-wood-rough-Thing
001101|metal-blue-rough-Thing
01    |green-Thing

Right?? So far so good, now fetching records:
sub getBits { # lookup: colour:red -> is id/bitposition:0 # lookup: material:wood -> is id/bitposition:4 } my $bits = getBits('red-wood'); # $bits is 10001 my $nBits = unpack '%32b*', $bits; # http://docstore.mik.ua/orelly/per +l/prog/ch03_182.htm : "efficiently counts the number of set bits in a + bit vector" for my $straw ( @haystack ){ # loop over all records and compare my $similarity = unpack '%32b*', $straw & needle; # compute a delt +a print "Percentage similarity %f\n", $similarity / $nBits * 100; # +delta in relation to nbits benchmark ("distance") } # then, sort by similarity
Questions:
- Did I get that right?
- Let's assume we've got millions of records, is that looping+comparing efficient?
- Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons
  • Comment on Re^4: What is the best way to store and look-up a "similarity vector"?
  • Download Code

Replies are listed 'Best First'.
Re^5: What is the best way to store and look-up a "similarity vector"?
by BrowserUk (Patriarch) on Nov 14, 2013 at 19:50 UTC
    1. Did I get that right?

      Yes.

    2. Let's assume we've got millions of records, is that looping+comparing efficient?

      Yes. Very. By way of example, this code generates $RECS records with $BITS A/V pairs and compares a randomly generated selection set and counts those that have a greater than $TARGET 'similarity':

      #! perl -slw use strict; use Time::HiRes qw[ time ]; sub randNbits { pack 'C*', map rand(256), 1 .. ( shift() / 8 )} our $TARGET //= 0.75; # 75% match our $RECS //= 1e6; our $BITS //= 1024; my @recs; $recs[ $_ ] = randNbits( $BITS ) for 0 .. $RECS-1; my $select = randNbits( $BITS ); my $count = 0; my $start = time; for my $rec ( @recs ) { my $score = unpack( '%32b*', $select & $rec ) / $BITS; ++$count if $score >= $TARGET; } printf "Comparing $RECS records on $BITS attributes to discover $count + matches took: %.9f s/rec\n", ( time() - $start ) / $RECS;

      A few runs show that it scales extremely well:

      C:\test>1062617 -RECS=1e6 -BITS=64 -TARGET=0.33 Comparing 1e6 records on 64 attributes to discover 121722 matches took +: 0.000001229 s/rec C:\test>1062617 -RECS=1e6 -BITS=256 -TARGET=0.33 Comparing 1e6 records on 256 attributes to discover 1200 matches took: + 0.000000760 s/rec C:\test>1062617 -RECS=1e6 -BITS=1024 -TARGET=0.33 Comparing 1e6 records on 1024 attributes to discover 0 matches took: 0 +.000000936 s/rec

      No traditional SQL-based RDBS could come even close to achieving sub 1 microsecond per record fuzzy selection for 1000+ fields across 1 million records.

    3. Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons

      It really depends upon the way this will be used. If the application is a standalone, one-run-at-a-time (ie. non-web; non client-server) affair I might opt for something like BerkeleyDB for the A/V pairs to bit-position mapping; and a simple, binary flat file for the records.

      However, if this is going to be used via a web-based or other client-server archietecture where data concurrency becomes an issue, then I would probably look to an RDBMS that supports bitstrings. Eg. Postresql bitstrings & bit varying datatype.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      (Shaking my head in disbelief of your exhaustive help and impressive algorithm savviness...)
      BrowserUk, once again top discussion and help!

      Thank you very much. Both thumbs up!.