in reply to Re: What is the best way to store and look-up a "similarity vector"?
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

Yes, totally agree. For example, where would file-systems be without bit-maps?!

My problem is that my data is more or less schema-less. I know a certain number of properties that might show up, but, similar to the "browser fingerprint" problem, I might keep receiving properties I haven't planned into my schema/bitmap/bit-vector layout. Requiring knowledge of all varieties up front would be like an algorithm that needs the properties in a fixed order so they can be aligned with each other.
  • Comment on Re^2: What is the best way to store and look-up a "similarity vector"?

Replies are listed 'Best First'.
Re^3: What is the best way to store and look-up a "similarity vector"?
by BrowserUk (Patriarch) on Nov 14, 2013 at 18:07 UTC
    I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout.

    Use a two-step schema.

    1. Each attribute/value pair gets added to a single table with an auto-incrementing primary key when it is first seen.

      Using your example above, 'color:grey', 'material:wood', & 'surface:rough' are your attribute/value pairs (A/V pairs).

      The auto-incremented 'id' value becomes that A/V pair's bit position in the data records.

    2. The data records are variable length strings of bits.

      As the number of A/V pairs increases, so does the length of (new) records; but the bit positions of existing A/V pairs retain their original meanings, so the schema does not need to change.

    To select records, you build up a bit mask with bits set for the sought A/V pairs, and then AND it with each record. For similarity, count the resultant bits.
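    The two steps above can be sketched in a few lines of Perl. This is only an illustration of the scheme, not code from the post; the `%bitOf` hash and the `bitOf()`/`makeRec()` helpers are made-up names:

    ```perl
    use strict;
    use warnings;

    # Step 1: assign each new A/V pair the next free bit position on first sight.
    my %bitOf;      # e.g. 'colour:red' => 0, 'material:metal' => 1, ...
    my $nextId = 0;
    sub bitOf { my $av = shift; $bitOf{ $av } //= $nextId++ }

    # Step 2: a record is a variable-length bit string; set one bit per A/V pair.
    sub makeRec {
        my $rec = '';
        vec( $rec, bitOf( $_ ), 1 ) = 1 for @_;
        return $rec;
    }

    my @haystack = (
        makeRec( 'colour:red', 'material:metal', 'material:wood', 'surface:rough' ),
        makeRec( 'material:metal', 'colour:blue', 'surface:rough' ),
        makeRec( 'colour:green' ),
    );

    # Selection: build a mask for the sought pairs, AND it with each record,
    # and count the surviving bits with unpack '%32b*'.
    my $mask  = makeRec( 'colour:red', 'material:wood' );
    my $nBits = unpack '%32b*', $mask;
    for my $rec ( @haystack ) {
        my $hits = unpack '%32b*', $mask & $rec;
        printf "similarity: %.0f%%\n", 100 * $hits / $nBits;   # prints 100%, 0%, 0%
    }
    ```

    Note that because new A/V pairs only ever append bit positions, old records stay valid unchanged; Perl's `&` on strings simply truncates to the shorter operand, so short (old) records AND cleanly against longer (new) masks.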


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Just in order to wrap my head around what you propose:
      Properties:
      id|A/V-pairs
      -----------------
       0|colour:red
       1|colour:green
       2|material:metal
       3|colour:blue
       4|material:wood
       5|surface:rough
      
      And then our thing db:
      things:
      bitmap|thing
      012345|name
      -----------------
      101011|red-metal-wood-rough-Thing
      001101|metal-blue-rough-Thing
      01    |green-Thing
      

      Right? So far so good; now, fetching records:
      sub getBits {
          # lookup: colour:red    -> id/bit position 0
          # lookup: material:wood -> id/bit position 4
          ...
      }
      my $needle = getBits( 'red-wood' ); # bits 0 and 4 set: 10001
      # unpack '%32b*' "efficiently counts the number of set bits in a bit vector"
      # http://docstore.mik.ua/orelly/perl/prog/ch03_182.htm
      my $nBits = unpack '%32b*', $needle;
      for my $straw ( @haystack ) { # loop over all records and compare
          my $similarity = unpack '%32b*', $straw & $needle; # compute a delta
          printf "Percentage similarity %f\n", $similarity / $nBits * 100; # delta relative to $nBits ("distance")
      }
      # then, sort by similarity
      Questions:
      - Did I get that right?
      - Let's assume we've got millions of records, is that looping+comparing efficient?
      - Any suggestions for a storage backend to implement that? Maybe there's one that offers bitmap comparisons?
        1. Did I get that right?

          Yes.

        2. Let's assume we've got millions of records, is that looping+comparing efficient?

          Yes. Very. By way of example, this code generates $RECS records with $BITS A/V pairs each, compares a randomly generated selection set against them, and counts those that have at least a $TARGET 'similarity':

          #! perl -slw
          use strict;
          use Time::HiRes qw[ time ];

          sub randNbits { pack 'C*', map rand( 256 ), 1 .. ( shift() / 8 ) }

          our $TARGET //= 0.75;   # 75% match
          our $RECS   //= 1e6;
          our $BITS   //= 1024;

          my @recs;
          $recs[ $_ ] = randNbits( $BITS ) for 0 .. $RECS-1;

          my $select = randNbits( $BITS );

          my $count = 0;
          my $start = time;
          for my $rec ( @recs ) {
              my $score = unpack( '%32b*', $select & $rec ) / $BITS;
              ++$count if $score >= $TARGET;
          }
          printf "Comparing $RECS records on $BITS attributes to discover $count matches took: %.9f s/rec\n",
              ( time() - $start ) / $RECS;

          A few runs show that it scales extremely well:

          C:\test>1062617 -RECS=1e6 -BITS=64 -TARGET=0.33
          Comparing 1e6 records on 64 attributes to discover 121722 matches took: 0.000001229 s/rec

          C:\test>1062617 -RECS=1e6 -BITS=256 -TARGET=0.33
          Comparing 1e6 records on 256 attributes to discover 1200 matches took: 0.000000760 s/rec

          C:\test>1062617 -RECS=1e6 -BITS=1024 -TARGET=0.33
          Comparing 1e6 records on 1024 attributes to discover 0 matches took: 0.000000936 s/rec

          No traditional SQL-based RDBMS could come even close to achieving sub-microsecond-per-record fuzzy selection on 1000+ fields across 1 million records.

        3. Any suggestions for a storage backend to implement that? Maybe there's one that offers bitmap comparisons?

          It really depends upon the way this will be used. If the application is a standalone, one-run-at-a-time (i.e. non-web, non-client-server) affair, I might opt for something like BerkeleyDB for the A/V-pair-to-bit-position mapping, and a simple binary flat file for the records.

          However, if this is going to be used via a web-based or other client-server architecture, where data concurrency becomes an issue, then I would probably look to an RDBMS that supports bit strings, e.g. PostgreSQL's bit and bit varying datatypes.
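          For the flat-file option, here is a minimal sketch of how fixed-stride binary records could work. It assumes records are zero-padded to the current schema width so that record N can be found by seeking to N * stride; the `putRec`/`getRec` helpers and the one-byte demo records are illustrative, not from the post:

          ```perl
          use strict;
          use warnings;
          use Fcntl qw[ SEEK_SET ];

          my $BITS   = 64;            # current schema width in bits; grows as A/V pairs are added
          my $STRIDE = $BITS / 8;     # bytes per stored record

          # Append a record (a packed bit string), zero-padded to the stride.
          sub putRec {
              my( $fh, $rec ) = @_;
              $rec .= "\0" x ( $STRIDE - length $rec );
              print {$fh} $rec;
          }

          # Random access: record $n lives at byte offset $n * $STRIDE.
          sub getRec {
              my( $fh, $n ) = @_;
              seek $fh, $n * $STRIDE, SEEK_SET;
              read $fh, my $rec, $STRIDE;
              return $rec;
          }

          # Demo on an in-memory file; a real application would open a disk file.
          open my $fh, '+>', \my $store or die $!;
              putRec( $fh, "\x11" );        # record 0: bits 0 and 4 set
          putRec( $fh, "\xff\x01" );    # record 1: bits 0..8 set

          my $mask = "\x10";            # selecting on bit 4
          for my $n ( 0, 1 ) {
              printf "record %d: %d matching bit(s)\n", $n,
                  unpack '%32b*', $mask & getRec( $fh, $n );    # prints 1 for each record
          }
          ```

          Because the stride only ever grows when new A/V pairs appear, widening the schema amounts to rewriting the file once with a larger pad, while all existing bit positions keep their meanings.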

