in reply to Re: What is the best way to store and look-up a "similarity vector"?
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)?

Yes, totally agree. For example, where would file-systems be without bit-maps?!

My problem is that my data is more or less schema-less. I know a certain number of properties that might show up, but, similar to the "browser fingerprint" problem, I might keep receiving properties I haven't planned into my schema/bitmap/bit-vector layout. Requiring knowledge of all varieties up front would be like an algorithm that needs the properties in a fixed order so they can be aligned with each other.
  • Comment on Re^2: What is the best way to store and look-up a "similarity vector"?

Replies are listed 'Best First'.
Re^3: What is the best way to store and look-up a "similarity vector"?
by BrowserUk (Patriarch) on Nov 14, 2013 at 18:07 UTC
    I might receive more and more properties I haven't planned into my schema/bitmap/bitvector layout.

    Use a two-step schema.

    1. Each attribute/value pair gets added to a single table with an auto-incrementing primary key when it is first seen.

      Using your example above, 'color:grey', 'material:wood', & 'surface:rough' are your attribute/value pairs (A/V pairs).

      The auto-incremented 'id' value becomes that A/V pair's bit position in the data records.

    2. The data records are variable length strings of bits.

      As the number of A/V pairs increases, so does the length of (new) records; but the bit positions of existing A/V pairs retain their original meanings, so the schema does not need to change.

    To select records, you build up a bit mask with bits set for the sought A/V pairs, and then AND it with each record. For similarity, count the resultant bits.
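    The two steps above can be sketched in a few lines of Perl. This is only an illustration of the scheme, not code from the post; the `%bitOf` hash and the `bitOf()`/`makeRec()` helpers are made-up names:

    ```perl
    use strict;
    use warnings;

    # Step 1: assign each new A/V pair the next free bit position on first sight.
    my %bitOf;      # e.g. 'colour:red' => 0, 'material:metal' => 1, ...
    my $nextId = 0;
    sub bitOf { my $av = shift; $bitOf{ $av } //= $nextId++ }

    # Step 2: a record is a variable-length bit string; set one bit per A/V pair.
    sub makeRec {
        my $rec = '';
        vec( $rec, bitOf( $_ ), 1 ) = 1 for @_;
        return $rec;
    }

    my @haystack = (
        makeRec( 'colour:red', 'material:metal', 'material:wood', 'surface:rough' ),
        makeRec( 'material:metal', 'colour:blue', 'surface:rough' ),
        makeRec( 'colour:green' ),
    );

    # Selection: build a mask for the sought pairs, AND it with each record,
    # and count the surviving bits with unpack '%32b*'.
    my $mask  = makeRec( 'colour:red', 'material:wood' );
    my $nBits = unpack '%32b*', $mask;
    for my $rec ( @haystack ) {
        my $hits = unpack '%32b*', $mask & $rec;
        printf "similarity: %.0f%%\n", 100 * $hits / $nBits;   # prints 100%, 0%, 0%
    }
    ```

    Note that because new A/V pairs only ever append bit positions, old records stay valid unchanged; Perl's `&` on strings simply truncates to the shorter operand, so short (old) records AND cleanly against longer (new) masks.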


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Just in order to wrap my head around what you propose:
      Properties:
      id|A/V-pairs
      -----------------
       0|colour:red
       1|colour:green
       2|material:metal
       3|colour:blue
       4|material:wood
       5|surface:rough
      
      And then our thing db:
      things:
      bitmap|thing
      012345|name
      -----------------
      101011|red-metal-wood-rough-Thing
      001101|metal-blue-rough-Thing
      01    |green-Thing
      

      Right? So far so good; now, fetching records:
      sub getBits {
          # lookup: colour:red    -> id/bit position 0
          # lookup: material:wood -> id/bit position 4
          ...
      }
      my $needle = getBits( 'red-wood' ); # bits 0 and 4 set: 10001
      # unpack '%32b*' "efficiently counts the number of set bits in a bit vector"
      # http://docstore.mik.ua/orelly/perl/prog/ch03_182.htm
      my $nBits = unpack '%32b*', $needle;
      for my $straw ( @haystack ) { # loop over all records and compare
          my $similarity = unpack '%32b*', $straw & $needle; # compute a delta
          printf "Percentage similarity %f\n", $similarity / $nBits * 100; # delta relative to $nBits ("distance")
      }
      # then, sort by similarity
      Questions:
      - Did I get that right?
      - Let's assume we've got millions of records, is that looping+comparing efficient?
      - Any suggestions for a storage backend to implement that? Maybe there's one that offers bitmap comparisons?
        1. Did I get that right?

          Yes.

        2. Let's assume we've got millions of records, is that looping+comparing efficient?

          Yes. Very. By way of example, this code generates $RECS records with $BITS A/V pairs each, compares a randomly generated selection set against them, and counts those that have at least a $TARGET 'similarity':

          #! perl -slw
          use strict;
          use Time::HiRes qw[ time ];

          sub randNbits { pack 'C*', map rand( 256 ), 1 .. ( shift() / 8 ) }

          our $TARGET //= 0.75;   # 75% match
          our $RECS   //= 1e6;
          our $BITS   //= 1024;

          my @recs;
          $recs[ $_ ] = randNbits( $BITS ) for 0 .. $RECS-1;

          my $select = randNbits( $BITS );

          my $count = 0;
          my $start = time;
          for my $rec ( @recs ) {
              my $score = unpack( '%32b*', $select & $rec ) / $BITS;
              ++$count if $score >= $TARGET;
          }
          printf "Comparing $RECS records on $BITS attributes to discover $count matches took: %.9f s/rec\n",
              ( time() - $start ) / $RECS;

          A few runs show that it scales extremely well:

          C:\test>1062617 -RECS=1e6 -BITS=64 -TARGET=0.33
          Comparing 1e6 records on 64 attributes to discover 121722 matches took: 0.000001229 s/rec

          C:\test>1062617 -RECS=1e6 -BITS=256 -TARGET=0.33
          Comparing 1e6 records on 256 attributes to discover 1200 matches took: 0.000000760 s/rec

          C:\test>1062617 -RECS=1e6 -BITS=1024 -TARGET=0.33
          Comparing 1e6 records on 1024 attributes to discover 0 matches took: 0.000000936 s/rec

          No traditional SQL-based RDBMS could come even close to achieving sub-microsecond-per-record fuzzy selection on 1000+ fields across 1 million records.

        3. Any suggestions for a storage backend to implement that? Maybe there's one that offers bitmap comparisons?

          It really depends upon the way this will be used. If the application is a standalone, one-run-at-a-time (i.e. non-web, non-client-server) affair, I might opt for something like BerkeleyDB for the A/V-pair-to-bit-position mapping, and a simple binary flat file for the records.

          However, if this is going to be used via a web-based or other client-server architecture, where data concurrency becomes an issue, then I would probably look to an RDBMS that supports bit strings, e.g. PostgreSQL's bit and bit varying datatypes.
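          For the flat-file option, here is a minimal sketch of how fixed-stride binary records could work. It assumes records are zero-padded to the current schema width so that record N can be found by seeking to N * stride; the `putRec`/`getRec` helpers and the one-byte demo records are illustrative, not from the post:

          ```perl
          use strict;
          use warnings;
          use Fcntl qw[ SEEK_SET ];

          my $BITS   = 64;            # current schema width in bits; grows as A/V pairs are added
          my $STRIDE = $BITS / 8;     # bytes per stored record

          # Append a record (a packed bit string), zero-padded to the stride.
          sub putRec {
              my( $fh, $rec ) = @_;
              $rec .= "\0" x ( $STRIDE - length $rec );
              print {$fh} $rec;
          }

          # Random access: record $n lives at byte offset $n * $STRIDE.
          sub getRec {
              my( $fh, $n ) = @_;
              seek $fh, $n * $STRIDE, SEEK_SET;
              read $fh, my $rec, $STRIDE;
              return $rec;
          }

          # Demo on an in-memory file; a real application would open a disk file.
          open my $fh, '+>', \my $store or die $!;
              putRec( $fh, "\x11" );        # record 0: bits 0 and 4 set
          putRec( $fh, "\xff\x01" );    # record 1: bits 0..8 set

          my $mask = "\x10";            # selecting on bit 4
          for my $n ( 0, 1 ) {
              printf "record %d: %d matching bit(s)\n", $n,
                  unpack '%32b*', $mask & getRec( $fh, $n );    # prints 1 for each record
          }
          ```

          Because the stride only ever grows when new A/V pairs appear, widening the schema amounts to rewriting the file once with a larger pad, while all existing bit positions keep their meanings.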

