comment on

Did I get that right?

Yes.

Let's assume we've got millions of records, is that looping+comparing efficient?

Yes. Very. By way of example, this code generates $RECS records with $BITS A/V pairs and compares a randomly generated selection set and counts those that have a greater than $TARGET 'similarity':

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

sub randNbits {  pack 'C*', map rand(256), 1 .. ( shift() / 8 )}

our $TARGET //= 0.75; # 75% match
our $RECS //= 1e6;
our $BITS //= 1024;

my @recs;
$recs[ $_ ] = randNbits( $BITS ) for 0 .. $RECS-1;

my $select = randNbits( $BITS );
my $count = 0;
my $start = time;

for my $rec ( @recs ) {
    my $score = unpack( '%32b*', $select & $rec ) / $BITS;
    ++$count if $score >= $TARGET;
}

printf "Comparing $RECS records on $BITS attributes to discover $count
+ matches took: %.9f s/rec\n",
    ( time() - $start ) / $RECS;
[download]

A few runs show that it scales extremely well:

C:\test>1062617 -RECS=1e6 -BITS=64 -TARGET=0.33
Comparing 1e6 records on 64 attributes to discover 121722 matches took
+: 0.000001229 s/rec

C:\test>1062617 -RECS=1e6 -BITS=256 -TARGET=0.33
Comparing 1e6 records on 256 attributes to discover 1200 matches took:
+ 0.000000760 s/rec

C:\test>1062617 -RECS=1e6 -BITS=1024 -TARGET=0.33
Comparing 1e6 records on 1024 attributes to discover 0 matches took: 0
+.000000936 s/rec
[download]

No traditional SQL-based RDBS could come even close to achieving sub 1 microsecond per record fuzzy selection for 1000+ fields across 1 million records.

Any suggestions for a storage backend to implement that? Might be, there's one that offers bitmap-comparisons

It really depends upon the way this will be used. If the application is a standalone, one-run-at-a-time (ie. non-web; non client-server) affair I might opt for something like BerkeleyDB for the A/V pairs to bit-position mapping; and a simple, binary flat file for the records.
However, if this is going to be used via a web-based or other client-server archietecture where data concurrency becomes an issue, then I would probably look to an RDBMS that supports bitstrings. Eg. Postresql bitstrings & bit varying datatype.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^5: What is the best way to store and look-up a "similarity vector"? by BrowserUk
in thread What is the best way to store and look-up a "similarity vector" (correlating, similar, high-dimensional vectors)? by isync

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.