From the code, I think any string will do as a salt. All it does is pass the salt as the second argument to sha1(), which just appends it to the first argument before hashing. It talks about salts in the plural because, instead of using several different hashing algorithms, it gets its vector coordinates by reusing the same algorithm with a different addendum each time. Throw some arbitrary but consistent four-letter words at it and see what happens.
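
To make the salt idea concrete, here is a minimal sketch of how one SHA-1 routine plus several salts can yield several bit positions. It is not Bloom::Filter's actual code: the positions() helper, the salt words and the vector length are all made up for illustration, and it uses the core Digest::SHA module, whose sha1() joins its arguments into one string before hashing.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);    # core module; the real Bloom::Filter may load a different SHA-1 module

# Hypothetical helper, not part of Bloom::Filter: map one key to one
# bit position per salt, in a vector of $length bits.
sub positions {
    my ( $key, $length, @salts ) = @_;
    my @positions;
    for my $salt (@salts) {
        # sha1() joins its arguments, so this hashes "$key$salt"
        my $digest = sha1( $key, $salt );

        # treat the first 32 bits of the digest as an unsigned integer
        push @positions, unpack( 'N', $digest ) % $length;
    }
    return @positions;
}

my @salts = qw(salt pepp mint sage);    # arbitrary, but must stay the same between runs
my @pos   = positions( 'user@example.com', 1_000, @salts );
print "@pos\n";
```

The only requirement on the salts is consistency: the same words in the same order every time the filter is built or queried.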

More generally, it seems like a neat approach, but it will need at least two passes to really work. The second pass has to weed out the false positives by rescanning the data set with a simple hash-based tally restricted to the output of the first pass. That lets you choose the error rate according to the number of false positives you can afford to screen for: even 0.1% of 30 million keys would give you a fairly manageable hash of about 30,000 keys.
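
A rough sketch of the two passes, assuming the goal is to spot IDs that occur more than once. Everything here is illustrative: the tiny hand-rolled filter built on vec() stands in for Bloom::Filter, and the @ids list stands in for the real data set.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

my $bits   = 1 << 16;                    # deliberately tiny; a real filter would be far larger
my @salts  = qw(salt pepp mint sage);    # hypothetical salts
my $filter = '';
vec( $filter, $bits - 1, 1 ) = 0;        # pre-extend the bit vector to its full length

# One bit position per salt for a given key.
sub bit_positions {
    my ($key) = @_;
    return map { unpack( 'N', sha1( $key, $_ ) ) % $bits } @salts;
}

# True if every bit for this key was already set (probably seen before).
# Sets the bits either way, so the key counts as seen from now on.
sub seen_before {
    my ($key) = @_;
    my $seen = 1;
    for my $pos ( bit_positions($key) ) {
        $seen = 0 unless vec( $filter, $pos, 1 );
        vec( $filter, $pos, 1 ) = 1;
    }
    return $seen;
}

my @ids = qw(a17 b42 c99 a17 d03 b42 e55);    # stand-in for the real data set

# Pass 1: anything the filter claims to have seen becomes a candidate.
my %candidate;
for my $id (@ids) {
    $candidate{$id} = 0 if seen_before($id);
}

# Pass 2: exact tally, restricted to the candidates from pass 1.
for my $id (@ids) {
    $candidate{$id}++ if exists $candidate{$id};
}

# A false positive from pass 1 ends up with a tally of 1 and drops out here.
my @duplicates = grep { $candidate{$_} > 1 } sort keys %candidate;
print "@duplicates\n";
```

The hash in the second pass only ever holds the first pass's candidates, which is what keeps its memory use proportional to the false-positive rate rather than to the size of the data set.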

Incidentally, I'm no pack() leader, but if I were you I would think about rolling your own using only some bits of Bloom::Filter. It's biased towards working with email addresses, which are very heterogeneous, and there may well be regularities in your data set (they are IDs, after all) that make other algorithms much more efficient and discriminating than just applying and reapplying the same hashing routine.

update: according to the perl.com article, with 30 million keys and a 0.1% false-positive rate, you need 10 hash functions to get the most memory-efficient filter. For 0.01% (the 0.0001 passed in below) it's 13 functions and about a third more memory, and so on:

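use strict;
use warnings;

# 30 million keys at a 0.01% (0.0001) false-positive rate
my ( $m, $k ) = optimise( 30_000_000, 0.0001 );

# $m is the length of the filter's bit vector, so it is measured in bits
printf "memory usage = %.0f bits\n", $m;
print "hash functions = $k\n";

# Try each number of hash functions from 1 to 100, work out how long the
# bit vector must be to hold $num_keys keys at $error_rate, and return
# the smallest length together with the $k that produced it.
sub optimise {
    my ( $num_keys, $error_rate ) = @_;
    my $lowest_m;
    my $best_k = 1;

    foreach my $k ( 1 .. 100 ) {
        my $m = ( -1 * $k * $num_keys )
            / ( log( 1 - ( $error_rate**( 1 / $k ) ) ) );

        if ( !defined $lowest_m or ( $m < $lowest_m ) ) {
            $lowest_m = $m;
            $best_k   = $k;
        }
    }
    return ( $lowest_m, $best_k );
}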
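
As a cross-check on the search loop, the standard closed-form approximations for a Bloom filter (m = -n ln p / (ln 2)^2 and k = (m / n) ln 2) should land on nearly the same numbers. The closed_form() helper below is my own naming, not something from Bloom::Filter or the article, and it returns the ideal non-integer k rather than a usable count of hash functions.

```perl
use strict;
use warnings;

# Hypothetical helper: closed-form sizing for a Bloom filter.
# Returns the bit-vector length and the ideal (fractional) number of
# hash functions for $num_keys keys at false-positive rate $error_rate.
sub closed_form {
    my ( $num_keys, $error_rate ) = @_;
    my $ln2 = log(2);
    my $m   = -$num_keys * log($error_rate) / ( $ln2**2 );    # bits
    my $k   = ( $m / $num_keys ) * $ln2;                      # round to get a real count
    return ( $m, $k );
}

my ( $m, $k ) = closed_form( 30_000_000, 0.0001 );
printf "bits = %.0f (about %.0f MB), ideal k = %.2f\n", $m, $m / 8 / 1e6, $k;
```

For 30 million keys at 0.01% this comes out at roughly 575 million bits, about 72 MB, with an ideal k just over 13, which agrees with the 13 functions found by the loop.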

In reply to Re: Bloom::Filter Usage by thpfft
in thread Bloom::Filter Usage by jreades
