Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Bloom::Filter Usage

by thpfft (Chaplain)
on Apr 20, 2004 at 15:40 UTC ( [id://346679]=note: print w/replies, xml ) Need Help??


in reply to Bloom::Filter Usage

From the code, I think any string will do as a salt. All it's doing is passing it as the second argument to sha1(), and all that does is append it to the first argument before hashing. The reason it talks about salts plural is that instead of using different hashing algorithms, it gets its vector coordinates by reusing the same algorithm with a different addendum. Throw some arbitrary but consistent four letter words at it and see what happens.

More generally, it seems like a neat approach but will require at least two passes to really work. Your second pass will have to weed out the false positives by rescanning the data set with a simple hash-based tally restricted to the output of the first pass. In that case you can choose the accuracy based on the number of false positives you can afford to screen for, and even 0.1% would give you a fairly manageable hash of 30,000 keys.

Incidentally, I'm no pack() leader, but if I were you I would think about rolling your own using only some bits of Bloom::Filter. It's biased towards working with email addresses, which are very heterogeneous, and there may well be regularities in your data set (they are IDs, after all) that make other algorithms much more efficient and discriminatory than just applying and reapplying the same hashing routine.

update: according to the perl.com article, with 30million keys and 0.1% accuracy, you need 10 hash functions to get the most memory-efficient filter. For 0.01% it's 13 functions and 30% more memory use, and so on:

my ( $m, $k ) = optimise(30000000, 0.0001); print <<""; memory usage = $m hash functions = $k sub optimise { my ( $num_keys, $error_rate ) = @_; my $lowest_m; my $best_k = 1; foreach my $k ( 1..100 ) { my $m = (-1 * $k * $num_keys) / ( log( 1 - ($error_rate ** (1/$k)))); if ( !defined $lowest_m or ($m < $lowest_m) ) { $lowest_m = $m; $best_k = $k; } } return ( $lowest_m, $best_k ); }

Replies are listed 'Best First'.
Re: Re: Bloom::Filter Usage
by jreades (Friar) on Apr 20, 2004 at 16:12 UTC

    This is where things get very strange for me

    Here's the code I'm running as a test:

    #!/usr/bin/perl use strict; use Digest::SHA1 qw(sha1); use Bloom::Filter; my $filter = Bloom::Filter->new(error_rate => 0.0001, capacity => 100) +; my @salts; # None of these work # Option 1: # push @salts, "msis"; # Option 2: # for my $salt (0..3) { # push @salts, sub { sha1($salt,$_[0]) }; # } # Option 3: # for my $salt (0..3) { # push @salts, sha1($salt,$_[0]); # } $filter->set_salts(@salts); $filter->add("Foo"); $filter->add("Bar"); $filter->add("Baz"); print STDOUT ($filter->check("Bar") ? "Found" : "Not Found"), "\n"; print STDOUT ($filter->check("Bim") ? "Found" : "Not Found"), "\n"; exit;
    it looks like one of these should have worked, but when I run my test code I always get the following:

    jreades@sort:~>> ./test.pl Use of uninitialized value in numeric gt (>) at Bloom/Filter.pm line 1 +26. Argument "" isn't numeric in numeric eq (==) at Bloom/Filter.pm line 8 +6. Argument "@" isn't numeric in numeric eq (==) at Bloom/Filter.pm line +86. Found Argument "" isn't numeric in numeric eq (==) at Bloom/Filter.pm line 8 +6. Argument "" isn't numeric in numeric eq (==) at Bloom/Filter.pm line 8 +6. Found

    It's just possible that there's something wrong with the module itself, and I've emailed the author asking for any tips or tricks but I haven't heard back from him/her yet.

    And yes, I do think that the approach is pretty neat -- when someone suggested it and I did some reading it leapt out as a very low-cost way to perform a high-cost operation. And, as you said, even taking a fairly high false-positive rate of 0.1% you still end up with a tiny fraction of your original search space.

    I did notice the odd bias towards email addresses but figured it might not affect what I was trying to do. What algorithm would you suggest as an alternative for working with 12-digit numeric keys more efficiently?

    TIA

      On closer inspection, Bloom::Filter is a bit broken, unless I'm badly misreading it:

      * The add() method doesn't actually add anything, because the new() method doesn't initialise the $self->{contents} hashref. it needs to be changed so that either the add method assigns directly to $self->{contents}, or the new() method changes 'keys' to 'contents'. that's probably just a typo.

      * because $self->{contents} is not populated, _calculate_filter_length() always returns undef, so the default filter length is always used.

      * because $self->{contents} is not populated, build_filter doesn't build a filter anyway.

      * the use of == to compare bitstrings in check() is generating warnings, as you've seen. Its purpose is to test that ($a & $b) is the same as $a, ie that all the on bits in $a are also on in $b, and never mind the off bits. Someone better than me at pack(), ie anyone at all, will know how to test equivalence using the bitstrings themselves: meanwhile, you can make it work by testing with the unpacked versions.

      * And anyway, it's not incremental. Every time you add a new key, the whole filter is rebuilt by way of a loop on $self->{contents}. This makes it more or less useless for your purposes: you presumably want an iterative check($_) || add($_) mechanism that can be placed in the equivalent of a while(<HUGE_FILE>) loop. As it stands, Bloom::Filter will perform another loop across your whole dataset-so-far for each input line, which might slow things down a bit.

      You will need to roll your own, I think, unless the author can be persuaded to accommodate both approaches as well as fixing the bugs, but if you patch Bloom::Filter with this, at least your test script should work:

      --- Filter.orig.pm Tue Apr 20 20:01:37 2004 +++ Filter.pm Tue Apr 20 20:04:34 2004 @@ -25,7 +25,7 @@ min_length => 20, filter_length => 20, %params, - keys => {} + contents => {} }, $class; } @@ -52,7 +52,7 @@ my $list = $self->{contents}; foreach my $add ( @addresses ) { # convert email to SHA1 hash if necessary - $add = sha1_base64( $add ) if $add =~ /@/o; + #$add = sha1_base64( $add ) if $add =~ /@/o; $list->{$add}++; } $self->{reindex_flag} = 1; @@ -83,7 +83,11 @@ # A match occurs if every bit we check is on foreach my $key ( @keys ) { my $mask = $self->_make_bitmask( $key ); - push @result, ($mask == ( $mask & $self->{filter} )); + #push @result, ($mask == ( $mask & $self->{filter} )); + + my $m = unpack("b*", $mask); + push @result, ($m == unpack("b*", ($mask & $self->{filter}))) +; + } return ( wantarray() ? @result : $result[0] ); }
        I am the module author...
        Sorry - looks like an update I thought I had made to CPAN before publishing the article never got through. There should be a v0.02 now on CPAN (same as the code in Listing 1 linked from the article) that has the bugs fixed.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://346679]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2024-03-29 09:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found