in reply to Bloom Filter or other mehod to store URL's?
If you use the (binary) md5 of the URL as the hash key and don't assign anything as the value (i.e. use undef $hash{ md5( $url ) }; to autovivify the key), then storing 10 million URLs will require around 1 GB of RAM.
If you preallocate enough buckets (keys %hash = 2**24;), it runs pretty quickly too.
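A minimal sketch of the approach, assuming the URLs are plain byte strings (md5() croaks on wide characters, so encode them first if they are not). The names new_url_set and seen_before are made up for illustration; only Digest::MD5's md5() and the keys-as-lvalue preallocation come from the description above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: create an empty set with preallocated buckets.
sub new_url_set {
    my ($buckets) = @_;
    my %set;
    keys(%set) = $buckets if $buckets;   # preallocate hash buckets
    return \%set;
}

# Hypothetical helper: returns 1 if the URL was already stored,
# otherwise stores it and returns 0.
sub seen_before {
    my ($set, $url) = @_;
    my $key = md5($url);                 # 16-byte binary digest
    return 1 if exists $set->{$key};
    undef $set->{$key};                  # create the key with no value
    return 0;
}
```

For the 10 million URL case you would call new_url_set(2**24) once and then seen_before($set, $url) for each URL. Checking with exists matters here, because every stored value is undef.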
There is a rare possibility of a false positive if two URLs hash to the same md5, but the chance is lower than with a Bloom filter, and if you are using md5 for your Bloom::Filter solution, that would be possible anyway.
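To put a rough number on "rare": assuming md5 output behaves like a uniformly random 128-bit value (an assumption that holds for ordinary URLs, not for inputs deliberately crafted to collide), the usual birthday approximation for n = 10 million stored URLs gives:

```latex
P(\text{any two of } n \text{ URLs collide}) \approx \frac{n^2}{2 \cdot 2^{128}}
  = \frac{(10^7)^2}{2^{129}} \approx 1.5 \times 10^{-25}
```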