I have a large file (30 million records) containing a two-field key that is supposed to be unique. Unfortunately, it isn't necessarily that way... let me try to explain:

The file comes to me with these fields (amongst others):
- Account Number
- Account Open Date
- Account Close Date
Account numbers can be reused, but only if the account has been closed. Unless you have data issues. Which I do. :(
So I need a quick way to check while processing (amongst up to 30 million unique keys) whether I have seen this key before (i.e. whether it's a dupe that needs checking).
The normal hash lookup method $cache{$account} works, but with ever-decreasing performance and an ever-increasing memory profile (it got up to 1.4GB of memory usage).
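A minimal sketch of that hash-based check, for reference. It assumes the two-field key is the account number plus the open date (the post does not spell this out), and the helper name seen_before and the sample records are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if this (account, open date) pair has
# been seen before, 0 otherwise, and records the pair either way.
my %cache;

sub seen_before {
    my ($account, $open_date) = @_;
    # Join with a separator that should never appear in the data, so
    # that ("12", "3") and ("1", "23") cannot collapse into the same key.
    my $key = join "\x1F", $account, $open_date;
    return $cache{$key}++ ? 1 : 0;
}

# Made-up sample records: [account number, open date]
my @records = (
    [ '1001', '2001-01-15' ],
    [ '1002', '2003-06-01' ],
    [ '1001', '2004-02-20' ],   # reused account number, new open date
    [ '1001', '2001-01-15' ],   # exact repeat of the first record
);

my @flags = map { seen_before(@$_) } @records;
print "@flags\n";   # prints "0 0 0 1"
```

Every distinct key stays in %cache for the whole run, which is where the memory growth comes from.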
So a Perl monger suggested looking at Bloom filters; there is an introduction to them on Perl.com.

I've found the Bloom::Filter module on CPAN but can't get it to work, and I'm also worried about the level of false positives I'm facing.

My current code is:

my $bloom_filter = Bloom::Filter->new(error_rate => 0.001, capacity => 30000000);

if ($bloom_filter->check($account_number)) {
     ... do deduping ...
} else {
     $bloom_filter->add($account_number);
     ... do something ...
}

I'm looking for wisdom on two fronts:

- My current code just warns of a lack of salts for the filter, but I can't determine for the life of me what this module is looking for in terms of a salt. The documentation is... minimal... and everything I've tried adding causes my script to malfunction in spectacular ways.
- The false-positive rate falls as the filter's memory footprint grows, but I am having trouble with the calculations needed to find the right tradeoff.
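On the second point, the tradeoff can be estimated with the standard Bloom filter sizing formulas. For n keys and a target false-positive rate p, the optimal number of bits is m = -n * ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m / n) * ln 2. The sketch below just evaluates those formulas. The function name bloom_size is made up for illustration, and the figures are theoretical minimums for the bit vector, not a measurement of what Bloom::Filter itself allocates:

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Standard Bloom filter sizing (textbook formulas, not taken from
# Bloom::Filter's source):
#   bits   m = -n * ln(p) / (ln 2)^2
#   hashes k = (m / n) * ln 2, rounded to the nearest integer
sub bloom_size {
    my ($n, $p) = @_;
    my $m = ceil(-$n * log($p) / (log(2) ** 2));
    my $k = int(($m / $n) * log(2) + 0.5);
    return ($m, $k);
}

# Compare a few candidate error rates for 30 million keys.
for my $p (0.01, 0.001, 0.0001) {
    my ($m, $k) = bloom_size(30_000_000, $p);
    printf "p=%-7g bits=%d (%.1f MB) hashes=%d\n",
        $p, $m, $m / 8 / 1_000_000, $k;
}
```

For 30 million keys at an error rate of 0.001 this works out to roughly 431 million bits (about 54 MB) and 10 hash functions. At that rate about one lookup in a thousand would report an unseen key as a possible duplicate, so each hit still needs an exact check before it is treated as a real dupe.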

Thanks

In reply to Bloom::Filter Usage by jreades