
You might consider attacking the problem from the other end. That is, develop a list of possible duplicate accounts and then check the complete list against it. I assume that you need to perform some processing on each record before inserting it even if it is not a duplicate, and probably some other processing if it might be a duplicate. Thus, most of this program's time will be spent processing the records. Using some additional parsing tools might cost a little time up front but simplify the program considerably.

For example, you could create a new file, possibledups.txt, by parsing the original file with awk to extract the account number and the open and close dates. Pipe the result through sort and uniq to get a (much smaller) list of possible duplicate accounts. Something like ...

gawk [some parse code] | sort | uniq -d > possibledups.txt
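
To make the pipeline concrete, here is a minimal sketch on a tiny made-up file. The file name master.txt, the pipe-delimited layout, and the field positions (account number, open date, close date in fields 1 to 3) are all assumptions for illustration; the real parse code depends on your record format. It uses plain awk, which gawk also accepts.

```shell
# Hypothetical sample data: account|open_date|close_date|other fields
cat > master.txt <<'EOF'
1001|2001-01-15|2003-06-30|alice
1002|2002-03-01|2004-01-10|bob
1001|2001-01-15|2003-06-30|alice again
1003|2000-07-04|2002-12-31|carol
1002|2002-03-01|2005-05-05|bob reopened
EOF

# Extract the key fields, sort so identical keys become adjacent,
# then let uniq -d keep one copy of each key that occurs more than once.
awk -F'|' '{ print $1 "|" $2 "|" $3 }' master.txt | sort | uniq -d > possibledups.txt

cat possibledups.txt
```

Only the key for account 1001 ends up in possibledups.txt here, because the two 1002 records have different close dates and so produce different keys.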

The processing script can then read this duplicate file into a hash first. Then, as the script reads each record from the master file, it can check that record's key against the hash of possible dups. That way, your code spends most of its processing effort on known unique records (which is probably simpler and faster than working on dups). In my experience, this approach often simplifies coding, since the exception condition (duplicates) can be moved off to another subroutine.
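
The lookup step can be sketched as follows. This is not the processing script itself: it uses an awk associative array as a stand-in for the hash, and the file names, the pipe-delimited layout, and the DUP/UNIQUE labels are assumptions for illustration. In a real script the two branches would call your unique-record and duplicate-handling routines.

```shell
# Hypothetical inputs: the master file and the list of possible duplicate keys
cat > master.txt <<'EOF'
1001|2001-01-15|2003-06-30|alice
1002|2002-03-01|2004-01-10|bob
1001|2001-01-15|2003-06-30|alice again
1003|2000-07-04|2002-12-31|carol
1002|2002-03-01|2005-05-05|bob reopened
EOF
printf '%s\n' '1001|2001-01-15|2003-06-30' > possibledups.txt

awk -F'|' '
    # First file: load every possible-duplicate key into the hash.
    FILENAME == ARGV[1] { dups[$0] = 1; next }
    # Master file: rebuild the key and route the record by hash lookup.
    {
        key = $1 "|" $2 "|" $3
        if (key in dups) print "DUP    " $0   # exception path
        else             print "UNIQUE " $0   # common, simple path
    }
' possibledups.txt master.txt > classified.txt

cat classified.txt
```

Testing FILENAME against ARGV[1] rather than the more common NR == FNR idiom keeps the sketch correct even when possibledups.txt is empty.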

P.S. As a test, I generated a random file containing 30 million records and processed it using this pipe set in about 9 minutes (mileage may vary).

PJ

In reply to Re: Bloom::Filter Usage by periapt
in thread Bloom::Filter Usage by jreades
