
My most respected monks,
I request your insight on how to parse and manipulate large arrays more efficiently, specifically arrays created from files. Please allow me to present an example:
I have two or more files (file1.txt, file2.txt, etc.), each of which contains multiple records like this:

SERIAL NUMBER 1000
{
Name John Doe
Phone 555-5555
}

SERIAL NUMBER 1001
{
Name Jane Doe
Phone 555-5555
}

Let us suppose that aside from the Name and Phone, there are perhaps a dozen other 'fields' in each 'record', and that there are up to 10,000 records in each file. Each file is therefore anywhere from 1MB to 1.5MB in size, perhaps larger. My goal is the following:

Merge together all records (a split(/\n{2}/, $filecontents) would give us the records for a given file).
If a duplicate serial number occurs, increment the serial number until it is unique amongst all other serial numbers.
Discard records for which an identical phone number already exists.

I've created a script which does this, but with large files, it has become unbelievably slow (though not CPU intensive). I need to make it faster and more efficient (usually those two go hand-in-hand). Right now my script does the following:

1. Read a file into a scalar
2. Split the scalar into an array using split(/\n{2}/, $scalar)
3. Iterate over each record in the array, creating a hash whose key is the unique serial number and whose value is the current record in the array
4. If $hash{serial} already exists, increment the serial in the current record until it is unique, then create a new key/value pair with the record
5. Determine whether the phone number in the current record already exists in the values of the hash; if so, discard the record and continue to the next record
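
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: the helper name merge_records and the field regexes are hypothetical, the phone check is done before the record is stored, and the file contents are passed in as strings rather than read from disk.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the full text of each file and returns a
# hash reference of serial number => record text.
sub merge_records {
    my @file_contents = @_;
    my %records;

    for my $contents (@file_contents) {
        # Records are separated by exactly one blank line.
        for my $record (split /\n{2}/, $contents) {
            # Skip anything that does not look like a record.
            my ($serial) = $record =~ /^SERIAL NUMBER (\d+)/m or next;
            my ($phone)  = $record =~ /^Phone\s+(.+?)\s*$/m;

            # The costly part: grep scans every stored record each time.
            if (defined $phone) {
                next if grep { /^Phone\s+\Q$phone\E\s*$/m } values %records;
            }

            # Bump the serial until it does not collide with a stored one,
            # then write the new serial back into the record text.
            $serial++ while exists $records{$serial};
            $record =~ s/^SERIAL NUMBER \d+/SERIAL NUMBER $serial/m;
            $records{$serial} = $record;
        }
    }
    return \%records;
}
```

Because the grep runs once per record over all records stored so far, the work grows roughly with the square of the record count, which would fit the slowdown described.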

Step 5 above is currently being done by grepping the values of the hash for the phone number, though I know that's rather wasteful (creating a temp array to be thrown away). What I'm wrestling with is the concept of holding such a large amount of memory hostage (which is expensive) so that I can constantly check that I'm creating a new, unique hash key (and not clobbering an old value), and then grepping the values of the hash for the phone numbers to make sure I don't create another key/value pair containing the same phone. All of this seems quite expensive, and I'm sure that the most wise monks will have some insight. Thanks in advance.
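
One possible way to avoid the grep, sketched here as an idea rather than something the script does today, is to keep a second hash keyed on phone number so that each duplicate check is a single lookup. The helper name unique_by_phone and the Phone regex are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the records whose phone number has
# not been seen before, using a hash lookup instead of grep.
sub unique_by_phone {
    my @records = @_;
    my %seen_phone;    # phone number => count of records seen with it
    my @kept;

    for my $record (@records) {
        my ($phone) = $record =~ /^Phone\s+(.+?)\s*$/m;

        # Records without a Phone line are kept as-is (an assumption).
        if (defined $phone) {
            next if $seen_phone{$phone}++;
        }
        push @kept, $record;
    }
    return @kept;
}
```

The extra hash holds only the phone strings, so its memory cost should be small next to the records themselves.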
--Kozz

In reply to Efficiency and Large Arrays by Kozz
