comment on

The size of your hashes in memory is going to be your limiting factor.

You say each of your files is approx. 2GB in size, and each element is "several lines". You certainly won't be able to store a hash containing all the data in both files in memory on most normal PCs that typically have a 2GB/process ram limit.

If your paths are an average of say 5 lines of 80 chars in length, then that gives approximately 5 million paths in each file. And if the keys to your paths were say 10 chars in length, then you could represent each file in memory with a hash that has 5 million 10-char keys and a single integer value--the byte offset into the file. That approximates to 250MB per file, which would theoretically allow you to index both your huge files in memory simultaneously, if the machine this is going to run on has a full complement of ram.

However, if your paths are substantially less in average length (giving more paths) or your keys are substantially longer, then memory requirement for each hash grows and you start to move to the point where you are pushing the limits of what is possible.

Your alternative in that case is to look at further reducing the memory requirements of storing the index, which basically means getting into using (or writing) some kind of less memory hungry alternative to hashes--Tie::SubstrHash comes with (most) distributions of perl, and whilst it's a bit awkward to use, and substantially slower than a real hash, it does reduce the memory requirements markedly.

Or, you could load your data into some flavour of DB.

A Berkely DB might fit the bill, but it requires moving your datafiles into the Berkeley format. This is a fairly slow process, and the resultant files will be much larger than the equivalent flats files. Once transfered, and with the optimal configuration, the actual lookups during processing will be quite fast say 4x to 10x slower than a normal hash.

Or you could use a general purpose RDBMS. Again the initial loading of the flatfiles into the DB will require a fair amount of time, and the diskspace requirement climbs substantially, and as each lookup will require a separate SQL statement with it's communications overhead, so the processing stage will be substantially slower than the Berkeley option.

Which approach is best for your situation really depends on how often you would need to do the processing, how often you would need to reload the DB, and whether you could change the process that produces the two files to write the data directly to one form of DB or the other.

You haven't provided any clues to the numbers of records (paths), or their sizes, or the key lengths, or the frequencies of the tasks, so the best that anyone can do is give you a very general reply.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco.

Rule 1 has a caveat! -- Who broke the cabal?

In reply to Re^3: huge file->hash by BrowserUk
in thread huge file->hash by ISAI student

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.