Okay, live-ish results:
Summary:
Details:
So it takes a little over 500MB on my machine. This is a 64-bit system, so if my version of Perl does 64-bit integers, this should be fairly indicative. If it doesn't, the figure might need to double, leaving you with, potentially, a 1GB commit using this technique.
If 100MB was just a wild guess, then I would hope you wouldn't balk at using half a GB of RAM to sort a 20GB file with no intermediate disk space.
Alternatively, you could write the key/key/offset/length values to a file, which would be closer to 120MB; sort that, then read it back in to drive the binary random I/O pass. That just replaces 522MB of RAM with a 120MB file plus an external sort routine.
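To make that concrete, here's a rough Perl sketch of the index-file variant. The record layout (tab-delimited, sort key in the first field), the file names, and the use of sort(1) as the external sort are all my assumptions, not anything from the thread; the post above mentions two keys, but one is enough to show the shape of it.

#!/usr/bin/perl
# Sketch only. Assumptions (not from the original post): tab-delimited
# records, sort key is the first field, single-byte encoding, and the
# file names big.dat / big.idx are hypothetical.
use strict;
use warnings;

my $src = 'big.dat';    # the large source file
my $idx = 'big.idx';    # small index file: key <TAB> offset <TAB> length

# Pass 1: record key/offset/length for every line; no data lines are cached.
open my $in,  '<', $src or die "open $src: $!";
open my $out, '>', $idx or die "open $idx: $!";
while (1) {
    my $offset = tell $in;          # byte offset of the next line
    my $line   = <$in>;
    last unless defined $line;
    my ($key) = split /\t/, $line, 2;
    print {$out} join("\t", $key, $offset, length $line), "\n";
}
close $in;
close $out;

# External sort of the ~120MB index -- system sort(1) is one readily
# available external sort routine.
system('sort', '-t', "\t", '-k1,1', '-o', "$idx.sorted", $idx) == 0
    or die "external sort failed: $?";

# Pass 2: walk the sorted index and seek/read each line from the source,
# emitting the data lines in sorted order.
open my $sidx, '<', "$idx.sorted" or die "open $idx.sorted: $!";
open my $data, '<', $src          or die "open $src: $!";
while (my $rec = <$sidx>) {
    chomp $rec;
    my (undef, $off, $len) = split /\t/, $rec;
    seek $data, $off, 0 or die "seek: $!";
    read $data, my $buf, $len;
    print $buf;
}
close $sidx;
close $data;

The trade-off is exactly as described above: pass 1 and the external sort touch only the small index, and the data lines themselves are read exactly once each, in sorted order, via seek.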
The time estimate is not indicative, since I had no 5MB lines to read, and I/O is probably still the slowest part of the process (though I haven't really kept up on industry tech specs; I'm going on assumption here).
So there you have it. I didn't cache any of the data lines (that was kind of the point of this approach), so this should be a fair representation of the space consumed for 5M lines, since the size of the lines doesn't matter.
If you really have to keep it to 100MB, the 43 x 43 = 1849 passes through the source file might be your best bet. It will be slow, but effective.
One other possibility, if you really want to over-engineer this thing, is to segment the work. One approach along those lines has already been suggested, but given the low memory usage for the hash and array data, you could consider writing just the hash out to intermediate files to be gang-sorted with some kind of segmented, iterative merge.
That would be fun, but almost certainly not worth the effort.
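If anyone does want to poke at it anyway, a very rough sketch might look like the following; the chunk size, temp-file names, and single-field key are all my assumptions, and a real version would want a proper heap (or just sort(1) again) for the merge.

#!/usr/bin/perl
# Rough sketch of the segmented "gang-sort": sort the index in chunks that
# fit in memory, then merge the sorted chunks. Names and sizes here are
# made up for illustration, not taken from the thread.
use strict;
use warnings;

my $idx        = 'big.idx';     # unsorted key/offset/length index from pass 1
my $chunk_size = 1_000_000;     # index records held in memory at a time

# Phase 1: sort fixed-size chunks of the index, one temp file per chunk.
open my $in, '<', $idx or die "open $idx: $!";
my (@chunks, @buf);
my $n = 0;
while (my $rec = <$in>) {
    push @buf, $rec;
    if (@buf == $chunk_size || eof $in) {
        my $file = sprintf 'idx-chunk-%03d.tmp', $n++;
        open my $out, '>', $file or die "open $file: $!";
        print {$out} sort @buf;   # whole-record sort ~= sort on leading key
        close $out;
        push @chunks, $file;
        @buf = ();
    }
}
close $in;

# Phase 2: simple k-way merge of the sorted chunks (a linear scan of the
# current heads is fine for a few dozen files).
my @fh   = map { open my $h, '<', $_ or die "open $_: $!"; $h } @chunks;
my @head = map { scalar readline $_ } @fh;
while (grep { defined } @head) {
    my ($min) = sort { $head[$a] cmp $head[$b] }
                grep { defined $head[$_] } 0 .. $#head;
    print $head[$min];                  # or feed it straight to the seek/read pass
    $head[$min] = readline $fh[$min];   # advance the stream we just consumed
}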
Have fun!