Okay, live-ish results:
Summary:
Details:
So it takes a little over 500MB on my machine. This is a 64-bit system, so if my version of Perl does 64-bit integers, this should be fairly indicative. If it doesn't, the figure might need to double, leaving you with, potentially, a 1GB commit using this technique.
If 100MB was just a wild guess, then I would hope you wouldn't balk at using half a GB of RAM to sort a 20GB file with no intermediate disk space.
Alternatively, you could write the key/key/offset/length values to a file, which would be closer to 120MB; sort that, then read it back in to drive the binary random I/O pass. That just replaces 522MB of RAM with a 120MB file plus an external sort routine.
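To make that concrete, here's a rough Perl sketch of the index-file variant. The record layout (tab-delimited, sort key in the first field), the file names, and the use of sort(1) as the external sort are all my assumptions, not anything from the thread; the post above mentions two keys, but one is enough to show the shape of it.

#!/usr/bin/perl
# Sketch only. Assumptions (not from the original post): tab-delimited
# records, sort key is the first field, single-byte encoding, and the
# file names big.dat / big.idx are hypothetical.
use strict;
use warnings;

my $src = 'big.dat';    # the large source file
my $idx = 'big.idx';    # small index file: key <TAB> offset <TAB> length

# Pass 1: record key/offset/length for every line; no data lines are cached.
open my $in,  '<', $src or die "open $src: $!";
open my $out, '>', $idx or die "open $idx: $!";
while (1) {
    my $offset = tell $in;          # byte offset of the next line
    my $line   = <$in>;
    last unless defined $line;
    my ($key) = split /\t/, $line, 2;
    print {$out} join("\t", $key, $offset, length $line), "\n";
}
close $in;
close $out;

# External sort of the ~120MB index -- system sort(1) is one readily
# available external sort routine.
system('sort', '-t', "\t", '-k1,1', '-o', "$idx.sorted", $idx) == 0
    or die "external sort failed: $?";

# Pass 2: walk the sorted index and seek/read each line from the source,
# emitting the data lines in sorted order.
open my $sidx, '<', "$idx.sorted" or die "open $idx.sorted: $!";
open my $data, '<', $src          or die "open $src: $!";
while (my $rec = <$sidx>) {
    chomp $rec;
    my (undef, $off, $len) = split /\t/, $rec;
    seek $data, $off, 0 or die "seek: $!";
    read $data, my $buf, $len;
    print $buf;
}
close $sidx;
close $data;

The trade-off is exactly as described above: pass 1 and the external sort touch only the small index, and the data lines themselves are read exactly once each, in sorted order, via seek.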
The time estimate is not indicative, since I had no 5MB lines to read, and I/O is probably still the slowest part of the process (though I haven't really kept up on industry tech specs; I'm going on assumption here).
So there you have it. I didn't cache any of the data lines (that was kind of the point of this approach), so this should be a fair representation of the space consumed for 5M lines, since the size of the lines doesn't matter.
If you really have to keep it to 100MB, the 43 x 43 = 1849 passes through the source file might be your best bet. It will be slow, but effective.
One other possibility, if you really want to over-engineer this thing, is to segment the work. One approach along those lines has already been suggested, but given the low memory usage for the hash and array data, you could consider writing just the hash out to intermediate files to be gang-sorted with some kind of segmented, iterative merge.
That would be fun, but almost certainly not worth the effort.
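If anyone does want to poke at it anyway, a very rough sketch might look like the following; the chunk size, temp-file names, and single-field key are all my assumptions, and a real version would want a proper heap (or just sort(1) again) for the merge.

#!/usr/bin/perl
# Rough sketch of the segmented "gang-sort": sort the index in chunks that
# fit in memory, then merge the sorted chunks. Names and sizes here are
# made up for illustration, not taken from the thread.
use strict;
use warnings;

my $idx        = 'big.idx';     # unsorted key/offset/length index from pass 1
my $chunk_size = 1_000_000;     # index records held in memory at a time

# Phase 1: sort fixed-size chunks of the index, one temp file per chunk.
open my $in, '<', $idx or die "open $idx: $!";
my (@chunks, @buf);
my $n = 0;
while (my $rec = <$in>) {
    push @buf, $rec;
    if (@buf == $chunk_size || eof $in) {
        my $file = sprintf 'idx-chunk-%03d.tmp', $n++;
        open my $out, '>', $file or die "open $file: $!";
        print {$out} sort @buf;   # whole-record sort ~= sort on leading key
        close $out;
        push @chunks, $file;
        @buf = ();
    }
}
close $in;

# Phase 2: simple k-way merge of the sorted chunks (a linear scan of the
# current heads is fine for a few dozen files).
my @fh   = map { open my $h, '<', $_ or die "open $_: $!"; $h } @chunks;
my @head = map { scalar readline $_ } @fh;
while (grep { defined } @head) {
    my ($min) = sort { $head[$a] cmp $head[$b] }
                grep { defined $head[$_] } 0 .. $#head;
    print $head[$min];                  # or feed it straight to the seek/read pass
    $head[$min] = readline $fh[$min];   # advance the stream we just consumed
}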
Have fun!