you can also throw the keys into separate files, sort the two files identically using an external disk sort, and process the two streams in a single sequential pass ... a classic “merge” operation. (What worked in the days of COBOL still works today.)

"Works" yes. Efficient? No.

That is exactly the "2 × O(N log N) sorts + an O(N) merge" process I was decrying.

Let's say that each file consists of 100 million records.

  1. Sort & merge: 2 * ( 1e8 * log10( 1e8 ) ) + 1e8 = 1.7 billion operations.
  2. Hash in memory: 3 * 1e8 = 0.3 billion operations.

So, theoretically, about 6 times longer. In reality, usually 15 or 20 times longer, because the two sorts themselves involve multiple passes and a merge phase that the simplistic equations above do not account for.
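For anyone who wants to check the arithmetic, here is a quick sketch (assuming log base 10, which is what produces the 1.7 billion figure):

<code>
#!/usr/bin/perl
# Back-of-envelope check of the two estimates above.
use strict;
use warnings;

my $N = 1e8;

# Two O(N log N) sorts plus an O(N) merge; Perl's log() is natural
# log, so divide by log(10) to get log base 10.
my $sort_merge = 2 * ( $N * log( $N ) / log( 10 ) ) + $N;

# One pass to load the hash, one to look up, one to output: ~3N.
my $hash_pass = 3 * $N;

printf "sort+merge: %.1f billion ops\n", $sort_merge / 1e9;  # 1.7
printf "hash pass:  %.1f billion ops\n", $hash_pass  / 1e9;  # 0.3
printf "ratio:      %.1f times\n", $sort_merge / $hash_pass; # ~5.7
</code>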

And given that many of those operations are I/O, still measured in milliseconds (little different from the '70s), whereas CPU operations are now measured in nanoseconds (versus the tens or hundreds of microseconds they took back in the '70s), it is easy to see that it is worth expending a lot of effort (human ingenuity and CPU cycles) to avoid moving to the disk-based approach.

If you perform the merge both ways, hash-in-memory vs. disk-sort-and-merge, on two files large enough that the former just fits within your given memory limit, then typically the latter will take close to 20 times longer. So when faced with the situation where one more record would push you beyond the memory limit, it is worth expending considerable effort to fit that extra record in memory and so avoid the 20x penalty.
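For concreteness, here is a minimal sketch of the hash-in-memory approach: load the keys of the smaller file into a hash, then stream the larger file once. The filenames and the key format (first tab-separated field) are illustrative assumptions, not taken from the thread:

<code>
#!/usr/bin/perl
use strict;
use warnings;

my %seen;

# Pass 1: record every key present in the (smaller) first file.
open my $A, '<', 'fileA.dat' or die "fileA.dat: $!";
while ( <$A> ) {
    my( $key ) = split /\t/, $_, 2;
    $seen{ $key } = 1;
}
close $A;

# Pass 2: stream the second file; emit the records common to both.
open my $B, '<', 'fileB.dat' or die "fileB.dat: $!";
while ( <$B> ) {
    my( $key ) = split /\t/, $_, 2;
    print if exists $seen{ $key };
}
close $B;
</code>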

For example, Perl's hashes are designed for very general-purpose usage, and as such are quite memory hungry. It is quite easy to come up with a hash-like structure that supports just the subset of operations required for this particular purpose while using substantially less memory. And even if it is implemented in pure Perl, and is therefore considerably slower than Perl's built-in hashes, if it allows you to fit that extra record in memory (and a 10-fold capacity gain is achievable), you will still save time overall by avoiding the 20-times slowdown.
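By way of illustration only (this is not BrowserUk's actual structure), one way such a reduced "hash" might look in pure Perl: fixed-width keys packed into a single sorted string, supporting only build-then-lookup via binary search. One string of N × KEYLEN bytes costs far less than N individual hash entries:

<code>
#!/usr/bin/perl
use strict;
use warnings;

use constant KEYLEN => 8;    # assumed fixed key width for the sketch

my $store = '';

sub build {                  # call once with all keys
    my @keys = @_;
    $store = join '', sort map { pack 'A' . KEYLEN, $_ } @keys;
}

sub contains {               # binary search over the packed string
    my $key = pack 'A' . KEYLEN, shift;
    my( $lo, $hi ) = ( 0, length( $store ) / KEYLEN - 1 );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $cmp = substr( $store, $mid * KEYLEN, KEYLEN ) cmp $key;
        return 1 if $cmp == 0;
        if ( $cmp < 0 ) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return 0;
}

build( qw( alpha beta gamma delta ) );
print contains( 'gamma' ) ? "found\n" : "missing\n";   # found
print contains( 'omega' ) ? "found\n" : "missing\n";   # missing
</code>

Lookups drop from O(1) to O(log N), but as argued above, that is a small price for staying in memory.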

Those old COBOL techniques worked and still do. But they were used back then because there was no feasible alternative, not because they were a good solution. At a pinch, when all else fails, they are a fall-back position. A last resort, not a first.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
