The data has to be read twice and written twice. The merging into the final form happens during the 2nd write.

I'm sorry, but yet again, no.

The only reason to sort is if the dataset is too large to fit in memory. Otherwise there is absolutely no good reason to use an O(N log N) sort algorithm when an O(N) hash algorithm does the job.
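For illustration only, here is a minimal sketch of the in-memory hash approach in Python. The record layout (a key, a space, then the rest of the line) and the name `hash_merge` are assumptions made for the example, not anything taken from the thread. Each input is read exactly once and nothing is sorted.

```python
def hash_merge(left_lines, right_lines):
    """Join two iterables of 'key rest' lines on their key.

    Hypothetical record layout: the key is everything before the first
    space. The left input is held in a dict, so it must fit in memory.
    """
    index = {}
    for line in left_lines:
        key, _, rest = line.rstrip("\n").partition(" ")
        index[key] = rest  # a later duplicate key overwrites an earlier one

    # Stream the right input; each lookup is O(1) on average.
    for line in right_lines:
        key, _, rest = line.rstrip("\n").partition(" ")
        if key in index:
            yield f"{key} {index[key]} {rest}"


if __name__ == "__main__":
    left = ["b 2\n", "a 1\n"]
    right = ["a x\n", "c z\n", "b y\n"]
    for merged in hash_merge(left, right):
        print(merged)
```

The output follows the order of the second input, which is the point: no ordering of either file is needed to match the records up.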

Only once the dataset has grown so large that it is impossible to hold all the records in memory at one time does the sort&merge algorithm have any merit whatsoever.

And if you cannot hold the whole dataset in memory, then you cannot sort it in memory in a single operation. So you use a disk sort that reads a subset of the data and writes the sorted subset to a temporary file. Then you read another subset into memory, sort it, and write the results to another temporary file. And so on, until you have S sorted subsets in temporary files. Now you need to merge those subsets together.

Read and sort, and write to temp; read from temp to merge and write to sorted. And that produces one sorted file. 2N reads; 2N writes; one sorted file.
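The disk sort described above can be sketched roughly as follows, again in Python. The function name `external_sort` and the `chunk_size` parameter are hypothetical, and the sketch assumes every line ends with a newline. It shows why each record is touched twice on the way in and twice on the way out: once when a sorted chunk is written to a temporary file, and once more when the chunks are read back and merged.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(lines, chunk_size):
    """Yield newline-terminated lines in sorted order.

    At most chunk_size lines are held in memory during the first pass;
    each sorted chunk is spilled to its own temporary file.
    """
    paths = []
    source = iter(lines)
    try:
        # Pass 1: read a subset, sort it, write it to a temp file.
        while True:
            chunk = sorted(itertools.islice(source, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            paths.append(path)

        # Pass 2: read every temp file back and merge the sorted runs.
        files = [open(p) for p in paths]
        try:
            yield from heapq.merge(*files)
        finally:
            for f in files:
                f.close()
    finally:
        for p in paths:
            os.remove(p)


if __name__ == "__main__":
    data = ["d\n", "a\n", "c\n", "b\n", "e\n"]
    print("".join(external_sort(data, chunk_size=2)), end="")
```

Writing the merged output to its final file is left to the caller here; in the accounting above, that is the second write.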

Now you need to repeat that for the second file. 4N reads; 4N writes; two sorted files.

Now you need to read both sorted files, merge them and write the final merged output.

That's 8N reads; and 8N writes; one resultant merged file.

Total 16N IOPs. Compared to 6N IOPs for the hash&memory algorithm.

And if, after my three attempts to explain this to you, you still cannot see it, please keep it to yourself, because you are simply wrong. Sorry.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^7: Working on huge (GB sized) files by BrowserUk
in thread Working on huge (GB sized) files by vasavi
