… So data gets read once and written once.

No. You forgot that the sort needs to merge the temporary files back into the sorted output file.

So each record is read once, sorted, and written once to a temporary file. It is then read again from that temporary file during the merge and written to the sorted output file. So that's two reads and two writes.

And the same has to be applied to the second file.

And then both sorted files have to be read, merged and written to the final merged file.
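
Tallied per record, on the numbers above, that comes to:

    sort pass:    read the input file,  write a temp file      = 1 read + 1 write
    merge pass:   read the temp file,   write the sorted file  = 1 read + 1 write
    final merge:  read the sorted file, write the merged file  = 1 read + 1 write
                                                        total  = 3 reads + 3 writes

against the single read and single write per record of the hash approach.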

I think the big difference is that in the hash table scenario, each disk operation is going to be getting a random record, maybe 2 different records from different parts of the file. Whereas the merge is going to be getting thousands of sequential records, and does this when one of its n queues runs dry.

Again, no. You seem to be assuming that the merge consists of: reading temp file 1 and writing it to the merged file; reading temp file 2 and writing it to the merged file; and so on. It doesn't.

It requires reading the first record (or block) from each of the temporary files; comparing the records from all the files against each other, selecting the lowest, and writing it; reading the next record from that file (or its cache); and repeating until the end of all the files.

There are no "thousands of sequential records". There may be occasional runs of sequential records from a given file, but each of those records needs to be compared against all of the next records from each of the other files before it is known whether it is sequentially the next record to be written or not. And sometimes there will be runs where no two sequential records in the output file come from the same input file.
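
To make that concrete, here is a minimal sketch of such an n-way merge loop in Perl. The file names, one-record-per-line layout and plain string ordering are illustrative assumptions only; it is not the code from any of the tests mentioned below.

    use strict;
    use warnings;

    # Open every (already sorted) temporary file and prime one record from each.
    my @fh   = map { open my $h, '<', $_ or die "$_: $!"; $h } @ARGV;
    my @head = map { scalar readline $_ } @fh;

    open my $out, '>', 'sorted.out' or die $!;   # hypothetical output name
    while ( grep { defined } @head ) {
        # Compare the pending record from every file and pick the lowest.
        my $min;
        for my $i ( 0 .. $#head ) {
            next unless defined $head[$i];
            $min = $i if !defined $min or $head[$i] lt $head[$min];
        }
        print {$out} $head[$min];
        # Refill only from the file that supplied the winner; which file that
        # is cannot be known until all the comparisons have been done.
        $head[$min] = readline $fh[$min];
    }

The single readline at the bottom of the loop is the only disk read per output record, and which handle it hits is decided by the data, not by the program; that is the unpredictable access pattern.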

It is exactly this entirely unpredictable, entirely random access pattern, which so badly affects caching heuristics, combined with the nearly three times as many IOPs, that makes sort&merge at least an order of magnitude slower than hash&memory, all else being equal.

Please note: that isn't based on theoretical guesswork, but on actual tests. And not just once. And not just recently. But many times, going back many years.

Almost exactly 30 years ago, we sped up the merging of 5 sets of 6 million records by 11 times using exactly this technique, on twin PDP-11/60s with (for the time, huge) "ganged" RK07 28MB disk packs. Because the data was bigger than the combined disks we could have on line, we merged the key-fields and record offsets in memory, loading each of the input disk packs once in turn and then the output pack once, rather than having to constantly swap input packs with temporary packs, then the input packs again, then the output pack, as the previous sort-all-5-then-merge process did.
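
For illustration, the key-fields-and-offsets idea might be sketched in modern Perl along these lines. The tab-delimited first-field key, the file names and the line-oriented layout are assumptions of mine, not a description of the original PDP-11 code:

    use strict;
    use warnings;

    my ( @fh, @index );    # @index holds [ key, file number, byte offset ]
    for my $i ( 0 .. $#ARGV ) {
        open $fh[$i], '<', $ARGV[$i] or die "$ARGV[$i]: $!";
        my $in = $fh[$i];
        while (1) {
            my $pos = tell $in;
            defined( my $line = <$in> ) or last;
            my ($key) = split /\t/, $line, 2;   # assumed: key is the first field
            push @index, [ $key, $i, $pos ];
        }
    }

    # The merge happens entirely in RAM, on the small index; only then are
    # the record bodies fetched, once each, for the single output pass.
    open my $out, '>', 'merged.out' or die $!;   # hypothetical output name
    for my $rec ( sort { $a->[0] cmp $b->[0] } @index ) {
        my $in = $fh[ $rec->[1] ];
        seek $in, $rec->[2], 0 or die "seek: $!";
        print {$out} scalar readline $in;
    }

Each input file is scanned sequentially once to build the index, no temporary files are written at all, and only the final merged output is written to disk.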

And I went through the same comparative process, using real files, disks and data, just a couple of months ago in response to a post right here on this site. And a couple of years before that on my previous machine.

And it is neither theory nor speculation when I tell you that the differential has never been below an order of magnitude; and because I/O speeds have stagnated whilst CPU speeds have risen exponentially (until recently), and RAM sizes have grown, and continue to grow, exponentially, the ratio of the performance gain of hash&memory over sort&merge is increasing.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
