FYI a simple merge-sort will take log(n)/log(2) passes through the data. You are describing a somewhat more intelligent merge that will take fewer passes. Plus you are eliminating most of the passes by using a different sort at once.

The merge is likely somewhat slower than you describe. The problem is how parallel read/write streams the drive can handle without resorting to streaming. My memory says that the limit is 4. Which means that you can at most read 3 chunks and write 1. If you try to do more, then your disk starts to thrash. If that is right, then it can handle 16 hunks in 3 passes through the dataset, where each pass has to both read and write all of the data. That is in addition to the original pass that sorts it into chunks. That means the dataset has to be streamed 8 times (4 times from disk, 4 times to). If streaming takes 1 minute per pass (that was what the OP said about his dataset) that would be 8 minutes. That is substantially faster than what I said, but I like to be conservative when I tell people that things will be slow because I don't want them to prematurely give up if it is sower than they expect.

Also note that a pure Perl implementation probably will have trouble sorting 500 MB chunks of data in RAM because of the memory overhead of internal data structures. However if you are reading from and writing to the filesystem, you can wind up creating, writing, reading and deleting a file with all operations happening in RAM without getting flushed to disk. If intelligently used, this fact can greatly increase performance because most passes do not touch disk. This can make a fairly naive merge-sort perform reasonably close to a much more optimized one.


In reply to Re^3: When the input file is huge !!! by tilly
in thread When the input file is huge !!! by sugar

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.