Apparently we have different assumptions about the number of temporary files needed, the size of those files, and the number of records that can be read from a file in one block.
And then both sorted files have to be read, merged and written to the final merged file.
The data has to be read twice and written twice; the merge into the final form happens during the second write. Now of course, if the data set is so humongous that multiple merge passes are required (too many files to do it in one pass), that's going to take more read/write cycles! It depends upon how humongous the data set is.

My assumption is that all the temp files can be merged in one operation: X records from each file are queued in memory, and you just select the lowest record off the top of all the queues. Repeat until one queue runs dry, then recharge it with another 100 MB of data (or however much fits in each queue), and so on. The key here is that issuing one read for a bunch of records is way faster than many trips through read(). Of course the catch is that the last record in a block may not be complete, and the code has to deal with that. A read cache is most likely counterproductive; what I'd want is memory-mapped I/O straight from the disk.
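
Here's a minimal Perl sketch of the kind of block-buffered merge I have in mind. Everything in it is a placeholder assumption on my part: newline-terminated records no longer than one block, a plain lexical compare, a 100 MB refill size, and a merged.out output name. I've used sysread rather than memory-mapped I/O just to keep the sketch portable.

#!/usr/bin/perl
use strict;
use warnings;

# Placeholder parameters: the real queue size, file names and the
# comparison would come from the actual data.
my $QUEUE_BYTES = 100 * 1024 * 1024;   # ~100 MB refill per queue
my @temp_files  = @ARGV;               # sorted temp files from the first pass
my $out_name    = 'merged.out';

open my $out, '>', $out_name or die "Can't write $out_name: $!";

# One filehandle and one in-memory queue of records per temp file.
my (@fh, @queue);
for my $i (0 .. $#temp_files) {
    open $fh[$i], '<', $temp_files[$i] or die "Can't read $temp_files[$i]: $!";
    $queue[$i] = [];
    refill($i);
}

while (1) {
    # Pick the queue whose head record sorts lowest (plain lexical compare
    # here; a numeric or multi-key compare would go in the same spot).
    my $min;
    for my $i (0 .. $#queue) {
        next unless @{ $queue[$i] };
        $min = $i if !defined $min || $queue[$i][0] lt $queue[$min][0];
    }
    last unless defined $min;                 # every queue is empty: done

    print {$out} shift @{ $queue[$min] };
    refill($min) unless @{ $queue[$min] };    # queue ran dry: recharge it
}
close $out or die "Error closing $out_name: $!";

# Read roughly $QUEUE_BYTES from one temp file in a single sysread and
# split it into whole records. The tail of the block may be a partial
# record, so seek back past it and let the next refill pick it up.
sub refill {
    my ($i) = @_;
    my $n = sysread($fh[$i], my $buf, $QUEUE_BYTES);
    return unless $n;                         # EOF (or error): queue stays empty

    my $cut = rindex($buf, "\n");             # last complete record boundary
    if ($cut >= 0 && $cut < length($buf) - 1 && $n == $QUEUE_BYTES) {
        sysseek($fh[$i], $cut + 1 - length($buf), 1);   # whence 1 = SEEK_CUR
        $buf = substr($buf, 0, $cut + 1);
    }
    push @{ $queue[$i] }, split /^/m, $buf;
}

With only a handful of temp files, the linear scan for the lowest head record is plenty; a heap only starts to matter when the file count gets large.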

If the size of the sort block is 600 MB and the size of the data is 3 GB, a simple approach results in 5 temp files (600 MB each). That's a small enough number that, with 100 MB merge queues, say, all 5 files can be open at once, and the merge requires 30 block-read (100 MB) operations. That's what I meant by sequential blocks: the files to be merged are read not record by record but as a series of records, what I called a "block". The number of records in a block is unknown; it depends upon the data. Reading the next, say, 100 MB of data from a file is more efficient than seeking around trying to pick off 1 KB pieces at random places.
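
Spelling that arithmetic out (using round decimal MB/GB, which is an assumption on my part):

use strict;
use warnings;
use POSIX qw(ceil);

my $data_bytes  = 3_000_000_000;   # 3 GB of input
my $chunk_bytes =   600_000_000;   # 600 MB sort block -> one temp file each
my $queue_bytes =   100_000_000;   # 100 MB merge queue per open temp file

my $temp_files  = ceil($data_bytes / $chunk_bytes);   # 5 temp files
my $block_reads = ceil($data_bytes / $queue_bytes);   # 30 block reads in the merge pass

printf "%d temp files, %d block reads to merge them\n", $temp_files, $block_reads;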

Of course all of these parameters matter. If things are huge and we get into, say, 50 temporary files, things are going to slow way down because more merge passes are required. Neither of us knows what any of these numbers really are in the OP's application.

And all of this has to do with how smart (or not) the system sort is on the OP's system. I wouldn't recommend that the OP recode it. For all I know, the OP has a .csv file and just needs to type in the proper one-liner to sort his data.
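
For instance, assuming a Unix-like box with GNU sort and a .csv keyed on (say) the second numeric column, something along these lines might be all that's needed (the key, buffer size, and file names are all guesses on my part):

sort -t, -k2,2n -S 600M -T /tmp huge.csv > sorted.csv

GNU sort already does essentially the chunk-and-merge dance described above, spilling sorted runs to temp files under -T and merging them, so there's a decent chance the problem is already solved for him.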

I think we've got the major issues out in the discussion.

