First, whichever way you sort 20 GB of data, you will need quite a bit of disk space for intermediate storage. Even a sort utility provided by your OS will create a number of temporary files (probably at least several hundred) on your disk. So the first thing your program should do is check that there is ample disk space wherever the temp files will go (or you need to check manually before launching the sort). These sort utilities usually remove their temporary files once they no longer need them, but they might not be able to do so if they crash really badly.
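For what it's worth, here is a rough sketch of such a disk-space check. It assumes a POSIX-like system where `df -Pk` is available; the sub names and the `$factor` safety margin are my own inventions, not anything from your code, and filesystem names containing spaces would confuse the parsing.

```perl
use strict;
use warnings;

# Pull the "Available" column (in KiB) out of the text printed by `df -Pk DIR`.
# POSIX layout: Filesystem 1024-blocks Used Available Capacity Mounted-on
sub available_kib {
    my ($df_output) = @_;
    my @lines = grep { /\S/ } split /\n/, $df_output;
    die "Unexpected df output\n" unless @lines >= 2;
    my @fields = split ' ', $lines[-1];
    die "Unexpected df output\n"
        unless @fields >= 6 && $fields[3] =~ /^\d+$/;
    return $fields[3];
}

# True if the filesystem holding $tmp_dir has at least $factor times the
# size of $input_path available. $factor is a hypothetical safety margin.
sub enough_space_for {
    my ($input_path, $tmp_dir, $factor) = @_;
    my $size = -s $input_path;
    die "Cannot stat $input_path\n" unless defined $size;

    # List-form pipe open: no shell involved, so no quoting problems.
    open my $df, '-|', 'df', '-Pk', $tmp_dir or die "Cannot run df: $!";
    my $text = do { local $/; <$df> } // '';
    close $df or die "df failed for $tmp_dir\n";

    return available_kib($text) * 1024 >= $size * $factor;
}

# Hypothetical usage:
# die "Not enough temp space\n" unless enough_space_for('big.txt', '/tmp', 2);
```

A factor of 2 is only a guess: how much space you really need depends on where the temp files and the final output end up.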

Second, what you describe is not really what I would call sorting, but rather dispatching your data into 43 x 21 buckets (assuming from your example that there are 21 secondary keys) and then merging the buckets in the specific order of the keys. This can be much faster than actual sorting.

I would suggest that you create 43 x 21 = 903 files in a temporary directory on disk. You then just read your file once and dispatch the records into the proper files. This requires opening 900+ filehandles. Perl has no problem with 1000 open filehandles (you'll have to use an array or hash of filehandles), so it should work; if you hit an operating system limit, then you'll have to go for two passes. Then it is just a matter of merging the files back in the right order, and deleting each file as soon as you no longer need it. If it crashes, all your files are in the same temp directory, so it is no big deal to get rid of them.
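To make the dispatch-and-merge idea concrete, here is a minimal sketch. I don't know your actual record format, so I am assuming one record per line, tab-separated, with the primary key in the first field and the secondary key in the second; the sub name and its arguments are made up for the example. It needs Perl 5.10 or later (for `//=`).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# ASSUMED layout: one record per line, tab-separated,
# primary key in field 0, secondary key in field 1.
sub bucket_sort_file {
    my ($in_path, $out_path, $primary_order, $secondary_order) = @_;

    # Map each key to its rank in the requested output order.
    my (%p_idx, %s_idx);
    @p_idx{@$primary_order}   = 0 .. $#$primary_order;
    @s_idx{@$secondary_order} = 0 .. $#$secondary_order;

    # CLEANUP => 1 removes the directory on normal exit; after a hard
    # crash it has to be deleted by hand.
    my $tmpdir = tempdir('buckets_XXXXXX', TMPDIR => 1, CLEANUP => 1);

    # Pass 1: read the input once, appending each record to its bucket.
    my %fh_for;    # "prank.srank" => open filehandle
    open my $in, '<', $in_path or die "Cannot read $in_path: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($primary, $secondary) = split /\t/, $line, 3;
        die "Unknown key in record: $line\n"
            unless defined $secondary
                && exists $p_idx{$primary} && exists $s_idx{$secondary};
        my $name = "$p_idx{$primary}.$s_idx{$secondary}";
        my $fh = $fh_for{$name} //= do {
            open my $new, '>', "$tmpdir/$name"
                or die "Cannot create bucket $name: $!";
            $new;
        };
        print {$fh} $line, "\n";
    }
    close $in;
    close $_ or die "Cannot close bucket: $!" for values %fh_for;

    # Pass 2: concatenate the buckets in key order, deleting each one
    # as soon as it has been copied.
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    for my $i (0 .. $#$primary_order) {
        for my $j (0 .. $#$secondary_order) {
            my $bucket = "$tmpdir/$i.$j";
            next unless -e $bucket;    # no record had this key pair
            open my $bfh, '<', $bucket or die "Cannot read $bucket: $!";
            while (my $rec = <$bfh>) { print {$out} $rec }
            close $bfh;
            unlink $bucket or warn "Cannot remove $bucket: $!";
        }
    }
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Hypothetical usage:
# bucket_sort_file('big.txt', 'sorted.txt', \@primary_keys, \@secondary_keys);
```

Bucket files are named by key rank rather than by the keys themselves, so odd characters in a key cannot produce an invalid file name. Records that share both keys come out in their original input order.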

I do not think that any other method can be faster than that: every record is read twice and written twice, sequentially, and no two records are ever compared.


In reply to Re: sorting type question- space problems by Laurent_R
in thread sorting type question- space problems by baxy77bax
