Ouch! Your machine must be slower; or your actual data more complex than my quick mockup. It only took about 20 minutes on my machine for approximately the same size file. Sorry for the bum steer.

Looking back at your OP, you might gain a little by changing your sort sub from:

my $sortscheme = sub { my @flds_a = split(/,,/, $Sort::External::a); my @flds_b = split(/,,/, $Sort::External::b); $flds_a[0] cmp $flds_b[0]; };

To:

my $sortscheme = sub { substr( $Sort::External::a, 0, 25 ) cmp substr( $Sort::External::b, 0, 25 ); };

But I wouldn't expect it to make a great difference.
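
If you want to convince yourself the two comparators order records identically before switching, here is a minimal sketch. The sample records are made up, and it assumes the first field is always padded to exactly 25 characters, which is what the substr version relies on. It uses plain sort with $a/$b rather than Sort::External, purely to keep it self-contained:

```perl
use strict;
use warnings;

# Hypothetical records: a 25-character first field, then ',,', then the rest.
my @records = (
    sprintf( '%-25s,,%s', 'pear',  'x' ),
    sprintf( '%-25s,,%s', 'apple', 'y' ),
    sprintf( '%-25s,,%s', 'fig',   'z' ),
);

# Original style: split each record and compare the first field.
my $by_split = sub { ( split /,,/, $_[0] )[0] cmp ( split /,,/, $_[1] )[0] };

# Cheaper style: compare the leading 25 characters directly.
my $by_substr = sub { substr( $_[0], 0, 25 ) cmp substr( $_[1], 0, 25 ) };

my @sorted_split  = sort { $by_split->( $a, $b ) } @records;
my @sorted_substr = sort { $by_substr->( $a, $b ) } @records;

print "same order\n" if "@sorted_split" eq "@sorted_substr";
```

If the first field is not a fixed 25 characters in your real data, the substr version would compare part of the separator or the next field too, so check that before relying on it.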


Another idea that might speed things up, though it will only work if your records are fixed length (as they appear to be in your samples).

The following is a single command and should be entered as a single line in your command prompt:

perl -nle"print substr( $_, 21 ) . substr($_, 0, 21 )" dataf | \windows\system32\sort /M 5242880 /+62 | perl -nle"print substr($_,-21) . substr($_,0,-21)" > dataf.sorted

It should run at very nearly the same speed as the original windows sort version, but this time I work around the no-key-length limitation of that sort program by:

  1. using perl to take the significant portion of each record and swap it to the end;
  2. using the /+62 offset parameter to sort on only that portion of the record;
  3. using perl again to switch the two parts back around.

As the two perl processes are O(N) and run substantially in parallel with the O(N log N) sort, the runtime should be little changed from the original windows sort version -- assuming that was actually quicker for you.
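
To see the rotate / sort / rotate-back idea in isolation, here is a rough pure-Perl sketch. The records, the 10-character payload and the sub names are all invented for illustration, and the in-memory sort is only a stand-in for the external sort program:

```perl
use strict;
use warnings;

my $KEY_LEN = 21;    # assumed key width, as in the one-liner above

# Move the leading key to the end of the record.
sub rotate_key_to_end {
    my ($rec) = @_;
    return substr( $rec, $KEY_LEN ) . substr( $rec, 0, $KEY_LEN );
}

# Move the trailing key back to the front of the record.
sub rotate_key_to_front {
    my ($rec) = @_;
    return substr( $rec, -$KEY_LEN ) . substr( $rec, 0, -$KEY_LEN );
}

# Hypothetical fixed-length records: 21-character key + 10-character payload.
my @records = map { sprintf '%-21s%-10s', @$_ }
    ( [ 'key-c', 'third' ], [ 'key-a', 'first' ], [ 'key-b', 'second' ] );

# 0-based offset at which the key starts once it has been rotated to the end.
my $offset = length( $records[0] ) - $KEY_LEN;

my @rotated = map { rotate_key_to_end($_) } @records;

# Stand-in for the external sort: compare only from the key offset onwards.
my @sorted = sort { substr( $a, $offset ) cmp substr( $b, $offset ) } @rotated;

my @restored = map { rotate_key_to_front($_) } @sorted;
print "$_\n" for @restored;
```

Note that the windows sort /+n switch counts characters from 1, whereas substr counts from 0, so the value to pass to /+n would be $offset + 1.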


Another approach -- assuming that 12GB machine of yours also has multiple processors and can see multiple disks -- would be to split the file into N chunks, one per processor; distribute those chunks onto different drives; and then run gnusort on each of the N chunks concurrently.

When the N files are sorted, you can use the gnusort switch -m to merge the partial sorts together into a single, fully sorted file.

But having multiple drives is crucial for this to work. Otherwise, IO contention will probably kill any gains from the parallelism.
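
The shape of the split / sort / merge scheme can be sketched in a few lines of Perl. This is only an in-memory toy on invented data: the per-chunk sort stands in for one gnusort process per chunk, the merge stands in for what the -m switch does, and the sub names are my own:

```perl
use strict;
use warnings;

# Deal the lines out into $n chunks (round-robin, for simplicity).
sub split_into_chunks {
    my ( $n, @lines ) = @_;
    my @chunks = map { [] } 1 .. $n;
    push @{ $chunks[ $_ % $n ] }, $lines[$_] for 0 .. $#lines;
    return @chunks;
}

# Merge already-sorted chunks by repeatedly taking the smallest head element.
sub merge_sorted {
    my @chunks = map { [@$_] } @_;    # copy so the inputs are left untouched
    my @out;
    while ( my @live = grep { @$_ } @chunks ) {
        my ($min) = sort { $a->[0] cmp $b->[0] } @live;
        push @out, shift @$min;
    }
    return @out;
}

my @lines  = qw( pear apple fig kiwi banana cherry date );
my @chunks = split_into_chunks( 3, @lines );

# Stand-in for running one sort process per chunk concurrently.
my @sorted_chunks = map { [ sort @$_ ] } @chunks;

my @merged = merge_sorted(@sorted_chunks);
print "@merged\n";
```

The merge step only ever looks at the head of each chunk, which is why merging N sorted files is so much cheaper than sorting the whole lot again.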


Finally, assuming you are going to be doing this regularly -- this thread would be a waste of time if you are doing it only once :) -- then probably the best gains you could get would be the purchase of a SSD.

The fastest of these are now up to several hundred (if not a thousand) times faster than hard drives, though the fastest are the PCIe-based devices, which tend to cost £1000+.

But even the much cheaper consumer-grade devices -- ~£100 for a 60GB sata3 -- are a couple of hundred times faster and will make a considerable difference to your elapsed time.

Combine one of those with the above split - parallel sort - merge mechanism -- which works well even with all the chunks on the same SSD, as SSDs have no heads and so no seek-time losses -- and you ought to be able to cut your runtime to something close to 32mins / number of processors.

I hope at least one of these ideas is useful to you.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?


In reply to Re^9: external sort performance improved? by BrowserUk
in thread external sort performance improved? by rkshyam
