in reply to perl ST sort performance issue for large file?

As already identified, the ST (Schwartzian Transform) works fine for smallish datasets, but on large datasets it consumes prodigious amounts of memory for the millions of small anonymous arrays it creates.
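For anyone following along, this is the basic ST shape -- the @records array and the single whitespace-split key here are illustrative, not the OP's actual data:

    # Schwartzian Transform: wrap each record in a small anonymous array
    # holding the record plus its pre-computed sort key; sort on the
    # cached key; unwrap.  With millions of records, those per-record
    # anonymous arrays are where the memory goes.
    my @sorted = map  { $_->[0] }                     # 3. unwrap
                 sort { $a->[1] cmp $b->[1] }         # 2. compare cached keys
                 map  { [ $_, (split ' ', $_)[0] ] }  # 1. wrap: [ record, key ]
                 @records;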

In addition to that, you are compounding things by creating multiple, huge, stack-based lists in the following line:

    for $key ( map  { $_->[0] }                                      # list1
               sort { $a->[1] cmp $b->[1] || $a->[2] cmp $b->[2] }   # list2
               map  { [ $_, (split)[0], (split)[2] ] }               # list3 -- the same split, duplicated
               keys %hash )                                          # list4

And duplicating effort by performing the same split twice for every record.
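A minimal fix for that last point, keeping the ST structure otherwise intact, is to take both fields from a single split with a list slice:

    for $key ( map  { $_->[0] }
               sort { $a->[1] cmp $b->[1] || $a->[2] cmp $b->[2] }
               map  { [ $_, (split)[0, 2] ] }   # one split; slice out fields 0 and 2
               keys %hash )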

If you moved your code to a GRT (Guttman-Rosler Transform; see A brief tutorial on Perl's native sorting facilities), it would probably use about 1/4 of the memory and run in about 1/10th of the time.
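Sketched below is what a GRT version of that loop might look like. The fixed A10 key widths are an assumption -- size them to the longest values of fields 0 and 2 -- and packing with 'A' templates assumes plain, left-justified string keys (numeric keys would need a different encoding):

    # GRT: pack the sort keys to fixed widths in front of each record,
    # sort the resulting flat strings with the default (no-callback)
    # lexical sort, then chop the keys back off.  One flat string per
    # record instead of one anonymous array per record.
    for $key ( map  { substr $_, 20 }                          # strip the 2 x 10-byte keys
               sort                                            # default string sort
               map  { pack 'A10 A10 A*', (split)[0, 2], $_ }   # prepend packed keys
               keys %hash )

Because the default sort is handed plain strings with no comparator block, it avoids both the per-comparison Perl callback and the per-record anonymous arrays, which is where the memory and speed wins come from.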


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^2: perl ST sort performance issue for large file?
by AnomalousMonk (Archbishop) on Mar 30, 2012 at 12:06 UTC

    salva's suggestion of using the system sort followed by duplicate line consolidation seems the ideal way to go. It's a multi-key sort, but the keys don't seem wonky to the point that sort would choke on them; even if they were, an intermediate processing step to produce a 'normalized' file to feed to sort would, while a bit tedious, be fairly straightforward, quick, and highly scalable. Am I missing something?
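    If the keys did need normalizing, one hypothetical shape for that step: a one-pass filter that prefixes each line with fixed-width copies of the key fields (the field numbers and widths below are made up), so the external sort can treat the whole line as a single key; a second pass strips the prefix:

        # Hypothetical normalization pass to feed a single-key external sort.
        use strict;
        use warnings;

        while ( my $line = <STDIN> ) {
            chomp $line;
            my @f = split ' ', $line;
            printf "%-10s%-10s%s\n", $f[0], $f[2], $line;  # fixed-width key prefix
        }

        # Then sort on the whole line and strip the 20-byte prefix:
        #   sort normalized.txt | perl -pe 'substr $_, 0, 20, ""' > final.txt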

      salva pretty much wrote the book on non-native sorting, so I wasn't contradicting him; I was only trying to explain to the OP why his code might be taking so long.

      That said, it is generally faster to sort data internally in Perl (assuming sufficient memory) than to shell out to do it -- otherwise there would be no logic in having a built-in sort at all.

      With the OP having 12GB of RAM and wanting to sort a datafile of only 3GB; with the GRT usually requiring minimal extra memory over the storage needed to build the array in the first place; and with sort's ability to do its thing in-place on modern perls; I would not be at all surprised -- though I would have to benchmark to be sure(*) -- if using the GRT for the OP's purpose beat using an external sort.

      That is, unless you have access to a version of (GNU) sort that isn't living in the stone age, using only a few tens of MB when the machine it is running on has tens of GB available.

      And as the OP is a Windows user, unless he has found a version (of GNU sort) that I haven't succeeded in finding, he's out of luck on that score. Whilst the Windows-supplied sort utility is very capable, and will use as much memory as is available, it is limited to a single sort key.

      (*)Which I would do, but I'm currently about 10% of the way through thrashing the life out of my machine running something that will take around 60 hours to complete.



        ... I'm currently about 10% of the way through thrashing the life out of ... something that will take around 60 hours to complete.

        ...and the best of British luck to you. Whenever I try to do something like that, my code usually crashes at about the 98% mark.