rkshyam has asked for the wisdom of the Perl Monks concerning the following question:

I have written a sorting script using an external sort; the code is below. It is currently being used to sort a large file of nearly 5GB, and the sort takes 32 minutes on 64-bit Windows (64-bit ActivePerl) with 12GB RAM. I would like to know whether the performance can be improved, or whether this is already reasonable. Does my code need any modification? I have also attached sample data to sort. Please help.

input lines:

    2012/12/13 @ 13:32:35,585 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer
    2012/12/13 @ 13:32:34,585 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer
    2012/12/13 @ 12:32:35,485 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer
    2012/12/13 @ 13:35:35,585 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer
    2012/12/13 @ 14:32:35,585 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer
    2012/12/15 @ 13:32:35,612 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer
    2012/12/12 @ 11:32:35,585 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer
    2012/10/13 @ 13:32:45,735 @ ,, INFO [EJB3Deployer] Starting java:comp + multiplexer

code:

    $\ = "\n";
    $, = "\t";

    use strict;
    use warnings;
    use Sort::External;

    print "Script start time is\n ", scalar localtime();

    open DATA,   "sort_input.txt";
    open OUTPUT, ">>sort_output.txt";

    # Compare records on the field before the first ',,' (the timestamp).
    my $sortscheme = sub {
        my @flds_a = split(/,,/, $Sort::External::a);
        my @flds_b = split(/,,/, $Sort::External::b);
        $flds_a[0] cmp $flds_b[0];
    };

    #my $temp_directory = '/home/david/temp';
    my $sortex = Sort::External->new(
        mem_threshold => 1024**2 * 16,
        sortsub       => $sortscheme,
        #working_dir  => $temp_directory,
    );

    # Feed every input line, then fetch them back in sorted order.
    while (<DATA>) {
        chomp;
        $sortex->feed($_);
    }
    $sortex->finish;

    while ( defined( $_ = $sortex->fetch ) ) {
        print OUTPUT $_;
    }

    close DATA;
    close OUTPUT;

    print "Script end time is\n ", scalar localtime();

Replies are listed 'Best First'.
Re: external sort performance improved?
by BrowserUk (Patriarch) on Apr 16, 2012 at 09:41 UTC

    This will do the same job, and probably substantially faster (/M is the working memory given to the sort utility, in kilobytes -- 5242880 KB is 5GB -- and /O names the output file):

    \windows\system32\sort /m 5242880 sort_input.txt /O sort_output.txt

    Update: On the nearest somewhat equivalent file I had -- 6GB -- it took 14 minutes.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I am currently hitting an insufficient-memory error when I run this command on my system. I will debug the issue and let you know. By the way, can this command be used in a script and run, or is it a Perl one-liner? Also, how is it different from the external sort that I have used versus what you have mentioned? Please clarify.

        I am currently hitting an insufficient-memory error when I run this command on my system.

        With 12GB of memory and a 5GB file, this should not be happening.

        When you hit an error, if you post the error message you receive -- cut&paste rather than paraphrased -- you may get a quick solution to your problem.

        By the way, can this command be used in a script and run?

        What kind of script?

        or is it a Perl one-liner?

        It is a bog-standard Windows command.

        It can be invoked: from the command line; from a batch script; from a perl script; or in any other way a system command can be invoked.
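        For instance, a minimal sketch of driving it from a Perl script (assuming sort.exe is on the PATH and the same file names as above):

            use strict;
            use warnings;

            # Call the native Windows sort utility: /M is the working memory
            # in kilobytes, /O is the output file.
            my @cmd = ( 'sort', '/M', '5242880', 'sort_input.txt',
                        '/O', 'sort_output.txt' );

            system(@cmd) == 0
                or die "sort failed: exit status $?";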

        Also, how is it different from the external sort that I have used versus what you have mentioned?

        The Perl script you showed calls back into Perl for every comparison, and (unnecessarily) re-splits two lines for every comparison.

        Assuming your example snippet lines are representative of the whole file, and assuming the usual N*log2(N) comparisons are required to sort it, that means you are calling back into Perl roughly 1.5 billion times and re-splitting lines roughly 3 billion times.
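        To put rough numbers on that (assuming lines of about 95-100 bytes, like the sample lines): 5GB is roughly 50-55 million lines, log2 of that is about 26, so N*log2(N) works out to around 1.5 billion comparisons, and each comparison splits two lines -- hence the ~3 billion splits.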

        It is unsurprising that a dedicated sort utility that doesn't need to do either of those things will run more quickly.

        Please clarify.

        You are sorting your data by the one field that appears at the beginning of each record, so there is no need to split the records in order to sort them correctly.
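        As a rough sketch of that idea (same file names as your script; the larger mem_threshold is just an illustrative choice), you can drop the sortsub entirely and let Sort::External fall back to its default lexical comparison:

            use strict;
            use warnings;
            use Sort::External;

            open my $in,  '<', 'sort_input.txt'  or die "open input: $!";
            open my $out, '>', 'sort_output.txt' or die "open output: $!";

            # No sortsub: the timestamp key is the leading field of every line,
            # so the default lexical sort already puts the records in order.
            my $sortex = Sort::External->new( mem_threshold => 1024**2 * 64 );

            while (<$in>) {
                chomp;
                $sortex->feed($_);
            }
            $sortex->finish;

            while ( defined( my $line = $sortex->fetch ) ) {
                print {$out} $line, "\n";
            }

            close $in;
            close $out;

        Each comparison is then a plain string compare inside the sort, with no Perl callback and no splitting.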

