How to sort a large flat file 90mb with PERL-- various ways/tradeoffs/watchouts

maxon11 has asked for the wisdom of the Perl Monks concerning the following question:

... DEAR PERL MONKS , I am a newbie to PERL who has kind of dove into the deep end of the pool and have done so without fully reading up on all PERL books and materials.. But I think my questions are pretty serious and should be easy for a PERL veteran like yourself to answer --- below is a LOG of a Problem report I submitted today regarding a SORT pm module called: File-Sort I found on the web using google at CPAN. I am using the sort.pudge file to get around not being able to install the module myself to @INC which is locked down at my site. File-Sort version 1.00 or 1.01 ????

This is perl, version 5.005_03 built for aix
AIX pbsxdr00041 2 5 000258CA4C00
>perl   sort.pudge     h2z.prod.400987aa  > h2z.prod1a  
Use of uninitialized value at sort.pudge line 266.
[download]

file in question is 400, 987 records h2z.prod.400987aa with a record length of 103 and a sample record looks like this:

20040623CC255533          3130              268713110       3       20
+040623    
    8.96        1.000
[download]

when i run it like this i get a different error msg:

>perl -w  sort.pudge -d    h2z.prod.400987aa  > h2z.prod1a
Out of memory!
[download]

Is the program truly out of memory or is this a bug in the package or in the implementation or in the module itself?? This code which seems to be doing the same thing is only 10 lines long and can work on a file 3 times this size that is 90mb large.. when all other sort perl programs I could find run out of memory like this one:

"perl_sort.pl" 7 lines, 301 characters 
#/usr/bin/perl
open(MYINPUTFILE, "<h2z_zdl0.000"); # open="open" for="for" input="inp
+ut"
my(@lines)="<MYINPUTFILE">; # read file into list
@lines = sort(@lines);          # sort the list
my($line);foreach $line (@lines) # loop thru list
{   print "$line";                  # print in sort order
} close(MYINPUTFILE);
[download]

/opt/vrs/files/PROD_8 PBN2I]
>perl -w   perl_sort.pl 
Out of memory!
[download]

++++++++++++++++++++++++++++++++THIS PROGRAM WORKS THOUGH --- why?????

#!/usr/bin/perl
open (ORIGFILE,
"H2Z_ZDL0.000");
open (FINALFILE, ">sorted.txt");
print FINALFILE
sort(<origfile>);
close (FINALFILE);
close (ORIGFILE);
[download]

Thanks!
Andrew
andrew.mcgrath@pepsi.com
jun29th 2004.

The 2nd part regards an email I sent to Randal Schwartz that he answered but did not give me the fullest set of details I was looking for, please see below: Any more help would be greatly appreciated by me !!

>>>>> "Andrew" == McGrath, Andrew {PBG} <andrew.mcgrath@pepsi.com> writes:

Andrew> #!/usr/bin/perl
Andrew> open (ORIGFILE,
Andrew> "H2Z_ZDL0.000");
Andrew> open (FINALFILE, ">sorted.txt");
Andrew> print FINALFILE   <<-- what is this doing
Andrew> sort(<origfile>);  <<-- what is this doing
Andrew> close (FINALFILE);
Andrew> close (ORIGFILE);

perldoc of the sort function in readmore

$ perldoc -t -f sort
   sort SUBNAME LIST
   sort BLOCK LIST
   sort LIST
           In list context, this sorts the LIST and returns the sorted
+ list
           value. In scalar context, the behaviour of "sort()" is
           undefined.

           If SUBNAME or BLOCK is omitted, "sort"s in standard string
           comparison order. If SUBNAME is specified, it gives the nam
+e of
           a subroutine that returns an integer less than, equal to, o
+r
           greater than 0, depending on how the elements of the list a
+re to
           be ordered. (The "<=>" and "cmp" operators are extremely us
+eful
           in such routines.) SUBNAME may be a scalar variable name
           (unsubscripted), in which case the value provides the name 
+of
           (or a reference to) the actual subroutine to use. In place 
+of a
           SUBNAME, you can provide a BLOCK as an anonymous, in-line s
+ort
           subroutine.

           If the subroutine's prototype is "($$)", the elements to be
           compared are passed by reference in @_, as for a normal
           subroutine. This is slower than unprototyped subroutines, w
+here
           the elements to be compared are passed into the subroutine 
+as
           the package global variables $a and $b (see example below).
+ Note
           that in the latter case, it is usually counter-productive t
+o
           declare $a and $b as lexicals.

           In either case, the subroutine may not be recursive. The va
+lues
           to be compared are always passed by reference, so don't mod
+ify
           them.

           You also cannot exit out of the sort block or subroutine us
+ing
           any of the loop control operators described in perlsyn or w
+ith
           "goto".

           When "use locale" is in effect, "sort LIST" sorts LIST acco
+rding
           to the current collation locale. See perllocale.

           Perl 5.6 and earlier used a quicksort algorithm to implemen
+t
           sort. That algorithm was not stable, and *could* go quadrat
+ic.
           (A *stable* sort preserves the input order of elements that
           compare equal. Although quicksort's run time is O(NlogN) wh
+en
           averaged over all arrays of length N, the time can be O(N**
+2),
           *quadratic* behavior, for some inputs.) In 5.7, the quickso
+rt
           implementation was replaced with a stable mergesort algorit
+hm
           whose worst case behavior is O(NlogN). But benchmarks indic
+ated
           that for some inputs, on some platforms, the original quick
+sort
           was faster. 5.8 has a sort pragma for limited control of th
+e
           sort. Its rather blunt control of the underlying algorithm 
+may
           not persist into future perls, but the ability to character
+ize
           the input or output in implementation independent ways quit
+e
           probably will. See sort.

           Examples:

               # sort lexically
               @articles = sort @files;

               # same thing, but with explicit sort routine
               @articles = sort {$a cmp $b} @files;

               # now case-insensitively
               @articles = sort {uc($a) cmp uc($b)} @files;

               # same thing in reversed order
               @articles = sort {$b cmp $a} @files;

               # sort numerically ascending
               @articles = sort {$a <=> $b} @files;

               # sort numerically descending
               @articles = sort {$b <=> $a} @files;

               # this sorts the %age hash by value instead of key
               # using an in-line function
               @eldest = sort { $age{$b} <=> $age{$a} } keys %age;

               # sort using explicit subroutine name
               sub byage {
                   $age{$a} <=> $age{$b};  # presuming numeric
               }
               @sortedclass = sort byage @class;

               sub backwards { $b cmp $a }
               @harry  = qw(dog cat x Cain Abel);
               @george = qw(gone chased yz Punished Axed);
               print sort @harry;
                       # prints AbelCaincatdogx
               print sort backwards @harry;
                       # prints xdogcatCainAbel
               print sort @george, 'to', @harry;
                       # prints AbelAxedCainPunishedcatchaseddoggoneto
+xyz

               # inefficiently sort by descending numeric compare usin
+g
               # the first integer after the first = sign, or the
               # whole record case-insensitively otherwise

               @new = sort {
                   ($b =~ /=(\d+)/)[0] <=> ($a =~ /=(\d+)/)[0]
                                       ||
                               uc($a)  cmp  uc($b)
               } @old;

               # same thing, but much more efficiently;
               # we'll build auxiliary indices instead
               # for speed
               @nums = @caps = ();
               for (@old) {
                   push @nums, /=(\d+)/;
                   push @caps, uc($_);
               }

               @new = @old[ sort {
                                   $nums[$b] <=> $nums[$a]
                                            ||
                                   $caps[$a] cmp $caps[$b]
                                  } 0..$#old
                          ];

               # same thing, but without any temps
               @new = map { $_->[0] }
                      sort { $b->[1] <=> $a->[1]
                                      ||
                             $a->[2] cmp $b->[2]
                      } map { [$_, /=(\d+)/, uc($_)] } @old;

               # using a prototype allows you to use any comparison su
+broutine
               # as a sort subroutine (including other package's subro
+utines)
               package other;
               sub backwards ($$) { $_[1] cmp $_[0]; }     # $a and $b
+ are not \
set here

               package main;
               @new = sort other::backwards @old;

               # guarantee stability, regardless of algorithm
               use sort 'stable';
               @new = sort { substr($a, 3, 5) cmp substr($b, 3, 5) } @
+old;

               # force use of mergesort (not portable outside Perl 5.8
+)
               use sort '_mergesort';  # note discouraging _
               @new = sort { substr($a, 3, 5) cmp substr($b, 3, 5) } @
+old;

           If you're using strict, you *must not* declare $a and $b as
           lexicals. They are package globals. That means if you're in
+ the
           "main" package and type

               @articles = sort {$b <=> $a} @files;

           then $a and $b are $main::a and $main::b (or $::a and $::b)
+, but
           if you're in the "FooPack" package, it's the same as typing

               @articles = sort {$FooPack::b <=> $FooPack::a} @files;


           The comparison function is required to behave. If it return
+s
           inconsistent results (sometimes saying $x[1] is less than $
+x[2]
           and sometimes saying the opposite, for example) the results
+ are
           not well-defined.

           Because "<=>" returns "undef" when either operand is "NaN"
           (not-a-number), and because "sort" will trigger a fatal err
+or
           unless the result of a comparison is defined, when sorting 
+with
           a comparison function like "$a <=> $b", be careful about li
+sts
           that might contain a "NaN". The following example takes
           advantage of the fact that "NaN != NaN" to eliminate any "N
+aN"s
           from the input.

               @result = sort { $a <=> $b } grep { $_ == $_ } @input;
[download]

Andrew> How does perl know to default sort the whole record which is
Andrew> 103 chars in length in alphabetical order and how does it do
Andrew> so without a complicated sort function hand-written for the
Andrew> file in question to assist it??

Read "Learning Perl".

Andrew> IS this program using main memory or some other temp file
Andrew> trick or pipes ???  because it did not fail when sorting a
Andrew> 90mb file but I tried 4 other perl programs off the web and
Andrew> numerous combos of unix sort/exec/xarg and got nowhere at all
Andrew> !!!  Can I get a moment of your time and a simple set of
Andrew> answers here .... please.  Thank You Again!  Andrew McGrath

It's all in memory.

Any and all help, resources you could throw at this issue/help assist me ... I would forever be indebted and would not mind to assist you in one of your OSDN/Open Source/Perl Projects as fair payback. Thanks Again for your time and interest. Sincerely.

Andrew, perl coder in training. P.S. My 1st love in Unix and Korn shell but I am determined to rise to a high-level of proficiency with Perl before 2004 closes out on ,me in order to advance my career, knowledge, skills and most of all to save me time and make me more efficient and better at what I do .. which is being a 7X24Unix BackEnd Support Engineer for a 1.6 terabyte Oracle -Vendor driven Selling Database at PBG.

Andrew M. McGrath
CPS IT SUPPORT
PBG - SOMERS NY
"Always Moving Forward"

Edit by castaway - hide html headers, fixed up formatting, restore email content without signature, replace long lines of stars with hrs

Comment on How to sort a large flat file 90mb with PERL-- various ways/tradeoffs/watchouts Select or Download Code

Replies are listed 'Best First'.
Re: How to sort a large flat file 90mb with PERL-- various ways/tradeoffs/watchouts by delirium (Chaplain) on Jun 30, 2004 at 01:36 UTC
If you aren't able to modify @INC or get modules installed, and have a lot of data to sort, perl may not be the best answer for you. Do you have Unix on that AIX box? If so, I recommend sticking with the Unix sort program. A command like: `sort file.txt > file.sorted.txt` should do the trick. Unix sort is handy with large files, breaking them down to many smaller files and reassembling them as it moves. It uses a good balance of cpu, ram, and drive space, in my experience. I used to use it to help sift through data on some syslog servers at my old job, and actually replaced a production perl script with a Unix grep/sort one-liner due to the performance boost (over my sloppy perl script, that is).	[reply] [d/l]
Re: How to sort a large flat file 90mb with PERL-- various ways/tradeoffs/watchouts by tachyon (Chancellor) on Jun 30, 2004 at 02:44 UTC
I agree that native unix sort is probably the best idea. You can install modules see A Guide To Installing Modules for details on how to install locally for yourself. If you have memory to burn you could just slurp the file into an array and sort it using perl's sort: `#!/usr/bin/perl open F, $ARGV[0] or die "Usage $0 infile > outfile\n"; @ary = <F>; close F; print for sort @ary;` [download] This code will almost certainly be significantly slower and use a lot more memory than native unix sort. ~~Because we output the sorted results immediately we only have one array in memory.....~~ cheers tachyon	[reply] [d/l]
Re^2: How to sort a large flat file 90mb with PERL-- various ways/tradeoffs/watchouts by theorbtwo (Prior) on Jun 30, 2004 at 07:16 UTC
Actually, you still have two arrays in memory. `print for (sort @ary);` creates a new array to hold the results of the sort, and can't throw out the value of @ary (until after the script ends), because the variable @ary is still in scope. `@ary=sort @ary` doesn't have this problem, in theory, but up to somewhen in the 5.8.x series (5.8.2, IIRC), it created a temporary array anyway. In any case, yes, native sort is probably faster then perl sort for this.	[reply] [d/l] [select]
Re: How to sort a large flat file 90mb with PERL-- various ways/tradeoffs/watchouts by jarich (Curate) on Jun 30, 2004 at 09:47 UTC
First of all, welcome to Perl maxon11. You would be vastly assisted by reading the various tutorials on this site, and getting yourself a good book about Perl. Randal Schwartz's "Learning Perl" might be perfect. I particularly recommend that you read the On asking for help and How (Not) To Ask A Question nodes as I feel that your questions here could have benefitted by your being more brief. At the very least, you can probably depend on us having access to the sort documentation already. ;) Even if Randal did sent it to you. The uninitialized value warning from sort.pudge.pl is probably a minor bug. If that's all it gave you however, then I wouldn't worry too much. It looks like it otherwise sorted your file. In the second case - yes you probably ran out of memory. How much memory are your programs allowed to take up? If you're on a Unix-like operating system then you can usually type in "ulimit" on the command line and find out. Mind you, if you're using a Unix-like operating system then you should probably use the unix sort. :) The difference between the two sort code listings that you provide is that the first makes several copies of the data in memory whereas the second does not. To explain how the second program works, I'll reformat it and add in some comments. I've also made some slight changes to make it a better program generally. #!/usr/bin/perl my $input = "H2Z_ZDL0.000"; my $ouput = "sorted.txt"; # Open $input for reading. open(ORIGFILE, "<", $input) or die "Could not open $input: $!"; # Open $output for writing, destroying current file contents open (FINALFILE, ">", $output) or die "Could not open $output: $!"; # This line does several things. It reads all the lines # from ORIGFILE into memory (which is done in the <ORIGFILE> # bit), sorts them (using sort) and then prints them out # to the file in FINALFILE. print FINALFILE sort(<ORIGFILE>); # close file in FINALFILE, flushes buffer close (FINALFILE); # close file in ORIGFILE close (ORIGFILE); [download] You ask how Perl knows to default sort the whole record in alphabetical order. This is answered right up the top of the sort documentation: If SUBNAME or BLOCK is omitted, "sort"s in standard string comparison order. That is, if you write `sort @array` then sort will sort alphabetically. I would presume that you actually want it to sort numerically. You can do this by writing: `sort { $a <=> $b} @array` just like it says in the documentation. Good luck with learing Perl. Hope this helps jarich	[reply] [d/l] [select]