ramish has asked for the wisdom of the Perl Monks concerning the following question:

I posted a question on EBCDIC sort earlier (node id=616606) and got many replies. I tried one of the suggested solutions and it worked, but now I am facing another problem.

I am sorting a file that is more than 400MB in size, using multiple keys. While executing, the sort fails with a core dump and sometimes an "illegal instruction" message.

I did some investigation and found out that it is due to insufficient memory.

I am running it on an AIX server, and when I ran ulimit -a it reported the memory limit as 65536 bytes. I have unlimited file size permission. When I reduced the input file to about 60K, the sort worked. I have checked, and I cannot change the malloc() settings or reset the limit at runtime.

use strict;
use Encode qw(encode decode);

### Define the sort key here ###

# Sorts in ascending order.
sub key1 {
    ( substr( $a, 3, 17 )) cmp ( substr( $b, 3, 17 ));
}

# Sorts descending order.
sub key2 {
    ( substr( $b, 20, 2 )) cmp ( substr( $a, 20, 2 ));
}

### Sort processing starts ###

my @infile = <>;    # Reads file

### Multiple sort keys can be defined and sorted in the order of the key
my @sorted = map  { decode('cp1047', $_) }
             sort { key1 || key2 }
             map  { encode('cp1047', $_) } @infile;

print @sorted;
I got another snippet of code from an earlier post on the same topic, for sorting large files. The code is:

#!/usr/bin/perl -sw
use vars qw/$N/;
use strict;
use sort "stable";
use Encode qw(encode decode);
no strict 'refs';
$|++;

sub key1 {
    ( substr( $a, 3, 17 )) cmp ( substr( $b, 3, 17 ));
}

# Sorts in descending order.
sub key2 {
    ( substr( $b, 20, 2 )) cmp ( substr( $a, 20, 2 ));
}

my $reclen = 8072;    #! Adjust to suit your records/line ends.
$N = $N || 1;

warn "Usage: $0 [-N=n] file\n" and exit(-1) unless @ARGV;
warn "Reading input file $ARGV[0] ", -s $ARGV[0], "\n";

if ( not defined $ARGV[1] ) {
    warn "Output file not specified. Continue [N|y]?";
    exit -1 if <STDIN> !~ /^Y/i;
}

$/ = \$reclen;
open INPUT, '<', $ARGV[0] or die $!, $ARGV[0];
binmode(INPUT);

my (@fhs);
while ( <INPUT> ) {
    my $key = substr($_, 3, $::N);
    if (not defined $fhs[$key]) {
        $fhs[$key] = "temp.$key";
        warn( "\rCreating file: $fhs[$key] ");
        open( $fhs[$key], ">$fhs[$key]")
            or die( "Could not create $fhs[$key]: $!");
        binmode($fhs[$key]);
    }
    print {$fhs[$key]} $_;
}

#! Get rid of unused filehandles or those that reference a zero length file
@fhs = grep{ $_ and ! -z $_ } @fhs;
close $_ for @fhs;
close INPUT;

warn "Split made to: ", scalar @fhs, " files\n";

#! Sort the split files on the first & second field
for my $fh (@fhs) {
    warn "$fh: reading;...";
    open $fh, "<$fh" or die $!;
    binmode($fh);
    my @recs = <$fh>;
    close $fh;

    warn " sorting: ", scalar @recs, " recs;...";
#   @recs = sort{ substr($a, 3, 16) cmp substr($b, 3, 16)
#              || substr($b, 20, 3) cmp substr($a, 20, 3) } @recs;
    @recs = map  { decode('cp1047', $_) }
            sort { key1 || key2 }
            map  { encode('cp1047', $_) } @recs;

    warn " writing;...";
    open $fh, ">$fh" or die $!;
    binmode($fh);
    print $fh @recs;
    close $fh;
    warn "done;\n";
}

warn "Merging files: ";
*SORTED = *STDOUT;
open SORTED, '>', $ARGV[1] and binmode(SORTED) or die $! if $ARGV[1];

for my $fh (reverse @fhs) {
    warn " $fh;";
    open $fh, "<$fh" and binmode($fh) or die $!;
    print SORTED <$fh>;
    close $fh;
}

warn "\nClosing sorted file: sorted\n";
close SORTED;

warn "Deleting temp files\n";
unlink $_ or warn "Couldn't unlink $_\n" for @fhs;

warn "Done.\n";
exit (0);
I no longer face the memory problem, but it outputs fewer records than the input and the sort does not seem to be working.

A couple of sample records are enclosed:

SK 1242 0180010100 AAR CPH AAR 0735 CPH 0810001 20070521200705211 SK 1242

SK 1242 0190010100 AAR CPH AAR 0735 CPH 0810001 2007052699999999 6 SK 1242

Re: Sort large files
by shmem (Chancellor) on Jun 06, 2007 at 13:28 UTC
    Hmm. You are slurping the entire file?
    my @infile = <>; # Reads file

    And you sort by only 17 and 2 bytes, respectively?

    ### Define the sort key here ### # Sorts in ascending order. sub key1 { ( substr( $a, 3, 17 )) cmp ( substr( $b, 3, 17 )); } # Sorts descending order. sub key2 { ( substr( $b, 20, 2 )) cmp ( substr( $a, 20, 2 )); } #

    doing two substr() calls for each comparison? Heck, that's inefficient. You could calculate the comparison data for each record once, use that as key and store the records in a DB_File of type DB_BTREE. Then you iterate over that DB_BTREE hash with each, output the value for each tuple - and voilà - you get your records in sorted order. Memory requirements should be minimal, and DB_File is pretty fast.
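    Untested sketch of that idea -- the packed key (EBCDIC-encoded primary key plus byte-inverted secondary key to get the descending part) is just one way to build the comparison data:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;
    use Encode qw(encode);

    # allow duplicate keys, since several records may share the same key
    $DB_BTREE->{flags} = R_DUP;

    tie my %btree, 'DB_File', 'sort.btree', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "tie failed: $!";

    while (my $rec = <>) {
        # compute the comparison data once per record:
        # 17 bytes ascending primary key, 2 bytes inverted (descending) secondary key
        my $key = encode('cp1047', substr($rec, 3, 17))
                . ("\xff\xff" ^ encode('cp1047', substr($rec, 20, 2)));
        $btree{$key} = $rec;    # with R_DUP this appends rather than overwrites
    }

    # each() walks the BTREE in key order, so the values come back sorted
    while (my ($key, $rec) = each %btree) {
        print $rec;
    }

    untie %btree;
    unlink 'sort.btree';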

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Sort large files
by Moron (Curate) on Jun 06, 2007 at 11:51 UTC
    Reducing the file size sounds dubious if the 400 MB is realistic for the requirement. If a relational database is available as an intermediate indexed repository, I'd use that plus DBI; if not, there is always the C-ISAM file as an intermediate indexed file for AIX. That has a maximum 8MTb indexed file size, can be written unsorted, and can be read back sorted by multiple keys with Perl using isam or cisam. Actually, I'd go the C-ISAM route anyway for 400 MB if possible, because it functions more quickly than most relational solutions.
    __________________________________________________________________________________

    ^M Free your mind!

      Thanks for the suggestions. I will try out the C-ISAM route, but for the time being I have reverted to AIX COBOL to do the sort. Actually, we are converting MVS COBOL to AIX and I thought I could use Perl.
Re: Sort large files
by zentara (Cardinal) on Jun 06, 2007 at 12:15 UTC
    To be honest, I'm not a sorting expert, but here is a script posted a while back that seems to address your problem. Your mention of losing records seems to be addressed by the chunk boundaries in the script. Not tested, and posted only as a long shot.
    #!/usr/bin/perl -w
    =head1 by david

    with 20million rows, you probably don't want to store everything
    in memory and then sort them. what you have to do is sort the data
    file segment by segment and then merge them back. merging is the
    real tricky business. the following script (which i did for someone
    a while ago) will do that for you. what it does is break the file
    into multiple chunks of 100000 lines, sort the chunks in a disk tmp
    file and then merge all the chunks back together. when i sort the
    file, i keep the smallest boundary of each chunk and use this number
    to sort the file so you don't have to compare all the tmp files.
    there is also a merge sort in the PPT Perl Power Tools on cpan

    =cut

    use strict;

    my @buffer  = ();
    my @tmps    = ();
    my %bounds  = ();
    my $counter = 0;

    open( FILE, "file.txt" ) || die $!;
    while (<FILE>) {
        push ( @buffer, $_ );
        if ( @buffer > 100000 ) {
            my $tmp = "tmp" . $counter++ . ".txt";
            push ( @tmps, $tmp );
            sort_it( \@buffer, $tmp );
            @buffer = ();
        }
    }
    close(FILE);

    merge_it( \%bounds );
    unlink(@tmps);

    #-- DONE --#

    sub sort_it {
        my $ref   = shift;
        my $tmp   = shift;
        my $first = 1;
        open( TMP, ">$tmp" ) || die $!;
        for (
            sort {
                my @fields1 = split ( /\s/, $a );
                my @fields2 = split ( /\s/, $b );
                $fields1[2] <=> $fields2[2]
            } @{$ref}
          )
        {
            if ($first) {
                $bounds{$tmp} = ( split (/\s/) )[2];
                $first = 0;
            }
            print TMP $_;
        }
        close(TMP);
    }

    sub merge_it {
        my $ref   = shift;
        my @files = sort { $ref->{$a} <=> $ref->{$b} } keys %{$ref};
        my $merged_to = $files[0];
        for ( my $i = 1 ; $i < @files ; $i++ ) {
            open( FIRST,  $merged_to )  || die $!;
            open( SECOND, $files[$i] ) || die $!;
            my $merged_tmp = "merged_tmp$i.txt";
            open( MERGED, ">$merged_tmp" ) || die $!;
            my $line1 = <FIRST>;
            my $line2 = <SECOND>;
            while (1) {
                if ( !defined($line1) && defined($line2) ) {
                    print MERGED $line2;
                    print MERGED while (<SECOND>);
                    last;
                }
                if ( !defined($line2) && defined($line1) ) {
                    print MERGED $line1;
                    print MERGED while (<FIRST>);
                    last;
                }
                last if ( !defined($line1) && !defined($line2) );
                my $value1 = ( split ( /\s/, $line1 ) )[2];
                my $value2 = ( split ( /\s/, $line2 ) )[2];
                if ( $value1 == $value2 ) {
                    print MERGED $line1;
                    print MERGED $line2;
                    $line1 = <FIRST>;
                    $line2 = <SECOND>;
                }
                elsif ( $value1 > $value2 ) {
                    while ( $value1 > $value2 ) {
                        print MERGED $line2;
                        $line2 = <SECOND>;
                        last unless ( defined $line2 );
                        $value2 = ( split ( /\s/, $line2 ) )[2];
                    }
                }
                else {
                    while ( $value1 < $value2 ) {
                        print MERGED $line1;
                        $line1 = <FIRST>;
                        last unless ( defined $line1 );
                        $value1 = ( split ( /\s/, $line1 ) )[2];
                    }
                }
            }
            close(FIRST);
            close(SECOND);
            close(MERGED);
            $merged_to = $merged_tmp;
        }
    }

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Sort large files
by salva (Canon) on Jun 06, 2007 at 12:28 UTC
      What I am not sure about is the maximum size. Of the namespace you refer to, Sort::Maker appeared to be the closest match to the OP's needs, but even that didn't make it clear enough to me what the file size limits are for the OP's particular key definitions. The OP is partly at fault there for not giving much information about the distribution of the data - two example records ain't much to go on.
      __________________________________________________________________________________

      ^M Free your mind!

        Sort::Maker's memory requirements are quite high, so I don't think it would be a good solution.

        Sort::External does the sorting on disk, so it is not limited by memory size. It is very easy to use for simple cases, but for complex cases a transformation similar to the GRT (Guttman-Rosler Transform) is required.
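        For example, something along these lines (untested; the mem_threshold value and the fixed 19-byte key prefix are just illustrative):

        use strict;
        use warnings;
        use Sort::External;
        use Encode qw(encode);

        my $sortex = Sort::External->new( mem_threshold => 64 * 1024 * 1024 );

        while (my $rec = <>) {
            # GRT: prepend a fixed-width, byte-sortable key, sort lexically,
            # then strip the key off afterwards
            my $key = encode('cp1047', substr($rec, 3, 17))                   # ascending
                    . ("\xff\xff" ^ encode('cp1047', substr($rec, 20, 2)));   # descending
            $sortex->feed($key . $rec);
        }

        $sortex->finish;
        while (defined(my $item = $sortex->fetch)) {
            print substr($item, 19);    # drop the 17 + 2 byte key prefix
        }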

        Another way is to use the external sort program, which uses on-disk sorting algorithms:

        # untested ...
        use Encode qw(encode decode);
        use MIME::Base64 qw(encode_base64);

        my $tempfn = "sort.temp"; # better use File::Temp!
        open my $tmp, ">", $tempfn or die "...";
        while (<>) {
            my $k0 = encode_base64(encode(cp1047 => substr($_, 3, 17)), "");
            my $k1 = encode_base64("\xff\xff" ^ encode(cp1047 => substr($_, 20, 2)), "");
            print $tmp join("\0", $k0, $k1, $_);
        }
        close $tmp or die "...";

        open my $sorted, "-|", sort => $tempfn or die "...";
        while (<$sorted>) {
            print((split /\x00/, $_, 3)[2]);
        }
Re: Sort large files
by andreas1234567 (Vicar) on Jun 06, 2007 at 13:42 UTC
    I suggest you try using a relational database to do the job for you. I'm not claiming that this will work for datasets as large as yours, but give it a try.

    MySQL sample code:
    CREATE DATABASE sorter;
    USE sorter;
    CREATE TABLE t619575 (
      c01 varchar(2),   -- SK
      c02 integer,      -- 1242
      c03 integer,      -- 0180010100
      c04 varchar(3),   -- AAR
      c05 varchar(3),   -- CPH
      c06 varchar(3),   -- AAR
      c07 integer,      -- 0735
      c08 varchar(3),   -- CPH
      c09 integer,      -- 0810001
      c10 bigint(25),   -- 20070521200705211
      c11 varchar(2),   -- SK
      c12 integer       -- 1242
    );
    -- Assuming tab separated columns
    LOAD DATA LOCAL INFILE '619575.csv' INTO TABLE t619575
      FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\';
    -- Add index on the fields to sort
    ALTER TABLE t619575 ADD INDEX idx_c02_c03_c04 (c02, c03, c04);
    -- Create sorted output
    SELECT *
    INTO OUTFILE '619575.sorted.csv'
    FROM t619575
    ORDER BY c02, c03, c04;
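    If you prefer to read the sorted rows back from Perl instead of writing them with INTO OUTFILE, a plain DBI loop works (untested; the connection parameters are placeholders):

    use strict;
    use warnings;
    use DBI;

    # placeholder credentials -- adjust to your setup
    my $dbh = DBI->connect('DBI:mysql:database=sorter;host=localhost',
                           'user', 'password', { RaiseError => 1 });

    my $sth = $dbh->prepare('SELECT * FROM t619575 ORDER BY c02, c03, c04');
    $sth->execute;

    while (my $row = $sth->fetchrow_arrayref) {
        print join("\t", @$row), "\n";
    }

    $dbh->disconnect;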
    Some may claim that MySQL isn't up to this kind of job. If it isn't, you might consider downloading the free (as in gratis) versions of Oracle or DB2.
    --
    print map{chr}unpack(q{A3}x24,q{074117115116032097110111116104101114032080101114108032104097099107101114})
Re: Sort large files (rep)
by tye (Sage) on Jun 06, 2007 at 14:56 UTC

    Did you read Re^2: EBCDIC sort (/bin/sort)? I posted it after you asked me to write up my suggestion from the chatterbox. If there was something about it that you had problems with, it would be polite to mention those.

    - tye        

Re: Sort large files
by swampyankee (Parson) on Jun 06, 2007 at 12:23 UTC

    You may be better off using an external sorting routine, such as syncsort, or possibly even sort.

    emc

    Any New York City or Connecticut area jobs? I'm currently unemployed.

    There are some enterprises in which a careful disorderliness is the true method.

    —Herman Melville
Re: Sort large files
by Anonymous Monk on Jun 06, 2007 at 13:14 UTC
    if you're using keys, you'll want an RDBMS :)