in reply to What is the most memory efficient way to (sort) and print a hash?

Maybe a merge-sort?
#!/usr/bin/perl -w
# by david
# With 20 million rows, you probably don't want to store everything in
# memory and then sort it. What you have to do is sort the data file
# segment by segment and then merge the segments back together. Merging
# is the really tricky business. The following script (which I did for
# someone a while ago) will do that for you. It breaks the file into
# chunks of 100000 lines, sorts each chunk into a tmp file on disk, and
# then merges all the chunks back together. When I sort a chunk, I keep
# its smallest key (the chunk's lower boundary) and merge the tmp files
# in order of that number.
# There is also a merge sort in the PPT (Perl Power Tools) on CPAN.
use strict;

my @buffer  = ();
my @tmps    = ();
my %bounds  = ();
my $counter = 0;

open( FILE, "file.txt" ) || die $!;
while (<FILE>) {
    push( @buffer, $_ );
    if ( @buffer >= 100000 ) {
        my $tmp = "tmp" . $counter++ . ".txt";
        push( @tmps, $tmp );
        sort_it( \@buffer, $tmp );
        @buffer = ();
    }
}

# The last chunk is usually shorter than 100000 lines; sort it as well,
# otherwise those rows never reach the output.
if (@buffer) {
    my $tmp = "tmp" . $counter++ . ".txt";
    push( @tmps, $tmp );
    sort_it( \@buffer, $tmp );
    @buffer = ();
}
close(FILE);

die "file.txt is empty\n" unless @tmps;

# merge_it returns the name of the file holding the fully sorted data.
# With a single chunk that is the chunk file itself, so keep it.
my $sorted = merge_it( \%bounds );
unlink( grep { $_ ne $sorted } @tmps );
print "sorted output is in $sorted\n";

#-- DONE --#

# Sort one chunk numerically on the third whitespace-separated field,
# write it to $tmp and remember the chunk's smallest key in %bounds.
sub sort_it {
    my $ref   = shift;
    my $tmp   = shift;
    my $first = 1;
    open( TMP, ">$tmp" ) || die $!;
    for (
        sort {
            my @fields1 = split( /\s/, $a );
            my @fields2 = split( /\s/, $b );
            $fields1[2] <=> $fields2[2]
        } @{$ref}
      )
    {
        if ($first) {
            $bounds{$tmp} = ( split(/\s/) )[2];
            $first = 0;
        }
        print TMP $_;
    }
    close(TMP);
}

# Merge the sorted chunk files two at a time, in order of their smallest
# key. Each pass merges the result so far with the next chunk file.
sub merge_it {
    my $ref       = shift;
    my @files     = sort { $ref->{$a} <=> $ref->{$b} } keys %{$ref};
    my $merged_to = $files[0];
    for ( my $i = 1 ; $i < @files ; $i++ ) {
        open( FIRST,  $merged_to )  || die $!;
        open( SECOND, $files[$i] ) || die $!;
        my $merged_tmp = "merged_tmp$i.txt";
        open( MERGED, ">$merged_tmp" ) || die $!;
        my $line1 = <FIRST>;
        my $line2 = <SECOND>;
        while (1) {

            # One input is exhausted: copy the rest of the other one.
            if ( !defined($line1) && defined($line2) ) {
                print MERGED $line2;
                print MERGED while (<SECOND>);
                last;
            }
            if ( !defined($line2) && defined($line1) ) {
                print MERGED $line1;
                print MERGED while (<FIRST>);
                last;
            }
            last if ( !defined($line1) && !defined($line2) );

            my $value1 = ( split( /\s/, $line1 ) )[2];
            my $value2 = ( split( /\s/, $line2 ) )[2];
            if ( $value1 == $value2 ) {
                print MERGED $line1;
                print MERGED $line2;
                $line1 = <FIRST>;
                $line2 = <SECOND>;
            }
            elsif ( $value1 > $value2 ) {

                # Emit lines from SECOND until it catches up with FIRST.
                while ( $value1 > $value2 ) {
                    print MERGED $line2;
                    $line2 = <SECOND>;
                    last unless ( defined $line2 );
                    $value2 = ( split( /\s/, $line2 ) )[2];
                }
            }
            else {

                # Emit lines from FIRST until it catches up with SECOND.
                while ( $value1 < $value2 ) {
                    print MERGED $line1;
                    $line1 = <FIRST>;
                    last unless ( defined $line1 );
                    $value1 = ( split( /\s/, $line1 ) )[2];
                }
            }
        }
        close(FIRST);
        close(SECOND);
        close(MERGED);

        # The previous intermediate merge file is no longer needed.
        unlink($merged_to) if $i > 1;
        $merged_to = $merged_tmp;
    }
    return $merged_to;
}
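The script above merges the chunk files two at a time, so the rows from the early chunks get rewritten on every pass. If the number of chunk files is small enough to keep them all open at once (about 200 for 20 million rows in chunks of 100000, which I am assuming your OS allows), a single-pass k-way merge might be worth a try. This is only a rough sketch and not part of the original script: key_of and merge_sorted_files are names I made up, and it assumes the same sort key as above (third whitespace-separated field, compared numerically).

```perl
use strict;
use warnings;

# Hypothetical helper: the sort key is the third whitespace-separated
# field, matching the ( split( /\s/, ... ) )[2] used in the script above.
sub key_of { return ( split /\s/, $_[0] )[2] }

# Hypothetical helper: merge any number of already-sorted files into
# $out_fh in a single pass.
sub merge_sorted_files {
    my ( $files, $out_fh ) = @_;

    # One entry per input file: [ current key, current line, filehandle ].
    my @heads;
    for my $file (@$files) {
        open( my $fh, '<', $file ) or die "$file: $!";
        my $line = <$fh>;
        push @heads, [ key_of($line), $line, $fh ] if defined $line;
    }

    while (@heads) {

        # Linear scan for the smallest current key. That is fine for a
        # few hundred files; a heap would scale better.
        my $min = 0;
        for my $i ( 1 .. $#heads ) {
            $min = $i if $heads[$i][0] < $heads[$min][0];
        }
        my $head = $heads[$min];
        print {$out_fh} $head->[1];

        # Advance the file we just printed from, or drop it at EOF.
        my $next = readline( $head->[2] );
        if ( defined $next ) {
            @$head[ 0, 1 ] = ( key_of($next), $next );
        }
        else {
            close( $head->[2] );
            splice( @heads, $min, 1 );
        }
    }
}
```

With the script above you would call it as merge_sorted_files( \@tmps, \*STDOUT ) once all the chunks are written, in place of merge_it. Every row is then read and written exactly once during the merge.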

I'm not really a human, but I play one on earth.
Old Perl Programmer Haiku ................... flash japh