Maybe a merge-sort?
#!/usr/bin/perl -w
# by david
# with 20 million rows, you probably don't want to store
# everything in memory and then sort it. what you have to do is sort the
# data file segment by segment and then merge them back. merging is the real
# tricky business. the following script (which i did for someone a while ago)
# will do that for you. what it does is break the file into multiple chunks
# of 100000 lines, sort each chunk into a tmp file on disk and then merge all the
# chunks back together. when i sort a chunk, i keep the smallest boundary
# of each chunk and use this number to order the tmp files so you don't have to
# compare all the tmp files.
# there is also a merge sort in the PPT Perl Power Tools on cpan

use strict;

my @buffer  = ();
my @tmps    = ();
my %bounds  = ();
my $counter = 0;

open( FILE, "file.txt" ) || die $!;
while (<FILE>) {
    push( @buffer, $_ );
    if ( @buffer > 100000 ) {
        my $tmp = "tmp" . $counter++ . ".txt";
        push( @tmps, $tmp );
        sort_it( \@buffer, $tmp );
        @buffer = ();
    }
}
close(FILE);

# flush the last, partially filled chunk (the original version silently
# dropped any lines left over after the final full chunk)
if (@buffer) {
    my $tmp = "tmp" . $counter++ . ".txt";
    push( @tmps, $tmp );
    sort_it( \@buffer, $tmp );
    @buffer = ();
}

merge_it( \%bounds );
unlink(@tmps);
# note: the intermediate merged_tmp*.txt files are left on disk;
# the final sorted result is in the last one written.

#-- DONE --#

# sort one in-memory chunk by its numeric third field and write it to a tmp
# file, remembering the chunk's smallest key in %bounds
sub sort_it {
    my $ref   = shift;
    my $tmp   = shift;
    my $first = 1;
    open( TMP, ">$tmp" ) || die $!;
    for (
        sort {
            my @fields1 = split( /\s/, $a );
            my @fields2 = split( /\s/, $b );
            $fields1[2] <=> $fields2[2]
        } @{$ref}
      )
    {
        if ($first) {
            $bounds{$tmp} = ( split(/\s/) )[2];
            $first = 0;
        }
        print TMP $_;
    }
    close(TMP);
}

# merge the sorted tmp files pairwise, in order of their smallest boundary key
sub merge_it {
    my $ref       = shift;
    my @files     = sort { $ref->{$a} <=> $ref->{$b} } keys %{$ref};
    my $merged_to = $files[0];
    for ( my $i = 1 ; $i < @files ; $i++ ) {
        open( FIRST,  $merged_to )  || die $!;    # was "dir $!", a typo
        open( SECOND, $files[$i] )  || die $!;    # was "dir $!", a typo
        my $merged_tmp = "merged_tmp$i.txt";
        open( MERGED, ">$merged_tmp" ) || die $!;
        my $line1 = <FIRST>;
        my $line2 = <SECOND>;
        while (1) {
            if ( !defined($line1) && defined($line2) ) {
                print MERGED $line2;
                print MERGED while (<SECOND>);
                last;
            }
            if ( !defined($line2) && defined($line1) ) {
                print MERGED $line1;
                print MERGED while (<FIRST>);
                last;
            }
            last if ( !defined($line1) && !defined($line2) );
            my $value1 = ( split( /\s/, $line1 ) )[2];
            my $value2 = ( split( /\s/, $line2 ) )[2];
            if ( $value1 == $value2 ) {
                print MERGED $line1;
                print MERGED $line2;
                $line1 = <FIRST>;
                $line2 = <SECOND>;
            }
            elsif ( $value1 > $value2 ) {
                while ( $value1 > $value2 ) {
                    print MERGED $line2;
                    $line2 = <SECOND>;
                    last unless ( defined $line2 );
                    $value2 = ( split( /\s/, $line2 ) )[2];
                }
            }
            else {
                while ( $value1 < $value2 ) {
                    print MERGED $line1;
                    $line1 = <FIRST>;
                    last unless ( defined $line1 );
                    $value1 = ( split( /\s/, $line1 ) )[2];
                }
            }
        }
        close(FIRST);
        close(SECOND);
        close(MERGED);
        $merged_to = $merged_tmp;
    }
}
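
One thing to note: chaining pairwise merges re-reads the growing merged file on every pass. A k-way merge that keeps one pending line per chunk file avoids that. Here's a minimal sketch under the same assumptions as the script above (sorted chunk files named tmp*.txt, records keyed by a numeric third whitespace-separated field); merged.txt is just a placeholder output name, not anything the original script produces.

#!/usr/bin/perl
# k-way merge sketch: hold the current line from each sorted chunk and
# repeatedly emit whichever line has the smallest numeric third field.
use strict;
use warnings;

my @files = glob("tmp*.txt");            # the sorted chunk files
my ( @fhs, @lines );

for my $f (@files) {
    open my $fh, '<', $f or die "$f: $!";
    push @fhs,   $fh;
    push @lines, scalar <$fh>;           # prime with each file's first line
}

open my $out, '>', 'merged.txt' or die $!;
while (1) {
    # find the chunk whose pending line has the smallest third field
    my $min;
    for my $i ( 0 .. $#lines ) {
        next unless defined $lines[$i];
        $min = $i
          if !defined($min)
          || ( split ' ', $lines[$i] )[2] < ( split ' ', $lines[$min] )[2];
    }
    last unless defined $min;            # every chunk is exhausted
    print $out $lines[$min];
    $lines[$min] = readline( $fhs[$min] );   # refill from that chunk
}
close $out;
close $_ for @fhs;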

I'm not really a human, but I play one on earth.
Old Perl Programmer Haiku ................... flash japh

In reply to Re: What is the most memory efficient way to (sort) and print a hash? by zentara
in thread What is the most memory efficient way to (sort) and print a hash? by a_salway
