in reply to Re^7: Sort big text file - byte offset - 50% there (Added code)
in thread Sort big text file - byte offset - 50% there

Thank you very much. I was so focused on getting the pack and sort working correctly that I completely overlooked that loop. For anyone who might be interested in doing something similar in the future, here is a stripped down version that works.
Once again, thanks everyone.
#!/usr/bin/perl -w
use strict;
use Date::Calc qw(Mktime);

open (BIGLOG, "< D:/Logs/biglog.unsorted.log") || die "Cannot open unsorted log $!";

# Build one 8-byte record per line: epoch and byte offset, both packed as 'N'.
my( $offset, @index ) = 0;
while (<BIGLOG>){
    # Comments, blank lines and lines that do not start with a digit get epoch 0.
    my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
        ? 0
        : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
    push @index, pack 'NN', $epoch, $offset;
    $offset = tell BIGLOG;
}

# Big-endian packed records sort correctly as plain strings.
@index = sort {$a cmp $b} @index;

open (OUTFILE, "> D:/Logs/biglog.sorted.log") || die "Cannot write sorted log $!";
while (@index){
    print OUTFILE readline_n(\*BIGLOG, shift @index);
}
close BIGLOG;
close OUTFILE;
exit;

# Seek to the offset stored in the last 4 bytes of the record and read that line.
sub readline_n{
    my( $fh, $line) = @_;
    seek ($fh, unpack( 'N',substr( $line, 4, 4 )), 0) || warn "Problem seeking to $line $!\n";
    scalar <$fh>
}

Re^9: Sort big text file - byte offset
by msalerno (Beadle) on Aug 31, 2006 at 21:05 UTC
    Ok, I hit a snag. I am sorting a big file: 4.42 GB (4,757,131,264 bytes), 13,961,993 lines. The sorted file that gets written is always smaller than the original, and this only happens on these huge files. My thinking is that the offset grows into a number larger than what pack 'N' can deal with.

    push @index, pack 'NN', $epoch, $offset;
    Can someone confirm this and/or let me know if there is another way? Thanks

    <edit> Ok, I just figured out that my guess was correct.

    #!/usr/bin/perl -w
    use strict;
    my $num = 4757131264;
    print "NUM: ".$num."\n";
    my $pnum = pack 'N', $num;
    print "NUM: ".unpack( 'N',$pnum)."\n";
    How else can I pack the data and maintain the fast sort? My system does not support 64-bit.
    </edit>

      Use 'd'. It will handle integers up to 2**53 without loss of accuracy, which means you're set for files up to ~8,000,000 GB. That should give you room for a little future growth. Be aware that your index will require more memory.
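A minimal sketch of what that change might look like for one index record, assuming the layout from the script above (epoch first, then offset). The helper names make_record and record_offset are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not part of the original script.
# 'N' = 32-bit unsigned big-endian epoch, 'd' = native double for the offset.
sub make_record   { my( $epoch, $offset ) = @_; return pack 'Nd', $epoch, $offset }
sub record_offset { return ( unpack 'Nd', $_[0] )[1] }

my $offset = 4757131264;    # beyond the 32-bit limit of 4294967295
my $rec    = make_record( 1157058300, $offset );

# Each record grows from 8 bytes ('NN') to 12 bytes ('Nd').
printf "record length: %d bytes\n", length $rec;
printf "offset back:   %.0f\n",     record_offset( $rec );
```

The extra 4 bytes per record are where the additional index memory goes.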


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      It was pointed out that whilst using 'd' to pack your offsets will work, when it comes to doing the sorting you would have to unpack the values and do a numeric comparison, as sorting numeric data in its pack'd binary form doesn't work for packed floats/doubles.
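As a rough illustration of that point, the sketch below packs a few made-up offsets with native 'd' and sorts them both ways. The sample values are assumptions. Whether the plain string sort happens to come out right depends on the machine's byte order; on little-endian hardware such as x86 it generally does not.

```perl
use strict;
use warnings;

my @offsets = ( 3, 1, 4757131264, 2, 70000 );    # made-up sample values
my @packed  = map { pack 'd', $_ } @offsets;

# Plain string sort of the packed bytes: the result depends on byte order
# and is typically wrong on little-endian machines.
my @by_bytes = map { unpack 'd', $_ } sort @packed;

# Unpack inside the comparator and compare numerically: correct everywhere,
# but it costs two unpack calls per comparison.
my @by_value = map { unpack 'd', $_ }
               sort { unpack( 'd', $a ) <=> unpack( 'd', $b ) } @packed;

print "string sort:  @by_bytes\n";
print "numeric sort: @by_value\n";
```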

      Having previously extolled the virtues of sorting numeric data in its binary form, that doesn't sit right, so here are a couple of routines that will pack and unpack a FP value < 2**53 to an 8-byte binary form that is sortable, along with some rudimentary tests:

      #! perl -slw
      use strict;
      use Math::Random::MT qw[ rand ];

      sub ftob64 {
          return pack 'NN', int( $_[ 0 ] / 2**32 ), int( $_[ 0 ] % 2**32 );
      }

      sub b64tof {
          my( $hi, $lo ) = unpack 'NN', $_[ 0 ];
          return $hi * 2**32 + $lo;
      }

      for ( 1 .. 1e6 ) {
          my $test  = int( rand 2**53 );
          my $b64   = ftob64 $test;
          my $float = b64tof $b64;
          if( abs( $test - $float ) > 1e-15 ) {
              printf "%31.f v %31.f => diff %31.31f\n",
                  $test, $float, abs( $test - $float );
          }
      }

      my @randomBin = map{ ftob64 int rand 2**53 } 1 .. 1e6;
      my @sortedBin = sort @randomBin;
      my @sortedN   = map{ b64tof $_ } @sortedBin;

      $sortedN[ $_ ] > $sortedN[ $_ + 1 ]
          and die "Error: $_ : $sortedN[ $_ ] > $sortedN[ $_ + 1 ]"
          for 0 .. $#sortedN - 1;
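One way these routines might slot into the index from the original script is sketched below. The 12-byte record layout (4-byte epoch followed by the 8-byte offset) and the sample epoch/offset pairs are assumptions made for illustration. The two routines are repeated so the snippet runs on its own.

```perl
use strict;
use warnings;

# Same split as the routines above: high and low 32-bit halves, big-endian.
sub ftob64 { return pack 'NN', int( $_[0] / 2**32 ), int( $_[0] % 2**32 ) }
sub b64tof { my( $hi, $lo ) = unpack 'NN', $_[0]; return $hi * 2**32 + $lo }

# Made-up [ epoch, offset ] pairs; two of the offsets exceed 2**32.
my @lines = ( [ 1157058300, 4757131264 ],
              [ 1157058300, 120        ],
              [ 1157000000, 4294967400 ] );

# Assumed record layout: 4-byte epoch, then the 8-byte sortable offset.
my @index;
push @index, pack( 'N', $_->[0] ) . ftob64( $_->[1] ) for @lines;

# Plain string sort still works: every field is big-endian.
@index = sort @index;

my @sorted;
for my $rec ( @index ) {
    my $epoch  = unpack 'N', $rec;
    my $offset = b64tof( substr $rec, 4, 8 );
    push @sorted, sprintf '%d %.0f', $epoch, $offset;
}
print "$_\n" for @sorted;
```

With this layout, readline_n would presumably pass b64tof( substr $line, 4, 8 ) to seek instead of unpacking a single 'N'. Seeking past 2 GB also requires a perl built with large file support.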
