in reply to Re^5: Sort big text file - byte offset - 50% there (Added code)
in thread Sort big text file - byte offset - 50% there

Here is the output of the script, including the warning. Good point, sorry.
Begin Index
Indexed 1167064 Lines in 00 Days, 00 Hours, 00 Minutes, 20 Seconds
Begin Sort
Sorted 1167064 Lines in 00 Days, 00 Hours, 00 Minutes, 00 Seconds
1132616958:393264867 at D:/Logs/biglog.unsorted.log line 32, <BIGLOG> line 1168107.
Total runtime:          00 Days, 00 Hours, 00 Minutes, 47 Seconds

Re^7: Sort big text file - byte offset - 50% there (Added code)
by BrowserUk (Patriarch) on Aug 14, 2006 at 20:22 UTC

    I think the problem is in your indexing loop (and goes right back to your OP).

    my @index;
    while (<BIGLOG>){
        my $offset = tell BIGLOG;   ### This offset is the start of the *next* line!
        my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
            ? 0
            : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
        push @index, pack 'NN', $epoch, $offset;
    }

    You read a line, record the file position, and then pair that position with the epoch derived from the line you just read. But that offset is the start of the next line, not the one you just read, so every offset in the index is displaced by one line. When you eventually seek to the last offset (which is the end of the file), there is nothing left to read, so the read fails.
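    A minimal sketch of the displacement, using perl 5.8's in-memory filehandles: after the first readline, tell already reports where the second line starts.

    open my $fh, '<', \"line one\nline two\n" or die "open: $!";
    my $first = <$fh>;          # reads "line one\n" -- 9 bytes
    print tell( $fh ), "\n";    # prints 9, the offset of "line two"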

    You need to recast that loop something like this:

    my( $offset, @index ) = 0;   ## The first line's offset is zero
    while (<BIGLOG>){
        my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
            ? 0
            : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
        push @index, pack 'NN', $epoch, $offset;   ## Pair with the previous offset
        $offset = tell BIGLOG;                     ## and now get the start of the next line
    }

    You should probably also be checking the return value from seek.
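    Something along these lines would do (a sketch, using the same BIGLOG handle and $offset as in your code):

    seek( BIGLOG, $offset, 0 )
        or die "Cannot seek to offset $offset: $!";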


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you very much. I was so focused on getting the pack and sort working correctly that I completely overlooked that loop. For anyone who might be interested in doing something similar in the future, here is a stripped down version that works.
      Once again, thanks everyone.
      #!/usr/bin/perl -w
      use strict;
      use Date::Calc qw(Mktime);

      open (BIGLOG, "< D:/Logs/biglog.unsorted.log")
          || die "Cannot open unsorted log $!";

      my( $offset, @index ) = 0;
      while (<BIGLOG>){
          my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
              ? 0
              : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
          push @index, pack 'NN', $epoch, $offset;
          $offset = tell BIGLOG;
      }

      @index = sort { $a cmp $b } @index;

      open (OUTFILE, "> D:/Logs/biglog.sorted.log")
          || die "Cannot write sorted log $!";

      while (@index){
          print OUTFILE readline_n( \*BIGLOG, shift @index );
      }

      close BIGLOG;
      close OUTFILE;
      exit;

      sub readline_n{
          my( $fh, $line ) = @_;
          seek( $fh, unpack( 'N', substr( $line, 4, 4 ) ), 0 )
              || warn "Problem seeking to $line $!\n";
          scalar <$fh>
      }
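      As an aside on the design, a small sketch (not from the thread) of why the plain sort { $a cmp $b } is enough here: 'N' packs big-endian, so comparing the packed strings byte by byte gives the same order as comparing the epoch first and the offset second numerically.

      my @recs = map { pack 'NN', @$_ } [ 20, 5 ], [ 10, 7 ], [ 10, 3 ];
      print join( ' ', unpack 'NN', $_ ), "\n"
          for sort { $a cmp $b } @recs;   # prints 10 3, then 10 7, then 20 5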
        Ok, I hit a snag. I am now sorting a big 4.42 GB (4,757,131,264 byte), 13,961,993-line file, and the sorted file that gets written is always smaller than the original. This only happens on these huge files. My thinking is that the offsets are growing larger than what pack 'N' can hold.

        push @index, pack 'NN', $epoch, $offset;
        Can someone confirm this and/or let me know if there is another way? Thanks

        <edit> Ok, I just figured out that my guess was correct.

        #!/usr/bin/perl -w
        use strict;

        my $num = 4757131264;
        print "NUM: " . $num . "\n";

        my $pnum = pack 'N', $num;
        print "NUM: " . unpack( 'N', $pnum ) . "\n";
        How else can I pack the data and still keep the fast sort? My system does not support 64-bit integers.
        </edit>
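        One way that might work, assuming your perl was built with large-file support (tell is evidently already handing back offsets past 4 GB as floating-point values, and a double holds integers exactly up to 2**53): split the offset into two 32-bit words and pack three 'N's per record. The cmp sort then orders by epoch first, then the high word, then the low word. A sketch only, not tested against the big file -- $hi, $lo and the sample numbers are illustrative:

        #!/usr/bin/perl -w
        use strict;

        # A hypothetical offset past the 4 GB mark (larger than 2**32).
        my $offset = 4_757_131_264;

        # Split into high and low 32-bit words using plain double arithmetic.
        my $hi = int( $offset / 4294967296 );
        my $lo = $offset - $hi * 4294967296;

        # Pack epoch, high word, low word -- 12 bytes, still sortable with cmp.
        my $rec = pack 'NNN', 1132616958, $hi, $lo;

        # Rebuild the offset from the record, as readline_n would have to
        # (unpack bytes 4..11 instead of the single 'N' at position 4).
        my( $h, $l ) = unpack 'NN', substr( $rec, 4, 8 );
        print $h * 4294967296 + $l, "\n";   # prints 4757131264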