
No joke about that sort! It flies. I am only having one problem with the script: for some reason, it throws a warning while printing the sorted log. According to the output of warn, it's always on the last line of the unsorted log.

One more question, about vec. I have not been able to find any good guides on using it. If I used it, would I be able to reduce my memory footprint? Does anyone have a good link with an explanation of the function? Something like perlpacktut but for vec? Thanks again. Here is the updated script with the new sort and the old warning.

#!/usr/bin/perl -w
use strict;
use Date::Calc qw(Mktime Today_and_Now Delta_DHMS);

my @starttime = Today_and_Now;
print "Begin Index\n";
open (BIGLOG, "< D:/Logs/biglog.unsorted.log") || die "Cannot open log\n";
my @index;
while (<BIGLOG>){
    my $offset = tell BIGLOG;
    my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
        ? 0
        : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
    push @index, pack 'NN', $epoch, $offset;
}
print "\nIndexed ". @index ." Lines in ";
printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS( @starttime, Today_and_Now );

print "Begin Sort\n";
my @startsort = Today_and_Now;
@index = sort {$a cmp $b} @index;
print "Sorted ". @index ." Lines in ";
printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS( @startsort, Today_and_Now );

open (OUTFILE, "> D:/Logs/biglog.sorted.log") || die;
foreach (@index){
    my $byte = unpack( 'N',substr( $_, 4, 4 ));
    print OUTFILE readline_n(\*BIGLOG, $byte )
        || warn unpack( 'N', substr( $_, 0, 4 )).":".unpack( 'N',substr( $_, 4, 4 )).$!;
}
close OUTFILE;
close BIGLOG;
print "\nTotal runtime:\t\t";
printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS( @starttime, Today_and_Now);
exit;

sub readline_n{
    my( $fh, $line) = @_;
    seek $fh, $line, 0;
    scalar <$fh>
}

Re^5: Sort big text file - byte offset - 50% there (Added code)
by BrowserUk (Patriarch) on Aug 14, 2006 at 15:32 UTC
    For some reason, it throws a warning while printing the sorted log. According to the output of warn, it's always on the last line of the unsorted log.

    It's generally a good idea to include a cut&paste of the warning message!

    I don't see any potential for using vec to save memory here.
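    For anyone looking for a quick feel for vec, here is a minimal sketch (the sample numbers are arbitrary and not taken from any real log). It only shows that a 32-bit vec element occupies the same four big-endian bytes as pack 'N', so the 8-byte records built with pack 'NN' are already as compact as vec would make them. It does not measure Perl's per-scalar overhead, so treat it as an illustration rather than a memory benchmark.

```perl
use strict;
use warnings;

# vec( $string, $index, $bits ) reads or writes element $index of a string
# treated as a packed array of $bits-wide unsigned integers.
my $vector = '';
vec( $vector, 0, 32 ) = 1132616958;    # arbitrary sample value (an epoch)
vec( $vector, 1, 32 ) = 393264867;     # arbitrary sample value (a byte offset)

# The same two numbers stored the way the script's index stores them.
my $record = pack 'NN', 1132616958, 393264867;

printf "vec bytes: %d, pack bytes: %d, identical: %s\n",
    length $vector, length $record, ( $vector eq $record ? 'yes' : 'no' );
```

    Where vec tends to pay off is with fields narrower than a byte (1, 2 or 4 bits), which this index does not have.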


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Good point, sorry. Here is the output of the script, including the warning:
      Begin Index
      Indexed 1167064 Lines in 00 Days, 00 Hours, 00 Minutes, 20 Seconds
      Begin Sort
      Sorted 1167064 Lines in 00 Days, 00 Hours, 00 Minutes, 00 Seconds
      1132616958:393264867 at D:/Logs/biglog.unsorted.log line 32, <BIGLOG> line 1168107.
      Total runtime:          00 Days, 00 Hours, 00 Minutes, 47 Seconds
      

        I think the problem is in your indexing loop (and goes right back to your OP).

        my @index;
        while (<BIGLOG>){
            my $offset = tell BIGLOG;    ### This offset is the start of the *next* line!
            my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
                ? 0
                : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
            push @index, pack 'NN', $epoch, $offset;
        }

        You read a line, record the file position, and then pair that file position with the epoch info from the line you read. But that offset is the start of the next line, not the one you just read. The result is that all the offsets are displaced by one line, so when you seek to the last offset (which is end of file) and try to read, there is nothing left to read and it fails.

        You need to recast that loop something like this:

        my( $offset, @index ) = 0;    ## The first line's offset is zero
        while (<BIGLOG>){
            my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
                ? 0
                : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
            push @index, pack 'NN', $epoch, $offset;    ## Pair with previous offset
            $offset = tell BIGLOG;                      ## and now get the start of the next line
        }
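        To check the recast loop without the real log, here is a small self-contained sketch. It makes some assumptions that are not in your script: the three-line log lives in an in-memory filehandle, and a leading integer stands in for the epoch that Mktime would produce. The point is only that each stored offset now seeks back to the start of its own line.

```perl
use strict;
use warnings;

# Hypothetical three-line log held in memory; the leading number stands in
# for the epoch that the real script derives with Mktime.
my $log = "30 third\n10 first\n20 second\n";
open my $fh, '<', \$log or die "Cannot open in-memory log: $!";

my( $offset, @index ) = 0;    # the first line starts at byte 0
while ( my $line = <$fh> ) {
    my( $key ) = $line =~ /^(\d+)/;
    push @index, pack 'NN', $key, $offset;    # offset of the line just read
    $offset = tell $fh;                       # start of the next line
}

# Sort the packed records, then fetch each line by its stored offset.
my @sorted;
for my $rec ( sort { $a cmp $b } @index ) {
    my( undef, $pos ) = unpack 'NN', $rec;
    seek $fh, $pos, 0 or die "seek to $pos failed: $!";
    push @sorted, scalar <$fh>;
}
close $fh;

print @sorted;
```

        With the original loop, the record for the last line would carry the end-of-file offset and its read would come back empty, which matches the warning you are seeing.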

        You probably should be checking the return code from seek also.
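        A minimal sketch of what that check could look like, as a variant of your readline_n. The in-memory filehandle and sample data are assumptions for illustration only. Note that seeking to the end of the file succeeds; it is the following read that returns undef, so the two failure cases are reported separately here.

```perl
use strict;
use warnings;

# Variant of readline_n that dies on a failed seek instead of ignoring it.
sub readline_n {
    my( $fh, $pos ) = @_;
    seek $fh, $pos, 0
        or die "seek to byte $pos failed: $!";
    my $line = <$fh>;
    return $line;    # undef if there was nothing left to read
}

# Hypothetical sample data: "alpha\n" is 6 bytes, so "beta\n" starts at byte 6.
my $data = "alpha\nbeta\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my $second   = readline_n( $fh, 6 );
my $past_end = readline_n( $fh, length $data );

print "second line: $second";
print "read at end of file returned ", ( defined $past_end ? 'a line' : 'undef' ), "\n";
```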

