
Thanks. Just as I was writing it into an array, I noticed your post. The only thing I am stumped on at this point is the sort. The sort should be on the unpacked epoch value. Something like:
@index = sort unpacked @index;
sub unpacked {
    $a = unpack( 'V', substr( $a, 0, 4 ));
    $b = unpack( 'V', substr( $b, 0, 4 ));
    $a <=> $b;
};
The only problem is that the data in @index after the sort is all messed up. It contains numbers that have nothing to do with the epoch or offset.

Re^3: Sort big text file - byte offset - 50% there (Added code)
by BrowserUk (Patriarch) on Aug 11, 2006 at 21:55 UTC
    The sort should be on the unpacked epoch value.

    No. A big advantage is that your packed binary epoch dates should sort perfectly well without being unpacked, provided that you use a string comparison (e.g. cmp) and not a numeric one (<=>). And they will sort faster. This is the basis of the Guttman-Rosler Transform (GRT) sort.

    To convince you of this, look at the binary representation of the following "epochs". Note that I am running on a little-endian machine, where the native byte order is reversed; packed big-endian with 'N', each (numerically) bigger number is represented by a stringwise larger string.

    Update: Tye's right, you need 'N' not 'V'

    [0] Perl> print unpack 'H*', pack 'N', 0+"1e$_" for 0 .. 10;;
    00000001
    0000000a
    00000064
    000003e8
    00002710
    000186a0
    000f4240
    00989680
    05f5e100
    3b9aca00
    ffffffff

    So, using the default sort on packed integers works fine provided that you pack them in big-endian (network) order with 'N', whatever your platform's endianness. The bonus is that this is the fastest sort, and by appending the offsets, any equal epochs will be sorted into file order. Try it.
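A minimal sketch of that claim, assuming a handful of made-up (epoch, offset) pairs rather than real log data: each pair is packed with 'NN', the strings are sorted with cmp, and nothing is unpacked until the result is printed. Note that 'N' holds only 32 bits, so offsets beyond 4 GiB would not fit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical (epoch, offset) pairs in file order; two lines share an epoch.
my @records = (
    [ 1155333300, 0   ],
    [ 1155333100, 120 ],
    [ 1155333300, 240 ],
    [ 1155333000, 360 ],
);

# One 8-byte string per line: big-endian epoch followed by big-endian offset.
my @index = map { pack( 'NN', @$_ ) } @records;

# Plain string sort; nothing is unpacked during the comparisons.
my @sorted = sort { $a cmp $b } @index;

# Unpack only afterwards, to read the result as "epoch:offset".
my @result = map { join( ':', unpack( 'NN', $_ ) ) } @sorted;
print "$_\n" for @result;
```

The two entries with epoch 1155333300 come out in offset order (0 before 240), which is the "file order" tie-break described above.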


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      No joke about that sort! It flies. I am only having one problem with the script: it throws a warning while printing the sorted log. According to the output of warn, it's always on the last line of the unsorted log.

      One more question, it's about vec. I have not been able to find any good guides on using it. If I implemented it, would I be able to reduce my memory footprint? Does anyone have a good link with an explanation of the function? Something like perlpacktut but for vec? Thanks again. Here is the updated script with the new sort and old warning.

      #!/usr/bin/perl -w
      use strict;
      use Date::Calc qw(Mktime Today_and_Now Delta_DHMS);

      my @starttime = Today_and_Now;
      print "Begin Index\n";
      open (BIGLOG, "< D:/Logs/biglog.unsorted.log") || die "Cannot open log\n";
      my @index;
      while (<BIGLOG>){
          my $offset = tell BIGLOG;
          my $epoch = ( /^\s*#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
              ? 0
              : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
          push @index, pack 'NN', $epoch, $offset;
      }
      print "\nIndexed ". @index ." Lines in ";
      printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS( @starttime, Today_and_Now );

      print "Begin Sort\n";
      my @startsort = Today_and_Now;
      @index = sort {$a cmp $b} @index;
      print "Sorted ". @index ." Lines in ";
      printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS( @startsort, Today_and_Now );

      open (OUTFILE, "> D:/Logs/biglog.sorted.log") || die;
      foreach (@index){
          my $byte = unpack( 'N',substr( $_, 4, 4 ));
          print OUTFILE readline_n(\*BIGLOG, $byte )
              || warn unpack( 'N', substr( $_, 0, 4 )).":".unpack( 'N',substr( $_, 4, 4 )).$!;
      }
      close OUTFILE;
      close BIGLOG;
      print "\nTotal runtime:\t\t";
      printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS( @starttime, Today_and_Now);
      exit;

      sub readline_n{
          my( $fh, $line) = @_;
          seek $fh, $line, 0;
          scalar <$fh>
      }
        For some reason, it throws a warning while printing the sorted log. According to the output of warn, it's always on the last line of the unsorted log.

        It's generally a good idea to include a cut&paste of the warning message!

        I don't see any potential for using vec to save memory here.
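A guess at the cause, going only by the posted script: tell is called after the line has already been read, so it returns the offset where the next line begins. If that is what happens, every index entry points one line too far, and the entry for the last line points at end-of-file, where the read returns undef and the `|| warn` fires. The sketch below uses a made-up three-line log held in memory, and the helper name line_at is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-line log held in memory so the sketch is self-contained.
my $log = "line one\nline two\nline three\n";
open my $fh, '<', \$log or die "Cannot open in-memory log: $!";

my ( @after, @before );
my $start = 0;                       # offset where the next line begins
while ( my $line = <$fh> ) {
    push @after,  tell($fh);         # as in the posted script: offset of the NEXT line
    push @before, $start;            # offset where THIS line begins
    $start = tell($fh);
}

# Seek to an offset and read one line, as readline_n does in the script.
sub line_at {
    my ( $handle, $offset ) = @_;
    seek $handle, $offset, 0;
    return scalar <$handle>;
}

my $last_after  = line_at( $fh, $after[-1] );     # undef: the offset is end-of-file
my $last_before = line_at( $fh, $before[-1] );    # "line three\n"

print defined $last_after ? $last_after : "(undef at EOF)\n";
print $last_before;
```

Recording the offset before each read, as @before does here, keeps every index entry pointing at the start of its own line.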


Re^3: Sort big text file - byte offset - 50% there (Added code)
by msalerno (Beadle) on Aug 11, 2006 at 21:41 UTC
    Sure, just after I post the message I realize what I did. It should have been:
    sub unpacked {
        my($a_num) = unpack( 'V', substr( $a, 0, 4 ));
        my($b_num) = unpack( 'V', substr( $b, 0, 4 ));
        return $a_num <=> $b_num;
    };

      Don't use "V" to pack, use "N", or else your values won't sort correctly.
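A small sketch of why the format matters once the packed strings are compared with cmp, using made-up values chosen so that the low-order byte disagrees with numeric order. The helper name sort_packed is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample values: 256 is 0x00000100, 255 is 0x000000ff.
my @epochs = ( 256, 1, 255, 65536, 2 );

# Pack each value, sort the packed strings with cmp, then unpack again.
sub sort_packed {
    my ( $format, @values ) = @_;
    return map  { unpack( $format, $_ ) }
           sort { $a cmp $b }
           map  { pack( $format, $_ ) } @values;
}

my @via_N = sort_packed( 'N', @epochs );    # big-endian (network order)
my @via_V = sort_packed( 'V', @epochs );    # little-endian

print "N: @via_N\n";    # N: 1 2 255 256 65536
print "V: @via_V\n";    # V: 65536 256 1 2 255
```

With 'N' the most significant byte comes first, so string order matches numeric order; with 'V' the least significant byte is compared first and the result is scrambled.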

      - tye