Sort big text file - byte offset

msalerno has asked for the wisdom of the Perl Monks concerning the following question:

I have been working on sorting a large text file by a datestamp within the file. My objective is to get it to run with a minimal memory footprint. The $offsets var holds alternating byte offsets and epoch timestamps. I now need to sort them by timestamp. I know that my data structure is not well suited to this, but because I want to keep memory utilization low, it's the best option I have. Once the timestamps are in order, I want to print out a sorted log based on the byte offset and timestamp relation. I am hoping that someone can offer some perls of wisdom. Thanks

#!/usr/bin/perl -w
use strict;
use Date::Calc qw(Mktime Today_and_Now Delta_DHMS);

my @starttime = Today_and_Now;

open (BIGLOG, "< E:/biglog.log") || die "Cannot open log\n";

my $offsets = pack 'VV', 0, 0;    # leading (offset, epoch) pair; 'V' to match the packs below

while (<BIGLOG>){
    $offsets .= pack 'V', tell BIGLOG;

    if ($_ =~ /^\s*#/ || $_ =~ /^\s\n/ || $_ !~ /^\s*\d/ ){
        $offsets .= pack 'V', 0;
    }
    else{
        $offsets .= pack 'V', Mktime(unpack("A4xA2xA2xA2xA2xA2", $_));
    }
    
}

my @endindex = Today_and_Now;
print "\nIndexed ". (length($offsets)/8 -1)." Lines in ";
printf("%02d Days, %02d Hours, %02d Minutes, %02d Seconds", Delta_DHMS(@starttime, @endindex));

# EPOCH = ($_ - 1) * 8 + 4;
# BYTE = ($_ - 1) * 8;

##
#
# Sort
#
##

my @endtime = Today_and_Now;

print "\nTotal runtime:\t\t";
printf("%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS(@starttime, @endtime));

close BIGLOG;
exit;

sub readline_n{ 
    my( $fh, $line) = @_; 
    seek $fh, unpack( 'V', substr( $offsets, ($line - 1) * 8, 4 )), 0;
    scalar <$fh> 
}

Re: Sort big text file - byte offset - 50% there (Added code) by BrowserUk (Patriarch) on Aug 11, 2006 at 20:02 UTC
To use Perl's built-in sort, you have to supply an array for sorting. As you are storing your data in a single string, you would (at minimum) need to supply an array of integers, `1 .. NO_OF_RECORDS`, from which you could use your `EPOCH` and `BYTE` offset formulas to substr the string for the comparisons. However, given the small size of your data elements (8 bytes) and the size of an SV, even an integer one, there would be no memory saving in this.

So, rather than concatenating all your epoch/offset pairs into a single string, you'd do better to simply build a normal array of the pack'd pairs. The memory requirement for this will be less than that of an array of integers PLUS the big string.

If you reverse the ordering of the pairs so that the binary-encoded epoch comes first and the offset second, then a simple (alphanumeric) sort applied to the array will sort the data by epoch, with the offset acting as a tiebreak for equal datetimes. As the default alpha sort is the fastest built-in sort, and using the array avoids repeatedly substr'ing the data, it will also be much faster.

Update: Added modified code.

Update2: Switched 'V's to 'N's commensurate with Tye's comment below.

This should be equivalent to your original code but using an array rather than the big string. As you can see, the sort becomes simplicity itself.

    #!/usr/bin/perl -w
    use strict;
    use Date::Calc qw(Mktime Today_and_Now Delta_DHMS);

    my @starttime = Today_and_Now;

    open (BIGLOG, "< E:/biglog.log") || die "Cannot open log\n";

    # One packed (epoch, offset) record per line, epoch first
    my @index;
    while (<BIGLOG>){
        my $offset = tell BIGLOG;
        my $epoch = ( /^\s#/ or /^\s\n/ or $_ !~ /^\s*\d/ )
            ? 0
            : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_ );
        push @index, pack 'NN', $epoch, $offset;
    }

    my @endindex = Today_and_Now;
    print "\nIndexed ". @index ." Lines in ";
    printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds", Delta_DHMS( @starttime, @endindex );

    # Default string sort orders the records by epoch, then by offset
    @index = sort @index;

    my @endtime = Today_and_Now;
    print "\nTotal runtime:\t\t";
    printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", Delta_DHMS( @starttime, @endtime );

    close BIGLOG;
    exit;

    sub readline_n{
        my( $fh, $line) = @_;
        # $line is a packed record: skip the epoch, seek to the offset
        seek $fh, unpack( 'x[N]N', $line ), 0;
        scalar <$fh>
    }

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
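For readers who want to see the whole round trip (index, sort, print the log in order), here is a minimal sketch of the array-of-packed-pairs idea described above. It is not the code from the thread. Several things are assumptions made for illustration: the sub names `build_index` and `sorted_lines` are hypothetical, the core module Time::Local (`timegm`, so timestamps are treated as UTC) stands in for Date::Calc's `Mktime`, the timestamp layout is taken to be `YYYY-MM-DD HH:MM:SS` at the start of the line, and an in-memory filehandle replaces the real log. The sketch records each offset before the line is read, so that seeking to it lands on the start of that line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);    # assumption: core stand-in for Date::Calc's Mktime

# Build one packed (epoch, offset) record per line. The offset is taken
# before the line is read, so it marks the start of that line.
sub build_index {
    my ($fh) = @_;
    my @index;
    my $offset = tell $fh;
    while ( defined( my $line = <$fh> ) ) {
        my $epoch = 0;    # comments, blank and undated lines get epoch 0
        # Assumed layout: YYYY?MM?DD?HH?MM?SS with single-character separators
        if ( my ( $y, $mo, $d, $h, $mi, $s ) =
            $line =~ /^(\d{4})\D(\d\d)\D(\d\d)\D(\d\d)\D(\d\d)\D(\d\d)/ )
        {
            $epoch = timegm( $s, $mi, $h, $d, $mo - 1, $y );
        }
        # 'N' is a 32-bit big-endian unsigned integer, so this sketch
        # only suits epochs and file offsets below 4 GiB.
        push @index, pack( 'NN', $epoch, $offset );
        $offset = tell $fh;
    }
    return \@index;
}

# Return the lines of the file in (epoch, offset) order.
sub sorted_lines {
    my ( $fh, $index ) = @_;
    my @out;
    for my $rec ( sort @$index ) {    # plain string sort on the packed records
        my $offset = unpack 'x[N]N', $rec;    # skip the epoch, read the offset
        seek $fh, $offset, 0;
        push @out, scalar <$fh>;
    }
    return @out;
}

# Hypothetical sample log, held in memory so the sketch is self-contained
my $log = join '',
    "2006-08-11 20:02:00 second\n",
    "# a comment\n",
    "2006-08-11 19:00:00 first\n",
    "2006-08-14 15:13:00 third\n";

open my $fh, '<', \$log or die "Cannot open in-memory log: $!";
my $index = build_index($fh);
print sorted_lines( $fh, $index );
close $fh;
```

With a real log you would open the file instead of the scalar reference. Only the 8-byte records stay in memory; each line is read again from disk when it is printed.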
Re^2: Sort big text file - byte offset - 50% there (Added code) by msalerno (Beadle) on Aug 11, 2006 at 21:31 UTC
Thanks, just as I was writing in an array, I noticed your post. The only thing I am stumped on at this point is the sort. The sort should be on the unpacked epoch value. Something like:

    @index = sort unpacked @index;

    sub unpacked {
        $a = unpack( 'V', substr( $a, 0, 4 ));
        $b = unpack( 'V', substr( $b, 0, 4 ));
        $a <=> $b;
    };

The only problem is that the data in @index after the sort is all messed up. It contains numbers that have nothing to do with the epoch or offset.
Re^3: Sort big text file - byte offset - 50% there (Added code) by BrowserUk (Patriarch) on Aug 11, 2006 at 21:55 UTC
"The sort should be on the unpacked epoch value."

No. A big advantage is that your packed binary epoch dates should sort perfectly well without being unpacked, provided that you use an alpha sort (e.g. cmp) and not a numeric one (<=>). And they will sort faster. This is the basis of the Guttman-Rosler Transform (GRT) sort.

To convince you of this, look at the binary representation of the following "epochs". ~~Remembering that I am running on a little-endian machine so the byte ordering is reversed~~, each (numerically) bigger number is represented by an alphanumerically larger string when packed.

Update: Tye's right, you need 'N' not 'V'.

    [0] Perl> print unpack 'H*', pack 'N', 0+"1e$_" for 0 .. 10;;
    00000001
    0000000a
    00000064
    000003e8
    00002710
    000186a0
    000f4240
    00989680
    05f5e100
    3b9aca00
    ffffffff

So, using the default sort on packed integers works fine, provided that you pack them in big-endian ('N') order regardless of your platform's endianness. The bonus is that this is the fastest sort and, by appending the offsets, any equal epochs will be sorted into file order. Try it.
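The 'N' versus 'V' point can be checked directly. The following is a small sketch, not code from the thread: the helper `sort_packed` and the sample values are made up for illustration. It packs the same numbers with each template, sorts the packed strings with the default string sort, and unpacks the result.

```perl
use strict;
use warnings;

# Hypothetical sample values; the last one is an epoch from August 2006
my @epochs = ( 1, 256, 65_536, 255, 1_155_326_520 );

# Pack each number with the given template, string-sort the packed
# values, then unpack them again so the resulting order can be read.
sub sort_packed {
    my ( $template, @numbers ) = @_;
    my @packed = map { pack( $template, $_ ) } @numbers;
    my @sorted = sort @packed;    # default (string) comparison
    return map { unpack( $template, $_ ) } @sorted;
}

my @numeric = sort { $a <=> $b } @epochs;
my @via_N   = sort_packed( 'N', @epochs );    # big-endian
my @via_V   = sort_packed( 'V', @epochs );    # little-endian

print "numeric: @numeric\n";
print "N:       @via_N\n";
print "V:       @via_V\n";
```

Big-endian packing puts the most significant byte first, so comparing the strings byte by byte is the same as comparing the numbers. Little-endian packing puts the least significant byte first, so the string order no longer follows the numeric order.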
Re^4: Sort big text file - byte offset - 50% there (Added code) by msalerno (Beadle) on Aug 14, 2006 at 15:13 UTC
Re^5: Sort big text file - byte offset - 50% there (Added code) by BrowserUk (Patriarch) on Aug 14, 2006 at 15:32 UTC
Re^3: Sort big text file - byte offset - 50% there (Added code) by msalerno (Beadle) on Aug 11, 2006 at 21:41 UTC
Sure, just after I post the message I realize what I did. It should have been:

    sub unpacked {
        my($a_num) = unpack( 'V', substr( $a, 0, 4 ));
        my($b_num) = unpack( 'V', substr( $b, 0, 4 ));
        return $a_num <=> $b_num;
    };
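Inside a sort comparator, `$a` and `$b` refer to the elements being compared, and perldoc's entry for sort says they should not be modified. Copying the unpacked values into lexicals, as in the correction above, leaves the records alone. Here is a minimal sketch of that pattern. The sample records and the name `by_epoch` are hypothetical, and 'N' is used for packing so the result can be set beside the plain string sort.

```perl
use strict;
use warnings;

# Hypothetical sample records: packed (epoch, offset) pairs, big-endian
my @index = map { pack( 'NN', @$_ ) }
    ( [ 300, 0 ], [ 100, 40 ], [ 200, 80 ], [ 100, 20 ] );

# Comparator that copies the unpacked epochs into lexicals, so the
# packed records behind $a and $b are never assigned to.
sub by_epoch {
    my $a_num = unpack 'N', $a;    # first four bytes: the epoch
    my $b_num = unpack 'N', $b;
    return $a_num <=> $b_num;
}

my @by_compare = sort by_epoch @index;    # numeric compare on unpacked epochs
my @by_string  = sort @index;             # plain string sort on packed records

printf "%d %d\n", unpack( 'NN', $_ ) for @by_string;
```

Both sorts put the epochs in ascending order. The plain string sort also orders records with equal epochs by offset, because the offset bytes follow the epoch bytes in each record; the comparator version only looks at the epoch.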
Re^4: Sort big text file - byte offset - 50% there ("V"/"N") by tye (Sage) on Aug 11, 2006 at 21:56 UTC