
Perl's built-in sort requires that you supply a list to sort. As you are storing your data in a single string, you would (at minimum) need to supply an array of integers, 1 .. NO_OF_RECORDS, and then use your EPOCH and BYTE offset formulas to substr the string for the comparisons.
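For illustration, here is a minimal sketch of that index-array approach. The record layout (8 bytes per record, pack 'NN', $offset, $epoch), the sample values, and the names $index, $RECSIZE and @order are assumptions made up for the example, not taken from your code, and it numbers records from 0 rather than 1.

```perl
use strict;
use warnings;

# Assumed layout (hypothetical): each record is 8 bytes, pack 'NN', $offset, $epoch.
my $RECSIZE = 8;
my $index = join '', map { pack 'NN', @$_ } ( [ 0, 300 ], [ 40, 100 ], [ 95, 200 ] );

my $records = length( $index ) / $RECSIZE;

# Sort the record numbers, not the records themselves.
# Every comparison has to substr the epoch field (bytes 4..7) out of the big string.
my @order = sort {
    substr( $index, $a * $RECSIZE + 4, 4 ) cmp substr( $index, $b * $RECSIZE + 4, 4 )
} 0 .. $records - 1;

for my $n ( @order ) {
    my( $offset, $epoch ) = unpack 'NN', substr( $index, $n * $RECSIZE, $RECSIZE );
    print "$epoch\t$offset\n";
}
```

Note that the big string and the array of record numbers both have to be held in memory, and each comparison costs two substr calls.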

However, given the small size of your data elements (8 bytes) and the size of an SV*, even an integer one, there would be no memory saving inherent in this. So, rather than concatenating all your epoch/offset pairs into a single string, you'd be better off simply building a normal array of the pack'd pairs. The memory requirement for this will be less than that of an array of integers PLUS the big string.

If you reverse the ordering of the pairings so that the binary encoded epoch comes first and the offset second, then a simple string sort applied to the array will sort the data by epoch, with the offset acting as a tiebreaker for equal datetimes. This relies on both values being packed big-endian ('N'), so that byte order matches numeric order.
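A small sketch of why this works, using made-up ( epoch, offset ) pairs; the names @pairs, @sorted and @wrong are purely illustrative. It also packs the same pairs with the little-endian 'V' template to show that a string sort does not give numeric order in that case.

```perl
use strict;
use warnings;

# Made-up ( epoch, offset ) pairs; two share an epoch to show the tiebreak.
my @pairs = ( [ 1000, 512 ], [ 256, 90 ], [ 1000, 7 ], [ 255, 300 ] );

# 'N' is a big-endian unsigned 32-bit integer, so the most significant byte
# comes first and a plain string sort agrees with numeric order.
my @sorted = map { [ unpack 'NN', $_ ] } sort map { pack 'NN', @$_ } @pairs;
print join( ' ', map { "$_->[0]/$_->[1]" } @sorted ), "\n";

# 'V' is little-endian: the least significant byte comes first, so the same
# string sort compares the wrong end of each number.
my @wrong = map { [ unpack 'VV', $_ ] } sort map { pack 'VV', @$_ } @pairs;
print join( ' ', map { "$_->[0]/$_->[1]" } @wrong ), "\n";
```

The first line printed is ordered by epoch and then by offset; the second is not.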

As the default string sort (no comparison block) is the fastest built-in sort, and using the array avoids repeatedly substr'ing the data, it will also be much faster.

Update: Added modified code.

Update2: Switched 'V's to 'N's commensurate with Tye's comment below.

This should be equivalent to your original code, but using an array rather than the big string. As you can see, the sort becomes simplicity itself.

#!/usr/bin/perl -w
use strict;
use Date::Calc qw(Mktime Today_and_Now Delta_DHMS);

my @starttime = Today_and_Now;

open (BIGLOG, "< E:/biglog.log") || die "Cannot open log: $!\n";

my @index;

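my $offset = 0;    # byte offset of the start of the line about to be read
while (<BIGLOG>){
    # Comments, blank lines and lines not starting with a digit get epoch 0
    my $epoch = ( /^\s*#/ or /^\s*$/ or $_ !~ /^\s*\d/ ) ? 0 
        : Mktime( unpack 'A4xA2xA2xA2xA2xA2', $_  );
    push @index, pack 'NN', $epoch, $offset;
    $offset = tell BIGLOG;    # tell after the read is the start of the NEXT line
}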

my @endindex = Today_and_Now;
print "\nIndexed ". @index ." Lines in ";
printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds", 
    Delta_DHMS( @starttime, @endindex );

@index = sort @index;

my @endtime = Today_and_Now;

print "\nTotal runtime:\t\t";
printf "%02d Days, %02d Hours, %02d Minutes, %02d Seconds\n", 
    Delta_DHMS( @starttime, @endtime );

close BIGLOG;
exit;

sub readline_n{ 
    my( $fh, $line) = @_; 
    seek $fh, unpack( 'x[N]N', $line ), 0;
    scalar <$fh> 
}
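As a rough usage sketch, this is how the sorted index and readline_n could be combined to read the lines back in epoch order. It is self-contained, so it uses an in-memory filehandle and assumes (purely for the example) that each line starts with an integer epoch, rather than going through Date::Calc; $log, $rec and @lines are made-up names.

```perl
use strict;
use warnings;

# Hypothetical log held in memory; each line starts with an integer epoch.
my $log = "300 third\n100 first\n200 second\n";
open my $fh, '<', \$log or die "Cannot open in-memory log: $!\n";

my @index;
my $offset = 0;                      # start of the line about to be read
while ( <$fh> ) {
    my( $epoch ) = /^(\d+)/;
    push @index, pack 'NN', $epoch, $offset;
    $offset = tell $fh;              # start of the next line
}

# Same helper as in the post: skip the epoch, seek to the stored offset,
# then read one line from there.
sub readline_n {
    my( $fh, $rec ) = @_;
    seek $fh, unpack( 'x[N]N', $rec ), 0;
    scalar <$fh>;
}

# Walk the sorted index and fetch each line by its byte offset.
my @lines = map { readline_n( $fh, $_ ) } sort @index;
print @lines;
```

Only the 8-byte index entries are sorted and held in memory; the log lines themselves are fetched on demand.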

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Sort big text file - byte offset - 50% there (Added code) by BrowserUk
in thread Sort big text file - byte offset - 50% there by msalerno
