comment on

After posting my example of a Schwartzian Transform in Complex Sorting Optimization? for suggestions on optimization I made a few more discoveries.

I need to Thank tye for his sorting alternative, by building a single sort key and then sort of that, instead of or`ing my sort priority. I instantly notice 4 times the performance working on a 3MB file.

This enlightnment from tye made me question what other ways can I sort by and how can I sort faster, that is when I found A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler. This paper is incredible, if you have any questions about your own sorting techniques go read this paper now!!!

Taking my new sense of enlightnment I used tye code example to try and sort a very large log file. My test logfile was 100MB, my first problem was memory, loading the whole log into memory eventually sucked all my resources (382 MB of ram), since I was also building sort arrays and then an index array. So I made a very slight modification by making an array with references to the matching line not the line itself, I did this by getting the offset of the line. This saved me quite a bit of memory overhead at least enough to run the script without hitting virtual memory. Here is the new version of the code snippet:

my $infile = $ARGV[0];
my $outfile = $infile;
$outfile =~ s/.{3}$/out/i;

open(IN, $infile ) || die "can't open logfile: $infile\n";
open(OUT, ">$outfile" ) || die "can't create datfile: $outfile\n";

my ($offset, @sorton);
while (<IN>) {
    if(  m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/  ) {
        push @sorton, $1.pack("N",$2).$3;
        push @lines, $offset;
        1;
    } 
    $offset = tell(IN);
}

my @idx= sort { $sorton[$a] cmp $sorton[$b] } 0..$#sorton;

for( @lines[@idx] ) {
 seek IN, $_, 0;
 print OUT scalar(<IN>);
}
[download]

Yes I know I should be putting an exclusive lock on my file esp.since I am keying off the line offsets.

What I found from running this script on the large file was horrific. It took over 12 hours to sort this file! With a few very crued checks I found that the bottleneck was the line where I sort into @idx. All the steps above are completed within minutes, and the write step is done in less time then that. I guess my frusration is that this code sample was ported into C++, using a lot of the same techinques and the whole 100MB file ran in 2 minutes!!!! Thats a huge difference.

Does perl built in sort have that much overhead? Would I be better off not using the built in sort at all?

In reply to Slow at sorting? by orbital

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.