Speeding up large file sorts

orbital has asked for the wisdom of the Perl Monks concerning the following question:

Last week I posted a couple of articles on the subject of "Sorting with Perl" mainly dealing with speed issues. Well, today I am here to share some of my newest findings.

Quick Overview

I was trying to sort a log file that was generated by a multi-threaded application, the killer was this file was 1,401,986 lines with 1,401,950 of those matching my criteria. When I sorted this file using perls sort it took a little over 12 hours to run!!

The Finding

With sorting there are many different ways to to reach the same result, however the time to reach there can be radically different. The key to sorting is knowing as much as possible about your data and adjusting your algorithims to this.

The Data

With my case I knew quite a bit about my data, I knew it was being generated by a multithreaded process that was reading many files at the same time. Thus my file order was completely random but the files were processed linear so there was still a certain amount of order saved. Since the threads reported back at different time you get something like this example:

CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654".
CD1\01100813.pdf(1) - [Account Number] Indexed key "654546654".
CD1\01100809.pdf(1) - [Name] Indexed key "Bob".
[download]

The Result

First off let me just say I couldn't have reached this stage if it wasn't for some of the wonderful feedback from this site and from Mastering Algorithims in Perl by O'Reilly. This book breaks down the worse type and best type of data to give to each type of sort algorithim. In this book I found that a bubblesort is one of the worse methods to sort on with random data but for sorted or near sorted data it was the fastest. At first I though my data was near sorted data, so I went ahead and just used the smartbubble sort example from the book, to my suprise I ended up killing the process after 75 hours had elapsed!!! What I quickly realized was yes my data is sort but in a "random way" based of the insertion of the threads. So when I started to build my sort key I used the filename as a hash key instead of part of my sort, and built an array in each hash value. Now instead of having one huge array, I had one array for each file that was in near sorted order!!! Now all I had to do was a bubble on each array and I was done, with this method I was able to sort the file in under 3 minutes. This was even using references of the line offset to minimize my memory foot print (a savings of ~40% with about a 15% time gain ). My code is shown below as an example I have included the sort routine that I largely barrowed from the O'Reilly book. If anyone sees an issue with posting that snippet from the book as a copywrite violation please let me know, but O'Reilly's ftp site has it avaliable for anonymous access on their ftp server.

open(IN, $infile ) || die "can't open logfile: $infile\n";
open(OUT, ">$outfile" ) || die "can't create datfile: $outfile\n";

my ($offset, %sorton);

while (<IN>) {
    if(  m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/  ) {
        push @{$sorton{$1}}, [pack("N",$2).$3, $offset];
        1;
    } 
    $offset = tell(IN);
}

for my $file ( sort keys %sorton )
   {
    println( bubble($sorton{$file}) );
   }

   
sub bubble {
    my $array = shift;
    my $start = 0;
    
    my $i = $#$array;
    
    while(1) {
             my $new_start;
             my $new_end = 0;
             
             for ( my $j = $start || 1; $j <= $i; $j++ ) {
                 if ( $array->[$j-1][0] gt $array->[$j][0] ) {
                     @$array[$j, $j-1] = @$array[ $j-1, $j];
                    $new_end = $j-1;
                    $new_start = $j-1 unless defined $new_start;
                }
            }
            last unless defined $new_start;
            $i = $new_end;
            $start = $new_start;
        }
    return $array;
}

sub println
   {
    my $array = shift;
    for my $line (@$array)
        {
          seek IN, $line->[1], 0;
          print OUT scalar(<IN>);
        }
   }
[download]

Comment on Speeding up large file sorts Select or Download Code


"be consistent"
	PerlMonks