Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Last week I posted a couple of articles on the subject of "Sorting with Perl" mainly dealing with speed issues. Well, today I am here to share some of my newest findings.

Quick Overview

I was trying to sort a log file that was generated by a multi-threaded application, the killer was this file was 1,401,986 lines with 1,401,950 of those matching my criteria. When I sorted this file using perls sort it took a little over 12 hours to run!!

The Finding

With sorting there are many different ways to to reach the same result, however the time to reach there can be radically different. The key to sorting is knowing as much as possible about your data and adjusting your algorithims to this.

The Data

With my case I knew quite a bit about my data, I knew it was being generated by a multithreaded process that was reading many files at the same time. Thus my file order was completely random but the files were processed linear so there was still a certain amount of order saved. Since the threads reported back at different time you get something like this example:

CD1\01100809.pdf(1) - [Account Number] Indexed key "654546654". CD1\01100813.pdf(1) - [Account Number] Indexed key "654546654". CD1\01100809.pdf(1) - [Name] Indexed key "Bob".

The Result

First off let me just say I couldn't have reached this stage if it wasn't for some of the wonderful feedback from this site and from Mastering Algorithims in Perl by O'Reilly. This book breaks down the worse type and best type of data to give to each type of sort algorithim. In this book I found that a bubblesort is one of the worse methods to sort on with random data but for sorted or near sorted data it was the fastest. At first I though my data was near sorted data, so I went ahead and just used the smartbubble sort example from the book, to my suprise I ended up killing the process after 75 hours had elapsed!!! What I quickly realized was yes my data is sort but in a "random way" based of the insertion of the threads. So when I started to build my sort key I used the filename as a hash key instead of part of my sort, and built an array in each hash value. Now instead of having one huge array, I had one array for each file that was in near sorted order!!! Now all I had to do was a bubble on each array and I was done, with this method I was able to sort the file in under 3 minutes. This was even using references of the line offset to minimize my memory foot print (a savings of ~40% with about a 15% time gain ). My code is shown below as an example I have included the sort routine that I largely barrowed from the O'Reilly book. If anyone sees an issue with posting that snippet from the book as a copywrite violation please let me know, but O'Reilly's ftp site has it avaliable for anonymous access on their ftp server.

open(IN, $infile ) || die "can't open logfile: $infile\n"; open(OUT, ">$outfile" ) || die "can't create datfile: $outfile\n"; my ($offset, %sorton); while (<IN>) { if( m/^(.+?)\((\d+)\)\s-\s\[(.+?)\].+?"(.*?)"\.$/ ) { push @{$sorton{$1}}, [pack("N",$2).$3, $offset]; 1; } $offset = tell(IN); } for my $file ( sort keys %sorton ) { println( bubble($sorton{$file}) ); } sub bubble { my $array = shift; my $start = 0; my $i = $#$array; while(1) { my $new_start; my $new_end = 0; for ( my $j = $start || 1; $j <= $i; $j++ ) { if ( $array->[$j-1][0] gt $array->[$j][0] ) { @$array[$j, $j-1] = @$array[ $j-1, $j]; $new_end = $j-1; $new_start = $j-1 unless defined $new_start; } } last unless defined $new_start; $i = $new_end; $start = $new_start; } return $array; } sub println { my $array = shift; for my $line (@$array) { seek IN, $line->[1], 0; print OUT scalar(<IN>); } }

In reply to Speeding up large file sorts by orbital

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2023-12-04 17:07 GMT
Find Nodes?
    Voting Booth?
    What's your preferred 'use VERSION' for new CPAN modules in 2023?

    Results (25 votes). Check out past polls.