Speeding up large file sorts
by orbital (Scribe) on Nov 30, 2001 at 06:13 UTC

orbital has asked for the wisdom of the Perl Monks concerning the following question:
Last week I posted a couple of articles on the subject of "Sorting with Perl", mainly dealing with speed issues. Well, today I am here to share some of my newest findings.
I was trying to sort a log file generated by a multi-threaded application; the killer was that this file was 1,401,986 lines, with 1,401,950 of those matching my criteria. When I sorted this file using Perl's built-in sort, it took a little over 12 hours to run!!
With sorting there are many different ways to reach the same result, but the time to get there can vary radically. The key to sorting is knowing as much as possible about your data and adjusting your algorithms accordingly.
In my case I knew quite a bit about my data: it was generated by a multithreaded process that was reading many files at the same time. The overall line order was therefore completely random, but each file was processed linearly, so a certain amount of order was still preserved. Since the threads reported back at different times, you get something like this example:
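For instance (a purely hypothetical interleaving; the file names and timestamps here are invented), the log might look like:

```
fileA.log  10:00:01  ...
fileB.log  10:00:01  ...
fileA.log  10:00:02  ...
fileC.log  10:00:01  ...
fileB.log  10:00:02  ...
fileA.log  10:00:03  ...
```

Notice that the lines belonging to any one file stay in order relative to each other, even though the files' lines are shuffled together unpredictably.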
First off, let me just say I couldn't have reached this stage if it wasn't for some of the wonderful feedback from this site and from Mastering Algorithms with Perl by O'Reilly. This book breaks down the worst and best types of data to give to each sorting algorithm. In it I found that a bubble sort is one of the worst methods for random data, but for sorted or nearly sorted data it is the fastest.

At first I thought my data was nearly sorted, so I went ahead and just used the smart bubble sort example from the book; to my surprise, I ended up killing the process after 75 hours had elapsed!!! What I quickly realized was that yes, my data is sorted, but in a "random way" based on the interleaving of the threads. So when I started to build my sort key, I used the filename as a hash key instead of as part of my sort, and built an array in each hash value. Now, instead of having one huge array, I had one array for each file, each in nearly sorted order!!! All I had to do was a bubble sort on each array and I was done; with this method I was able to sort the file in under 3 minutes. This was even while using references to the line offsets to minimize my memory footprint (a savings of ~40% with about a 15% time gain).

My code is shown below as an example; I have included the sort routine, which I largely borrowed from the O'Reilly book. If anyone sees an issue with posting that snippet from the book as a copyright violation, please let me know, but O'Reilly has it available for anonymous access on their FTP site.
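To make the approach concrete, here is a minimal sketch of the hash-of-arrays technique (this is not the original code; the line format, field positions, and sample data are all invented for illustration). Lines are grouped by their source filename, and each nearly sorted bucket is finished off with an early-exit bubble sort:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line format: "filename timestamp message".
# Lines from any one file arrive nearly in order, so a bubble
# sort per file terminates quickly via its early-exit check.
my @lines = (
    "b.log 2 beta-two",
    "a.log 1 alpha-one",
    "b.log 1 beta-one",
    "a.log 3 alpha-three",
    "a.log 2 alpha-two",
);

# Group the lines into one array per source file.
my %by_file;
for my $line (@lines) {
    my ($file) = split ' ', $line;
    push @{ $by_file{$file} }, $line;
}

# Bubble sort each (nearly sorted) bucket by its timestamp field.
for my $bucket (values %by_file) {
    my $n = @$bucket;
    for my $i (0 .. $n - 2) {
        my $swapped = 0;
        for my $j (0 .. $n - 2 - $i) {
            my $ta = (split ' ', $bucket->[$j])[1];
            my $tb = (split ' ', $bucket->[$j + 1])[1];
            if ($ta > $tb) {
                @$bucket[$j, $j + 1] = @$bucket[$j + 1, $j];
                $swapped = 1;
            }
        }
        last unless $swapped;   # no swaps: bucket is fully sorted
    }
}

# Emit each file's lines, now in order within each file.
for my $file (sort keys %by_file) {
    print "$_\n" for @{ $by_file{$file} };
}
```

Because each bucket is already nearly sorted, the bubble sort's early exit makes it close to linear per file, rather than quadratic over the whole 1.4-million-line array.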