in reply to quicker way to merge files?
Just moving the open of REFILE to the top of the code will save one very expensive file-system operation for every line in DATA. If one of the files is small enough to fit into memory, something as simple as my @lines = <DATA2>; will produce significant CPU savings, because scanning an array in memory is MUCH faster than repeatedly re-reading the file from disk.
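A minimal sketch of the slurp idea (the file contents here are made up, and I'm using an in-memory filehandle so the snippet runs on its own -- with your real data you'd open the actual DATA2 file once, before the main loop):

```perl
use strict;
use warnings;

# Stand-in for the real DATA2 file (hypothetical contents).
my $data2_contents = "a 1\nb 2\nc 3\n";
open my $data2, '<', \$data2_contents or die "open: $!";

# One read pass pulls the whole file into memory...
my @lines = <$data2>;
close $data2;
chomp @lines;

# ...and every later scan of @lines is a pure in-memory
# operation, far cheaper than re-reading the disk file.
print scalar(@lines), " lines slurped\n";
```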
There is some speculation in this next suggestion, as I have no idea of the size of the files, but the CPU savings will be enormous if it works out. It appears that, for each line in DATA, you are checking whether the DATA2 file contains a line whose two key fields match the line currently under inspection in DATA; if so, an output line is generated.
BTW, I quite frankly found this blizzard of $var9,$var13 type stuff to be very confusing. Better variable names would help immensely!
Anyway, if you read DATA2 first and create a %data2 hash with keys like $data2{"$var10;$var11"} = 1;, then as you read DATA you check for the existence of $data2{"$var3;$var4"}, and if it exists you print $var1 $var2 $var4. I think that would work. The size of %data2 could get huge; hundreds of thousands of keys aren't out of the question.
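Here's a runnable sketch of that two-pass approach. The field positions and file contents are invented (I'm splitting on whitespace where your code uses regexes, and feeding the "files" from in-memory strings so the example is self-contained); adjust the indices to match your real layout:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for DATA2 and DATA.
my $data2_contents = "k1 k2 x\naa bb y\n";
my $data_contents  = "out1 out2 k1 k2\nfoo bar no match\n";

# Pass 1: build a lookup hash keyed on the two join fields of DATA2.
my %data2;
open my $d2, '<', \$data2_contents or die "open: $!";
while (<$d2>) {
    my ($var10, $var11) = (split)[0, 1];
    $data2{"$var10;$var11"} = 1;
}
close $d2;

# Pass 2: one sweep over DATA; each hash lookup is O(1),
# so no re-reading of DATA2 is ever needed.
my @out;
open my $d, '<', \$data_contents or die "open: $!";
while (<$d>) {
    my ($var1, $var2, $var3, $var4) = (split)[0 .. 3];
    push @out, "$var1 $var2 $var4" if exists $data2{"$var3;$var4"};
}
close $d;

print "$_\n" for @out;
```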
Pitfalls: %data2 may just be too big to fit into memory. If so, things get more complex if you want this to run really fast - but it's still possible. It also appears that some of what you have as \S+ in the regex are really numbers, and there can be a "mismatch" when dealing with leading zeroes. In Perl everything is a string until it is used in a numeric context. One trick to delete leading zeroes is to just add 0 to the number: $var += 0; Now when you use $var as part of a hash key, it won't have any leading zeroes. That's important if one file had "033" and the other "00033".
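A quick demonstration of the leading-zero trap and the += 0 fix (the values are made up):

```perl
use strict;
use warnings;

# Leading zeroes make string keys disagree even when the numbers match.
my $x = "033";
my $y = "00033";
print $x eq $y ? "same\n" : "different\n";   # different

# Forcing numeric context strips the leading zeroes; when the
# variables are later stringified into hash keys, both are "33".
$x += 0;
$y += 0;
print $x eq $y ? "same\n" : "different\n";   # same
```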
In the best case, you read each line of data once and parse it once. Building even what might seem to be a "huge" hash table is not nearly as expensive as re-reading a file over and over again. If you are dealing with files of just a few hundred MB, execution time measured in seconds is not an unreasonable expectation.