in reply to Best way to search file

If your process is that slow, it is quite likely because you are scanning the full content of file2 for each line of file1.

If this is the case, then you will find that storing file2 in a hash before starting to process file1 will speed the process up enormously. And the larger file2 is, the greater the gain.
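A minimal sketch of that idea, using in-memory "files" via scalar filehandles so it is self-contained (substitute real `open` calls for actual files; the sample data is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file2_data = "apple\nbanana\ncherry\n";
my $file1_data = "banana\ngrape\napple\n";

# Pass 1: load every line of file2 into a hash (a one-time cost).
my %seen;
open my $f2, '<', \$file2_data or die "file2: $!";
while (my $line = <$f2>) {
    chomp $line;
    $seen{$line} = 1;
}
close $f2;

# Pass 2: each file1 line is now an O(1) hash lookup,
# instead of a full rescan of file2.
open my $f1, '<', \$file1_data or die "file1: $!";
while (my $line = <$f1>) {
    chomp $line;
    print "$line found in file2\n" if exists $seen{$line};
}
close $f1;
```

This turns an O(n*m) double scan into O(n+m) work, at the cost of holding file2's lines in memory.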

As mentioned by sundialsvc4, the only limit to that is that if file2 is so big that the hash will take all the memory, then the hash is no longer a solution. (It depends on your system, but with today's typical RAM, my experience is that the limit could be somewhere between 5 and 15 million lines for file2.)

In that case, I would really recommend sorting the files and then reading both of them sequentially in parallel. In my experience with huge files, this is far faster than using a database. The only downside of this approach is that the algorithm for reading two files in parallel can be a bit tricky, with quite a few edge cases to take care of.
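The parallel-read (sorted-merge) idea can be sketched as follows; both inputs are assumed to be already sorted, and scalar filehandles stand in for real files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $a_data = "alpha\ndelta\nzulu\n";
my $b_data = "beta\ndelta\nzulu\n";

open my $fa, '<', \$a_data or die $!;
open my $fb, '<', \$b_data or die $!;

# Read one line from each side, then repeatedly advance
# whichever side compares lower; equal lines are in both files.
my @common;
my $a = <$fa>; chomp $a if defined $a;
my $b = <$fb>; chomp $b if defined $b;
while (defined $a && defined $b) {
    my $cmp = $a cmp $b;
    if ($cmp < 0) {
        $a = <$fa>; chomp $a if defined $a;
    }
    elsif ($cmp > 0) {
        $b = <$fb>; chomp $b if defined $b;
    }
    else {
        push @common, $a;
        $a = <$fa>; chomp $a if defined $a;
        $b = <$fb>; chomp $b if defined $b;
    }
}
print "$_\n" for @common;   # prints: delta, zulu
```

The edge cases the text warns about are visible here: either side may run out first, and every branch must remember to re-read and re-chomp.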

Je suis Charlie.

Replies are listed 'Best First'.
Re^2: Best way to search file
by insta.gator (Novice) on Apr 15, 2015 at 18:59 UTC

    Thanks for the response. I have yet to come across a file2 that is greater than 100,000 lines. So I think that I will look at loading file2 into a hash and searching that. Now I just have to figure out how to do that... :-)

    Thanks for the help. Depending on my results, I may reach out for more assistance.

      Now I just have to figure out how to do that... :-)
      choroba gave you the basic idea of the hash solution in this post: Re: Best way to search file in answer to your OP. But feel free to come back if you encounter implementation problems.

      Je suis Charlie.

      Feel free to reach out, but I doubt that you will have any trouble with it once you’ve studied the previous example. (If you do, don’t waste your own time: ask.)

      Also: when you load the data into your hash, you should not take for granted that there are no errors in your input file. As you load the hash, I would recommend that you test whether the key already exists() in the hash, and die() if it does. “Trust, but verify.”
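      A sketch of that "trust, but verify" load, wrapped in a small helper (the helper name and the sample keys/values are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %hash;

# Refuse to silently overwrite: die on a duplicate key.
sub load_pair {
    my ($key, $value) = @_;
    die "duplicate key '$key' in input file\n" if exists $hash{$key};
    $hash{$key} = $value;
}

load_pair('111-22-3333', 'A001');
load_pair('444-55-6666', 'A002');
# load_pair('111-22-3333', 'A003');   # would die: duplicate key
```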

      The data volumes that you indicate certainly seem to be appropriate for the use of a hash, and that’s the way I would pursue it.

        Thanks Sundial. A couple of questions.

        I was able to get the hash created and working properly. Now I need to take care of some details. Depending on the type of file that I am using, the SSN may or may not have hyphens in it. How would you strip the hyphens while loading the hash? This is what I have now:

        while (<$HRDATA>) {
            my ($ssn, $aoid) = (split /","/)[4, 2];
            $ssnhash{$ssn} = $aoid;
        }

        Basic, I am sure, but I am just learning.
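        One way to strip the hyphens inside a loop like the one above is `tr/-//d` on the SSN field. A self-contained sketch, with made-up sample data in the same quote-comma layout standing in for the real $HRDATA file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $hr_data = qq{"x","h1","A001","z","123-45-6789","end"\n}
            . qq{"x","h2","A002","z","987654321","end"\n};
open my $HRDATA, '<', \$hr_data or die $!;

my %ssnhash;
while (<$HRDATA>) {
    my ($ssn, $aoid) = (split /","/)[4, 2];
    $ssn =~ tr/-//d;          # delete every hyphen, if any are present
    $ssnhash{$ssn} = $aoid;
}
```

After this runs, both the hyphenated and the plain SSN end up stored under a uniform nine-digit key.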

        Secondly, again depending on file type, the SSN may be in field 2 or field 4 of file2. The one file type where the SSN is in field 2 has a file header at the top. The only way that I can see to programmatically tell which is which is to query the first line of the file. Once I know that, I can tweak my code to load the SSN into the hash from the proper field. Does that make sense? Any thoughts on a better way?
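        The first-line check described above could be sketched like this. The header text ("SSN"), the field layout, and the plain-comma splitting are assumptions for illustration; substitute whatever actually appears in the real files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $with_header = "Name,SSN,ID\njdoe,111-22-3333,A001\n";
my $no_header   = "jroe,B002,zz,444-55-6666\n";

my @ssns;
for my $src (\$with_header, \$no_header) {
    open my $fh, '<', $src or die $!;
    my $first = <$fh>;
    my $idx;
    if ($first =~ /\bSSN\b/) {
        $idx = 1;             # header file type: SSN is field 2
    }
    else {
        $idx = 3;             # no header: SSN is field 4
        seek $fh, 0, 0;       # rewind so the first line is processed as data
    }
    while (my $line = <$fh>) {
        chomp $line;
        my @f = split /,/, $line;     # simplified splitting for this sketch
        (my $ssn = $f[$idx]) =~ tr/-//d;
        push @ssns, $ssn;
    }
}
```

Note the rewind in the no-header branch: the peeked first line is real data there and must not be skipped.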

        Thanks!!