technoz has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I am trying to write a Perl script that takes a list of strings from a file as input, searches for each string in all the files in a directory, and prints the name of each file that contains that string. The list normally contains about 300 strings and the directory around 4000 files, so the script takes a long time to run. Is there any way to index the directory and search it through Perl so that the search becomes fast?

Replies are listed 'Best First'.
Re: Searching multiple expressions in multiple Files
by almut (Canon) on Mar 31, 2009 at 11:30 UTC
    The list contains normally 300 strings...

    Are you using Perl 5.10? With its new trie optimisation for alternations it should be relatively fast...

    Indexing everything first would only have benefits if you're searching repeatedly.  In case you're searching just once, any potential gains in speed by using an index would likely be more than outweighed by the time it takes to create the index...

Re: Searching multiple expressions in multiple Files
by ELISHEVA (Prior) on Mar 31, 2009 at 11:37 UTC

    Please post some code. It is very hard to know what might be your problem without seeing how you are solving the problem.

    Opening and closing files is very slow - one mistake people often make is to grab a string, read all 4000 files, grab another string, read all 4000 files, and so on. (1,200,000 file openings!). You can usually avoid this by reading through each file only once and saving the data you need in a hash.

    How you go about reading in and saving the data depends a lot on the nature of the strings you are searching for. Are your strings whole words or sequences that can be found in the middle of words and/or spread across several words?

    Best, beth

      To add on to Beth's idea: you'll need to open the 4000 files at some point, so it may be easiest to read in the 100-300 strings, cache them in a structure (they use far less memory than the files), and then iterate over the 4000 files, once each, looking for all ~300 strings.
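
      The one-pass approach Beth and the reply above describe might be sketched roughly like this (the demo files, directory, and search strings are invented for illustration; normally the strings would be read from the list file):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use File::Temp qw(tempdir);

          # -- demo setup: two throwaway files standing in for the 4000 --
          my $dir = tempdir( CLEANUP => 1 );
          open my $fh1, '>', "$dir/a.txt" or die $!;
          print $fh1 "x abcdefg01.123; y\n";
          close $fh1;
          open my $fh2, '>', "$dir/b.txt" or die $!;
          print $fh2 "nothing here\n";
          close $fh2;

          my @strings = ('abcdefg01.123', 'zzz99.000');

          # One alternation of quoted literals; perl 5.10's trie
          # optimisation keeps this fast even with ~300 branches.
          my $re = join '|', map quotemeta, @strings;
          $re = qr/($re)/;

          # Scan every file exactly once, noting which strings occur where.
          my %found;                       # string => { filename => 1 }
          for my $file (glob "$dir/*") {
              open my $in, '<', $file or do { warn "$file: $!"; next };
              while (my $line = <$in>) {
                  $found{$1}{$file} = 1 while $line =~ /$re/g;
              }
              close $in;
          }

          for my $str (@strings) {
              my @files = sort keys %{ $found{$str} || {} };
              print "$str: @files\n";
          }

      The point is that each file is opened once, not once per string: 4000 opens instead of 1,200,000.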

      The format of the string goes something like this: "abcdefg01.123". In all the files this string is preceded by blank space and followed by a ";".

      So the string is actually a combination of two words.
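
      Given that format, the match can be anchored on the surrounding blank and semicolon to cut down on false hits. A rough sketch (the sample string and line are only illustrations):

          use strict;
          use warnings;

          my @strings = ('abcdefg01.123');
          my $alt = join '|', map quotemeta, @strings;

          # Require the preceding whitespace and trailing ";" described
          # above; the anchors stay outside the capture.
          my $re = qr/\s($alt);/;

          my $line = 'foo abcdefg01.123; bar';
          if ($line =~ $re) {
              print "matched: $1\n";
          }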
Re: Searching multiple expressions in multiple Files
by jwkrahn (Abbot) on Mar 31, 2009 at 12:20 UTC

    If you have fgrep and sort available then you could do it like this:

    fgrep -o -f list_of_strings_file directory/* | sort -u
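
    Note that -o prints the matched strings; if the original question's goal is the file names, fgrep -l prints each matching file name once instead. A small sketch with made-up demo files:

        # throwaway demo files (names invented for illustration)
        dir=$(mktemp -d)
        printf 'abcdefg01.123\n'      > "$dir/patterns"
        printf 'x abcdefg01.123; y\n' > "$dir/a.txt"
        printf 'nothing here\n'       > "$dir/b.txt"

        # -l: print each matching file name at most once
        fgrep -l -f "$dir/patterns" "$dir"/*.txt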

      A quick benchmark shows that fgrep is about three times as fast as perl. (For 5.10.0, that is — with 5.8.8 the ratio is considerably worse!)  Such a large difference is somewhat surprising (to me); I would've expected perl-5.10 to be about on par with fgrep. So, please, someone point out what I've done wrong in my benchmark... :)

      perl-5.10.0:

              Rate  perl fgrep
      perl  30.8/s    --  -71%
      fgrep  106/s  246%    --

      perl-5.8.8:

              s/iter     perl    fgrep
      perl      9.38       --    -100%
      fgrep 3.00e-04 3124967%       --

      (In this particular reported case, none of the search words were found in $^X (the perl binary), so all strings had to be tested.)

      Thanks for the suggestions.
      Unfortunately I don't have Perl or the script to try it on until tomorrow. I will try it and let you know the results.