Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I have to write a script that searches for a string in about a thousand different text files. I did a search, and File::Grep seemed like it might be what I want. Other than using this module, I'd just like to know: what would be the most efficient method of achieving this?
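The naive approach I have in mind is just a plain loop over the files (a rough, untested sketch; the directory and string are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = 'needle';                  # fixed string to look for
    my @files  = glob '/some/dir/*.txt';    # stand-in for the ~1000 text files

    for my $file (@files) {
        open my $fh, '<', $file or do { warn "$file: $!"; next };
        while (my $line = <$fh>) {
            if (index($line, $string) >= 0) {   # index() is enough for a fixed string
                print "$file\n";
                last;                           # one hit per file will do
            }
        }
        close $fh;
    }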

Much appreciated
Jonathan

Replies are listed 'Best First'.
Re: Searching through text files
by dragonchild (Archbishop) on Mar 23, 2004 at 16:36 UTC
    How efficient does it need to be? What system will this script need to run on? You might be better off not using Perl for this. For example, on Unix-like systems, the grep command might be just the ticket, and it'll be more efficient (by most measures) than any Perl solution can be.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I have to disagree. Depending on how complex your regex is and what you want to do once you find the string in the file, grep is usually NOT more efficient than Perl. One major reason is that grep uses a text-directed (DFA) regex engine: it searches for the "best match" in a string, so it has to scan the whole string even if it finds a match before reaching the end. Perl uses a regex-directed (NFA) engine that returns the "left-most match", so the instant it finds a match, the Perl regex engine returns it and moves on.
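      You can see the early-return behaviour from within Perl itself. A rough Benchmark sketch (exact numbers will vary by system):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # A match near the start returns immediately; a failed match
        # has to consider the entire string before giving up.
        my $early_hit = 'needle' . ('x' x 1_000_000);
        my $no_match  = 'x' x 1_000_006;

        cmpthese(-2, {
            early_hit => sub { $early_hit =~ /needle/ },
            no_match  => sub { $no_match  =~ /needle/ },
        });

      I'd expect early_hit to come out far ahead on any system: that's the left-most-match early return at work.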

      I've seen this demonstrated when dealing with very large log files (>2 GB). Perl was able to do in a few minutes what was taking grep 10+ minutes to accomplish.

      Check out Jeffrey Friedl's book Mastering Regular Expressions. It's an absolutely fascinating read on regexes and regex engines.

      Later

        While this is definitely true for some regexps, Anonymous says in his question:

        I have to write a script that searches for a string in about a thousand different text files.

        For fixed strings, grep -F is probably fast enough.

        If, however, the needle string contains newlines or NUL characters, that may be difficult to achieve with grep, and perl may be the better tool. Also, if the file has very long lines (or no newlines at all), you can't make grep print only where the string is: it wants to print the whole line, or the line number, or just give you a truth value. In such cases Perl may be better (or some other program). And on Windows, if you only have find installed and no real grep, you may have to use Perl anyway.
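        For example, a pure-perl sketch (untested; the needle is a placeholder) that copes with newlines in the needle and reports byte offsets instead of whole lines:

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $needle = "line one\nline two";   # needle may contain newlines

          for my $file (@ARGV) {
              open my $fh, '<', $file or do { warn "$file: $!"; next };
              my $haystack = do { local $/; <$fh> };   # slurp the whole file
              close $fh;
              next unless defined $haystack;           # skip empty files

              # Report every occurrence with its offset, not whole lines.
              my $pos = 0;
              while (($pos = index($haystack, $needle, $pos)) >= 0) {
                  print "$file: match at offset $pos\n";
                  $pos++;
              }
          }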

      Hi dragonchild,

      thanks for your prompt response. I don't have a hard answer on how efficient it needs to be; I guess all I was really looking for was the quickest technique for doing this task using perl.

      Thanks a lot
      Jonathan

        Below is a simple solution with little perl code, relying on the standard UNIX tools find and grep. grep -F is used for a fast fixed-string (rather than regexp) search. It searches all the files in $DIR and all its subdirectories and collects the names of those matching $STRING. Note that Perl 5.8.0+ is required (for the safe, list form of open); if you don't have it, you must do the shell escaping yourself.

        #!/usr/bin/perl
        # safe form of IPC open requires perl 5.8.0
        use v5.8.0;

        my $DIR    = '/some/directory';
        my $STRING = 'hidden!';

        open my $fh, '-|', 'find', $DIR, qw/-type f -exec grep -lF/, $STRING, qw/{} ;/
            or die $!;
        chomp(my @found = <$fh>);
        # @found now contains the list of files matching the string
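        If find and grep are not available (on Windows, say), roughly the same search can be done in pure Perl with File::Find, which ships with perl. An untested sketch along the same lines:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use File::Find;

          my $DIR    = '/some/directory';
          my $STRING = 'hidden!';

          my @found;
          find(sub {
              return unless -f $_;
              open my $fh, '<', $_
                  or do { warn "$File::Find::name: $!"; return };
              while (my $line = <$fh>) {
                  if (index($line, $STRING) >= 0) {   # fixed-string match, like grep -F
                      push @found, $File::Find::name;
                      last;                           # stop at the first hit
                  }
              }
              close $fh;
          }, $DIR);

          # @found now contains the list of files matching the string
          print "$_\n" for @found;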

Re: Searching through text files
by tachyon (Chancellor) on Mar 23, 2004 at 17:26 UTC

    swish-e is a great search tool. It even has a nice Swish::API in perl. apache.org, for instance, uses it to search their site. It will index and search any text file, and there are plugins for all sorts of formats to convert them to text for indexing. Index searches are the way to go, rather than grepping every file every time you search. With swish-e the core code is C; it handles stemming and indexing for you, and you get a nice stable solution with a solid XS API.
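    Once an index has been built, querying it from perl takes only a few lines. A sketch based on my memory of the SWISH::API synopsis (untested; the index file name is a placeholder, and the index would be built beforehand with something like swish-e -i /some/directory -f index.swish-e):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use SWISH::API;

        # Open an existing index and run a query against it.
        my $swish   = SWISH::API->new('index.swish-e');
        my $results = $swish->Query('hidden');

        # Print the path of each matching document.
        while (my $result = $results->NextResult) {
            print $result->Property('swishdocpath'), "\n";
        }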

    cheers

    tachyon

      I didn't even know that swish was still around. I used it quite a bit 5 or 6 years ago to add site-search functionality to some websites I was working on. We eventually moved to Verity due to the amount of text we had to search, but I really dug swish.

        Alive and kicking (ass). The Swish::API XS interface makes it totally accessible from perl, with no forked code or system calls needed to get at it. We use the current version to search and index some quite large websites, and search plus custom highlighting runs in the order of several milliseconds. No stability issues AFAIK.

        cheers

        tachyon

Re: Searching through text files
by TomDLux (Vicar) on Mar 24, 2004 at 06:55 UTC

    The most efficient, in terms of moving the task from the to-do list to the done list, is:

    # at the Unix command line:
    $ find $dir -type f -exec grep -q 'foo bar baz' {} \; -print

    Search recursively below the directory $dir; if a found object is a regular file, grep it for the constant string 'foo bar baz'. The -q flag suppresses grep's output, since we only want to know whether the text was found; if it was, -print prints the name of the file.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA