the_slycer has asked for the wisdom of the Perl Monks concerning the following question:

Greetings

I have written a script that searches a list of text files, matching either on the filename or on the files' contents (depending on which button you push; this is Tk'd). The filename search is obviously very fast, but the "grep" of the files is extremely slow, and I was hoping to find a way to speed it up. There are about 250 files, amounting (in total) to no more than 350k. I have implemented a poor man's "cache" to try to speed up a second search for the same value. Here is the code snippet I'm using for the search:
foreach $filename (@file_array) {
    chomp ($filename);
    open (FH, "$filename") or warn "Could not open $filename $!";
    while ($line = <FH>) {
        if ($line =~ /.*$search_value*/i) {
            ++$matched;
            $file_listbox->insert('0', "$filename--> $line");
            open (RECENT, ">>$installpath/recent");
            print RECENT "$search_value= $filename--> $line";
            $numhash{"$search_value"} = "true";
            close (RECENT);
        }
    }
    close (FH);
}
The search tool shows (as is obvious from above) the whole line that the search value was found on.

Any assistance would be deeply appreciated.

Replies are listed 'Best First'.
Re: Faster way?
by Fastolfe (Vicar) on Oct 06, 2000 at 22:08 UTC
    Firstly, your regexp /.*$search_value*/ will not do quite what you think. If $search_value is 'abc', the string "xyzab" will match, since the trailing * is applied to the 'c' in your resulting pattern. The regular expression will also cause your code to die if $search_value has characters that goof up the regexp compilation.
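    To see the pitfall concretely, here is a small demonstration (the values are made up): with $search_value set to 'abc', the pattern /.*$search_value*/ compiles to /.*abc*/, and the trailing * makes the final 'c' optional, so strings that do not contain 'abc' still match.

```perl
# Hypothetical values, showing why /.*$search_value*/ misbehaves.
my $search_value = 'abc';
my $string = 'xyzab';                      # does NOT contain 'abc'

# The trailing * binds to the 'c', making it optional, so this matches:
my $broken = ($string =~ /.*$search_value*/i) ? 'match' : 'no match';
# A plain, unanchored pattern correctly fails to match:
my $plain  = ($string =~ /$search_value/i)    ? 'match' : 'no match';

print "broken pattern: $broken\n";   # broken pattern: match
print "plain pattern:  $plain\n";    # plain pattern:  no match
```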
    chomp(@file_array);
    open(RECENT, ">>$installpath/recent");   # or just >
    foreach $filename (@file_array) {
        open(FH, "< $filename") or warn "Could not open $filename: $!";
        while (<FH>) {
            if (/$search_value/o) {
                print RECENT "$search_value=$filename--> $_";
                $numhash{$search_value}++;
                $matched++;
            }
        }
        close(FH);
    }
    close(RECENT);
    We move the RECENT file stuff outside of your loop, since it makes little sense to keep re-opening the file for every line we want to write. I imagine that's a major source of your speed problems. Since $search_value doesn't change, we optimize the regular expression with the /o switch. If you wanted to forget about subsequent matches in a given file, you could add a last statement inside your loop, which would skip to the next file instead of reading the rest of the current file, but it seems like you're interested in each line that matches.

    If your 'recent' file is a temporary/transient thing, used only for processing later in your script, you might also want to consider just storing your matches in an internal data structure, and use them later instead of reading from your file:

    push(@{$search_results{$search_value}->{$filename}}, $line);
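    As a minimal sketch of that idea (the filenames, lines, and reporting loop here are assumptions, not from the original script), you would push each matching line into the structure as you read, then walk it afterwards instead of re-reading a 'recent' file:

```perl
my %search_results;
my $search_value = 'foo';    # hypothetical search term

# Inside the per-file read loop, each matching line would be stored like so:
push @{ $search_results{$search_value}{'a.txt'} }, "foo bar\n";
push @{ $search_results{$search_value}{'b.txt'} }, "more foo here\n";

# Later, report every hit without touching the filesystem again:
for my $filename (sort keys %{ $search_results{$search_value} }) {
    print "$filename--> $_" for @{ $search_results{$search_value}{$filename} };
}
```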
    Of course, don't underestimate the simplicity of doing this without Perl, if that's all you have to do. The 'grep' command can perform this task natively, unless you need to do some additional processing on the data, and aren't just building a 'recent' file with text matches.

      Unless the people using your script can be trusted with learning regexp syntax, you may wish to write that as: if (/\Q$search_value\E/o) { with the \Q doing a quotemeta on the string to make sure there aren't "confusing" things in there. Sooner or later some bright-boy will try and search for ".*" for some reason and be lucky enough to have your script return all 350k of data to him...

      If you are cleaning up and fixing the search pattern elsewhere, ignore this.
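      A quick illustration of what \Q buys you (the values are made up): an unquoted ".*" matches every line, while the quoted version only matches a literal ".*".

```perl
my $search_value = '.*';             # a user typing regexp metacharacters
my $line = 'nothing special here';

# Unquoted, '.*' is a wildcard and matches any line at all:
print "unquoted: match\n" if $line =~ /$search_value/;
# With \Q...\E (quotemeta), it only matches a literal '.*' in the text:
print "quoted: match\n"   if $line =~ /\Q$search_value\E/;
```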

      --
      $you = new YOU;
      honk() if $you->love(perl)

      Actually, I want it to match the search value regardless of where it comes up in the line :-).
      There is a lot more that the prog does, and it's not just for my use: this is for a bunch of people who have never used grep. Plus, we are using Win2k here, and this app is faster than the "find" command. It's one of those cases where people have used an app for a while and don't want to see it go. We used to use scripts on DEC/VMS to do this :-)
        /$search_pattern/ is unanchored:
        "abcdefg" =~ /cd/ # true "abcdefg" =~ /^cd/ # anchored at start, false "abcdefg" =~ /fg$/ # anchored at end, true
        The regular expression I provided will match anywhere in the line, unless you've modified that behavior by inserting a ^ or $ at the beginning of the search string, or the end, respectively.
Re: Faster way?
by jreades (Friar) on Oct 07, 2000 at 00:36 UTC

    With what sort of frequency are the files updated -- is it worth stuffing matches into a database and expiring them every couple of days?

    Are matches unique, or multiple? (I think the latter, but this isn't spelled out.)

    Depending on the frequency of modification, you could use a table setup something like the following:

    files:
        file_id           UNIQUE INT
        path              UNIQUE VARCHAR/TEXT
        last_access_stat  TIME

    matches:
        word      VARCHAR
        file_id   INT
        position  INT

    This is just an off-the-cuff format, nothing to run a db by...
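    A hedged sketch of creating those two tables from Perl with DBI (the DBD::SQLite driver, database name, and exact column types are assumptions for illustration; the thread does not specify a database):

```perl
use DBI;

# In-memory SQLite database purely for demonstration; a real index
# would use a file-backed database instead.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

# One row per indexed file.
$dbh->do(q{
    CREATE TABLE files (
        file_id          INTEGER PRIMARY KEY,
        path             TEXT UNIQUE,
        last_access_stat INTEGER
    )
});

# One row per word occurrence, pointing back at its file.
$dbh->do(q{
    CREATE TABLE matches (
        word     TEXT,
        file_id  INTEGER,
        position INTEGER
    )
});
```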
