meetraz has asked for the wisdom of the Perl Monks concerning the following question:

I am iterating through a file line by line. The file is currently 200k lines, 100+ MB, but will grow larger over time. For each line in the file, I need to compare it to a set of regexes. Currently that set is about 15 patterns, but it too will grow over time, eventually to 50-100 patterns. Here is what I'm doing now:
my $bigpattern = '(' . join('|', @patterns) . ')';
while (my $inputline = <FILE>) {
    if ($inputline =~ /$bigpattern/io) {
        ## Logic goes here
    }
}
This doesn't seem to be the most elegant or efficient way to do this. Can anybody recommend a better practice?

Replies are listed 'Best First'.
Re: Bulk Regex?
by RMGir (Prior) on Aug 29, 2002 at 15:44 UTC
    Regex::PreSuf would give you a more efficient "$bigpattern". Apart from that, nothing else springs to mind.
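
    A minimal sketch of that idea (the department codes, filename, and input are made up; the eval falls back to a plain alternation if Regex::PreSuf isn't installed):

```perl
use strict;
use warnings;

my @patterns = qw(RSRCH01 RSRCH02 SALES01);   # hypothetical department codes

# Use Regex::PreSuf if it is installed; presuf() folds common prefixes
# and suffixes into one compact pattern instead of a flat alternation.
# Fall back to a plain join when the module is unavailable.
my $bigpattern = eval {
    require Regex::PreSuf;
    Regex::PreSuf::presuf(@patterns);
} || join '|', map quotemeta, @patterns;

my $re = qr/$bigpattern/i;

# in-memory stand-in for the real 100+ MB file
open my $fh, '<', \"RSRCH02,Jones\nLEGAL99,Smith\n" or die $!;
my $hits = 0;
while (my $inputline = <$fh>) {
    $hits++ if $inputline =~ $re;
}
print "$hits matching line(s)\n";   # prints "1 matching line(s)"
```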

    Could you find some simpler way to recognize "interesting" lines? Maybe they have some other characteristic, apart from matching one of your regexes (regexs, regexen, whichever you prefer:)).
    --
    Mike

      Thanks! I tried this, and it appears to speed execution by 5-7% with the current environment. Basically, this application accepts a list of department code prefixes and searches through a list of employees to extract those that match. I'm actually split()ing each line of the file and matching only one field, using the other fields in business logic if any of the department code prefixes match.
        Aha! Then don't use a regex.
        # substitute in your dept codes, and a better hash name :)
        my %deptCodesICareAbout = ( RSRCH => 1, SALES => 1 );
        # substitute in the right FILE, field list, and delimiter...
        while (<FILE>) {
            my ($code, # whatever else
                ) = split /,/;
            if ($deptCodesICareAbout{$code}) {
                # business logic
            }
        }
        That will be MUCH faster.
        --
        Mike
Re: Bulk Regex?
by Zaxo (Archbishop) on Aug 29, 2002 at 16:30 UTC

    There are several things you can do for efficiency: not much for elegance.

    1. As you read each line from the file, study it.
    2. Apply qr// to each of @patterns.
    3. Do not combine @patterns into a larger pattern. Iterate instead
    4. Quit iterating at the first match.
    5. Keep score. Sort @patterns so that the commonest matches come first.

    Not all of these will help, or even apply. For instance, a parser may have fixed priorities for matches, preventing use of #5. As for the others, Benchmark. Strategy #4 is already observed by the big alternation, so it should be used in any iterative scheme, too.
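
    Points 1-4 above look something like this in practice (a minimal sketch; the patterns and input are invented stand-ins):

```perl
use strict;
use warnings;

my @patterns = map { qr/$_/i } qw(RSRCH SALES LEGAL);  # point 2: qr// each pattern

# in-memory stand-in for the real file
open my $fh, '<', \"SALES,Smith\nHR,Jones\nlegal,Wu\n" or die $!;

my $matched = 0;
while (my $line = <$fh>) {
    study $line;                 # point 1 (a no-op on modern perls, but harmless)
    for my $re (@patterns) {     # point 3: iterate the patterns separately
        if ($line =~ $re) {
            $matched++;          # logic goes here
            last;                # point 4: quit at the first match
        }
    }
}
print "$matched line(s) matched\n";   # prints "2 line(s) matched"
```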

    Update: Bah, underspecified. It sounds like you want an rdbms and DBI to store your data. RMGir's hash idea is good too.

    After Compline,
    Zaxo

Re: Bulk Regex?
by simeon2000 (Monk) on Aug 29, 2002 at 20:58 UTC
    qr// is sweet. Big patterns are slower than many small patterns.
    UNTESTED CODE:

    my @codes = qw/ CODE1 CODE2 CODE3 /;
    my @regex = map { qr/$_/i } @codes;
    while (my $inputline = <FILE>) {
        my $found = 0;
        foreach my $regex (@regex) {
            last if (($found) = $inputline =~ /($regex)/);
        }
        next unless $found;
        # logic goes here
    }
    Make sure you put the most common codes at the front of your @codes array for more speed, since you'll do less searching.

    --
    perl -e "print qq/just another perl hacker who doesn't grok japh\n/"
    simeon2000|http://holdren.net/

      Make sure you put the most common codes at the front of your @codes array for more speed, since you'll do less searching.

      Interesting idea... I wonder if dynamically resorting as you go would help?

      my @codes = qw/ CODE1 CODE2 CODE3 /;
      my %hitCounts;
      my @regex = map { my $re = qr/$_/i; $hitCounts{$re} = 1; $re } @codes;

      # tune this parameter for optimal performance, balancing better ordering
      # of regexen with sort costs...
      my $resortFreq = 1000;
      my $iterCount  = 0;

      while (my $inputline = <FILE>) {
          my $found = 0;
          foreach my $regex (@regex) {
              if ($inputline =~ /$regex/) {
                  $hitCounts{$regex}++;
                  $found = 1;
                  last;
              }
          }
          # re-sort every $resortFreq lines; the frequency prolly should be tuned
          if (++$iterCount % $resortFreq == 0) {
              @regex = sort { $hitCounts{$b} <=> $hitCounts{$a} } @regex;
          }
          next unless $found;
          # logic goes here
      }
      At the very least, you could do this once, and then feed the results back into the top of the script. If the frequencies are about constant (which makes sense for department populations in a large firm, I guess), this should make a lot more sense than dynamically resorting each time...
      END {
          print "Regexen in sorted order:\n\t";
          print join "\n\t", sort { $hitCounts{$b} <=> $hitCounts{$a} } @regex;
          print "\n";
      }
      Another alternative would be to have the END block write out a file of regexen, which the script could read back in. Then each run would be as good as it could be, based on the results of the previous run...
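
      A sketch of that feedback loop (the counts are pretend values; a real script would accumulate %hitCounts while running, and would likely use a fixed path instead of a temp file):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my @codes     = qw(CODE1 CODE2 CODE3);
my %hitCounts = (CODE1 => 2, CODE2 => 40, CODE3 => 7);  # pretend counts from a prior run

# end of run: write the codes out, most frequently hit first
my ($out, $order_file) = tempfile();
print {$out} "$_\n" for sort { $hitCounts{$b} <=> $hitCounts{$a} } @codes;
close $out;

# start of next run: prefer the saved order if it exists
if (-e $order_file) {
    open my $in, '<', $order_file or die "can't read $order_file: $!";
    chomp(my @saved = <$in>);
    close $in;
    @codes = @saved if @saved;
}
print "@codes\n";   # prints "CODE2 CODE3 CODE1"
```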
      --
      Mike
Re: Bulk Regex?
by zentara (Cardinal) on Aug 29, 2002 at 17:35 UTC
    Maybe pre-compiling the regex would speed things up, so
    the regex isn't compiled for each line? It's in the FAQ somewhere.
    my $bigpattern = '(' . join('|', @patterns) . ')';
    my $re = qr/$bigpattern/i;
    while (my $inputline = <FILE>) {
        if ($inputline =~ /$re/o) {
            ## Logic goes here
        }
    }
      Sorry, he was already using /o. I don't think qr// buys you anything over /$var/o, but I could be wrong.

      You're right, though, it _is_ a faq:

      What is "/o" really for?

      Using a variable in a regular expression match forces a re-evaluation (and perhaps recompilation) each time the regular expression is encountered. The "/o" modifier locks in the regex the first time it's used. This always happens in a constant regular expression, and in fact, the pattern was compiled into the internal format at the same time your entire program was.

      Use of "/o" is irrelevant unless variable interpolation is used in the pattern, and if so, the regex engine will neither know nor care whether the variables change after the pattern is evaluated the *very first* time.

      "/o" is often used to gain an extra measure of efficiency by not performing subsequent evaluations when you know it won't matter (because you know the variables won't change), or more rarely, when you don't want the regex to notice if they do. For example, here's a "paragrep" program:

      $/ = '';  # paragraph mode
      $pat = shift;
      while (<>) {
          print if /$pat/o;
      }
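
      The locking behavior the FAQ describes is easy to demonstrate (a minimal sketch; the strings are arbitrary):

```perl
use strict;
use warnings;

my @matched;
my $pat;
for my $try (qw(foo bar)) {
    $pat = $try;
    # /o compiles the pattern once: 'foo' is locked in on the first pass,
    # so the second iteration still matches against /foo/, not /bar/
    push @matched, $try if 'bar' =~ /$pat/o;
}
print scalar(@matched), " match(es)\n";   # prints "0 match(es)"; without /o it would be 1
```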

      Edit: Time for a Benchmark:

      #!/usr/bin/perl -w
      use strict;
      use Benchmark qw(cmpthese);

      my $pattern = join "|", qw(a ab abc abcde ef gh ghij qrst nqmz stuv);
      my $qr = qr/$pattern/;
      my @candidates = qw(abc zzz stuv asasdfasdfaf qqqqqqqqqqqqqqqq lkasjh adsd qqqqqqqqqqqqabc);

      sub preComp {
          my $result = scalar grep /$pattern/o, @candidates;
          return $result;
      }

      sub preCompQR {
          my $result = scalar grep /$qr/o, @candidates;
          return $result;
      }

      sub noPrecomp {
          my $result = scalar grep /$pattern/, @candidates;
          return $result;
      }

      print "Working from pattern '$pattern', qr='$qr'\n";
      print "preComp finds ", preComp(), "\n";
      print "preCompQR finds ", preCompQR(), "\n";
      print "noPrecomp finds ", noPrecomp(), "\n";

      cmpthese(-3, {
          preComp   => \&preComp,
          preCompQR => \&preCompQR,
          noPrecomp => \&noPrecomp,
      });
      On perl 5.6.1, here are the results:
      Working from pattern 'a|ab|abc|abcde|ef|gh|ghij|qrst|nqmz|stuv', qr='(?-xism:a|ab|abc|abcde|ef|gh|ghij|qrst|nqmz|stuv)'
      preComp finds 6
      preCompQR finds 6
      noPrecomp finds 6
      Benchmark: running noPrecomp, preComp, preCompQR, each for at least 3 CPU seconds...
      noPrecomp: 4 wallclock secs ( 3.09 usr + 0.01 sys = 3.10 CPU) @ 33414.22/s (n=103417)
        preComp: 4 wallclock secs ( 3.24 usr + 0.01 sys = 3.25 CPU) @ 35360.06/s (n=115097)
      preCompQR: 2 wallclock secs ( 3.08 usr + 0.02 sys = 3.10 CPU) @ 35094.39/s (n=108933)
                     Rate noPrecomp preCompQR   preComp
      noPrecomp   33414/s        --       -5%       -6%
      preCompQR   35094/s        5%        --       -1%
      preComp     35360/s        6%        1%        --
      I'd be interested to see 5.8.0 results; the regex engine is one of the places that got tweaked quite a bit, I think.
      --
      Mike