kprice++ has asked for the wisdom of the Perl Monks concerning the following question:

So, first off, this is for a homework assignment. The problem involves taking an arbitrarily long list of REGEX strings and searching for them in a file. Would it be better, or faster, to search each line in the file, one string at a time or with alternation...(str1|str2).

my $rexs = '(' . join('|', @regexs) . ')'; #... if(/$rexs/) {...} #or.. foreach $rex (@regexs) { if(/$rex/) {...} }

Replies are listed 'Best First'.
Re: Alternation vs. looping for multiple searches.
by roboticus (Chancellor) on Nov 21, 2010 at 03:16 UTC

    kprice:

    It depends whether you want to find all matches or if you're happy with finding any matches. With alternation, once you find one, then it stops looking, and won't find the rest. If you're satisfied with that, then by all means use it. If you need to find *each* match, then alternation could easily become troublesome:

    Roboticus@Roboticus-PC ~ $ cat regex_alt.pl use strict; use warnings; my $text=<<EOT; Now is the time for all good men to come to the aid of their party. EOT print "ORDER MATTERS FOR COLLISIONS:\n"; while ($text=~/(the|their)/g) { print "\tfound $1 at $-[0], $+[0].\n"; } print "VS:\n"; while ($text=~/(their|the)/g) { print "\tfound $1 at $-[0], $+[0].\n"; } Roboticus@Roboticus-PC ~ $ perl regex_alt.pl ORDER MATTERS FOR COLLISIONS: found the at 7, 10. found the at 44, 47. found the at 55, 58. VS: found the at 7, 10. found the at 44, 47. found their at 55, 60.

    Here, you see that if you're not careful in building your regex from a collection, that you can have problems. If you put 'the' before 'their' in your alternation, you'll never match 'their'. In an automated system, you can't simply go by string length, and put the shorter strings after the longer strings. For example, if one of your regex strings was "t[ho]e", then placing it before "the" would still prevent you from matching the second one.

    Now all of this may or may not matter depending on your requirements. (After all, if "the" can match either expression, is it significant which one matched?) My point is simply that you'll need to think about things before simply building an alternation...

    ...roboticus

      Sorry, I should have been much more specific. The problem is to find a set of regex's that would mark blocks of text. So it doesn't matter how many are on a line, or where. Once one regex is matched, all lines following, up to and including the next match, are considered within the block of code. The lines within blocks are to be returned to standard output.

      I have two working scripts, one that uses alternation, and one that uses a loop. The way my teacher is, it's usually apparent what he expects that doesn't make it into the spec, and it seems to me, he intends for there to be a loop. But, I don't see the problem with using just regex. Would it be slower than having a loop that searches one regex at a time and can stop once it finds one? Is that even a relevant question?

      106 ^L[18:27:43 kelly@kudzu lab6]$ cat 4.pl 107 #! /usr/bin/perl 108 # 109 # CIS 33A Programming In Perl 110 # Lab #6.4 111 # Due Date: Monday, Nov 22 112 # Written By: Kelly Price 113 # 114 use strict; 115 use warnings; 116 117 die "Usage: $0 infile outfile regexs...\n" if @ARGV < 3; 118 open(FIN, "$ARGV[0]") or die "Error openning $ARGV[0] for readin +g: $!\n"; 119 open(FOUT, ">$ARGV[1]") or die "Error openning $ARGV[1] for writin +g: $!\n"; 120 121 my (@regexs) = splice(@ARGV, 2); 122 my $rexs = '(' . join('|', @regexs) . ')'; 123 124 my $inblock = 0; 125 print FOUT grep { 126 my $match = 0; 127 if (/$rexs/) 128 { 129 $inblock = $inblock ? 0 : 1; 130 $match = 1; 131 } 132 $inblock or $match; 133 } <FIN>; 134 135 [18:27:46 kelly@kudzu lab6]$ 4.pl 4.pl 4.out \\{ \\} 136 [18:28:00 kelly@kudzu lab6]$ cat 4.out 137 print FOUT grep { 138 my $match = 0; 139 if (/$rexs/) 140 { 141 } 142 $inblock or $match; 143 } <FIN>;
Re: Alternation vs. looping for multiple searches.
by PeterPeiGuo (Hermit) on Nov 21, 2010 at 02:21 UTC

    Those two solutions don't produce the same results, and the one with join is logically wrong.

    Peter (Guo) Pei

      I should have been more specific. The assignment is poorly spec'ed anyway.
Re: Alternation vs. looping for multiple searches.
by JavaFan (Canon) on Nov 21, 2010 at 14:54 UTC
    If the patterns are simple (patterns with no characters having any special meaning) I expect the loop to be faster. This is because the simpler the string, the more chance there is the optimizer does all, or most, of the work.

    For more complicated patterns, it will depend on several things; to name a few:

    • Is there a match?
    • Which of the alternations match?
    • Where in the string will it match?
    • How long is the string?
    • How many alternations are there?
    • How complicated are the alternations?
    Whether a loop or a pattern with alternations is faster will depend from case to case. The only way to make sure is to run a benchmark, and even then you only have an answer for a specific set of patterns, a specific set of data, a specific version of Perl (version-number, compilation options, compiler used), and a specific OS. Change one of these things, and you may get a different result.
Re: Alternation vs. looping for multiple searches.
by chrestomanci (Priest) on Nov 21, 2010 at 13:49 UTC

    If you want to know which will be faster, then you need to benchmark. My gut feeling is that alternation will be faster than loops, but I could easily be wrong. There will also be details in exactly what you are searching for that will affect performance. For example, if your are looking for the words "fox", "fred" or "frodo", then the regular expression /f(r((odo)|(ed)))|(ox)/ might be faster than a simpler matcher for each word separately. On the other hand perhaps the perl regular expression engine will do that optimisation for you, so writing the more complex regex above will make difference.

    One important tip, is that if the text you are searching for (and therefore the regular expression) is not know at perl compile time, then you can explicitly compile the regular expression at runtime, which will improve performace. Do that with "qr" brackets: my $compiled_re = qr/pattern/

    You also asked which is better. That depends on a whole heap of factors, but if readability and maintainability is more important than performance, then forget what I just said, and write a simple algorithm that is easy for anyone to read and understand, even if it is a little slower.

      One important tip, is that if the text you are searching for (and therefore the regular expression) is not know at perl compile time, then you can explicitly compile the regular expression at runtime, which will improve performace.
      qr gains in performance if it helps compiling the same pattern multiple times. It has nothing at all to do whether a pattern is known at perl compile time or not.
Re: Alternation vs. looping for multiple searches.
by aquarium (Curate) on Nov 21, 2010 at 22:15 UTC
    there's always another way...you could construct and then eval code that tests the regexes one by one in sequence, shortcircuiting on first match found. you'd usually arrange to have the most common case first and least common case last, so that it short circuits at earliest opportunity.
    /$regex1/ or /$regex2/ or /$regex3/......etc
    the hardest line to type correctly is: stty erase ^H