Re: Matching a long list of phrases

Others have addressed the issue of maintainability, so I'll just quickly add a word or two about efficiency. In older perls (that is, every production-quality perl build released so far) an alternation is matched in O(N) time, where N is the number of alternatives: the engine tries each branch in turn at every starting position. What this means is that there is a good chance that

  for (@patterns) {
     if ( $incoming{'text'} =~ /$_/ ) { 
         ....
     }
  }

will actually perform better than

     if ( $incoming{'text'} =~ /$Big_Regex/ ) { 
         ....
     }

There are workarounds, such as the modules pointed out elsewhere in this thread, that allow $Big_Regex to be much more efficient. Regardless, this type of pattern isn't handled particularly well by the perl regex engine.
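Which of the two forms wins depends on your perl, your patterns and your text, so it is worth measuring rather than guessing. Here is a minimal sketch using the core Benchmark module. The phrase list, the sample text and the names match_loop and match_big are made up for illustration, and the phrases are assumed to be literal strings (hence the quotemeta).

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data; substitute your own phrase list and input text.
my @phrases  = map { sprintf "phrase %03d", $_ } 1 .. 200;
my @patterns = map { qr/\Q$_\E/ } @phrases;
my $text     = ( "some filler text " x 500 ) . "phrase 150";

# One big alternation built from the same (escaped) phrases.
my $alternation = join "|", map { quotemeta } @phrases;
my $Big_Regex   = qr/$alternation/;

# Returns 1 if any of the individual patterns matches, else 0.
sub match_loop {
    my ($str) = @_;
    for my $re (@patterns) {
        return 1 if $str =~ $re;
    }
    return 0;
}

# Returns 1 if the single combined alternation matches, else 0.
sub match_big {
    my ($str) = @_;
    return $str =~ $Big_Regex ? 1 : 0;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese( -1, {
    loop      => sub { match_loop($text) },
    big_regex => sub { match_big($text) },
} );
```

The numbers will shift with the length of the text, the number of phrases and where the hit is, so try it with data that resembles your real input.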

Things change in Perl 5.9.x, however, and by the time 5.9.4 is released they will have changed a lot. The result is that in 5.9.4 and later the raw $Big_Regex, as formed by something like

  my $Big_Regex=join "|",@patterns;
  $Big_Regex=qr/$Big_Regex/;

will be handled much more efficiently. It will probably execute at least twice as fast as the same pattern would if preprocessed by something like Regexp::Trie or Regexp::Assemble. The hope is that modules like these will be taught to do the right thing on later perls, so that if you do decide to use one of them, the module adjusts automatically when you move to a later version.

Also note that if the string being searched is long and the number of patterns is small, the loop over the possibilities is likely to be the fastest approach. The reason is that, when each pattern is a plain literal string, the match will internally turn into something like

  for (@patterns) {
     if ( instr($incoming{'text'},$_) > -1 ) { 
         ....
     }
  }

Since instr() uses Fast Boyer-Moore (FBM) matching, this formulation is likely to be extremely fast, regardless of the perl version being used. It's only when the number of patterns gets large, or the string gets long, that the cost of multiple FBM searches will outweigh the cost of a single regex search. (FBM is fast because it doesn't necessarily look at all the characters in the string being searched to find a match.)
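If your phrases really are literal strings, you can also spell the loop out at the Perl level with the built-in index(), which returns -1 when the substring is absent. This is only a sketch of the idea: first_literal_hit and the sample data are invented for illustration, and I'm not claiming index() takes exactly the same internal path as the regex engine's optimisation.

```perl
use strict;
use warnings;

# Return the first phrase from @$phrases that occurs literally in $text,
# or nothing (undef in scalar context) if none of them occurs.
sub first_literal_hit {
    my ( $text, $phrases ) = @_;
    for my $phrase (@$phrases) {
        return $phrase if index( $text, $phrase ) > -1;
    }
    return;
}

# Hypothetical input, shaped like the %incoming hash used above.
my %incoming = ( text => "the quick brown fox jumps over the lazy dog" );
my @phrases  = ( "purple cow", "brown fox", "lazy dog" );

my $hit = first_literal_hit( $incoming{'text'}, \@phrases );
print defined $hit ? "matched: $hit\n" : "no match\n";   # matched: brown fox
```

Note that this reports the first phrase in list order that occurs anywhere in the text, not the phrase that occurs earliest in the text.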

Aside: As a last point, I thought I'd shine a light on an interesting subtlety in how 5.9.4 and later will behave differently from the "equivalent" pattern produced by Regexp::Assemble or friends. When perl does the trie optimisation it respects the order in which the words appear in the alternation. Regex patterns produced by an optimiser will not. Consider /a|abc|ab/: we expect this to try to match 'a', then 'abc', then 'ab'. The pattern produced by an optimiser will be /a(?:bc?)?/, which will try 'abc', then 'ab', then 'a'. So not only will 5.9.4 perform better, it will also match pretty much exactly as pre-5.9.4 perls do, with the one exception that dupes will only be tried once, so /abc|bc|abc|cd|abc/ will perform the same as /abc|bc|cd/. Note, however, that the order is preserved; it's just that the later dupes are ignored.
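The ordering difference is easy to see for yourself. A small sketch (matched_text is just a helper made up for this demo) that runs both forms against the string 'abc':

```perl
use strict;
use warnings;

# Return the substring that $re matched in $str, or undef if it didn't match.
sub matched_text {
    my ( $str, $re ) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

my $ordered   = qr/a|abc|ab/;     # alternation as written: 'a' is tried first
my $optimised = qr/a(?:bc?)?/;    # prefix-merged form: greedy, longest first

print "alternation: ", matched_text( "abc", $ordered ),   "\n";   # a
print "merged:      ", matched_text( "abc", $optimised ), "\n";   # abc
```

Both patterns accept the same strings, but they can disagree about what was matched, which matters as soon as you capture the match or anchor the end of the pattern.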

---
$world=~s/war/peace/g
