Matching a long list of phrases

Hagbone has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Matching a long list of phrases by davorg (Chancellor) on Aug 14, 2006 at 15:07 UTC
One suggestion to increase maintainability would be to store the list of phrases in an array (which could be generated from a text file or even a database) and then build the regex on the fly. `my @phrases = ( 'phrase one', 'phrase two', 'a different phrase', 'yet another match', 'one more to match', 'single', 'word', 'matches' ); my $regex = join '\|', map { "\Q$_\E" } @phrases; my $regex = qr($regex); if ($incoming{text} =~ /$regex/) { # ... }` [download] Note the use of `\Q .. \E` to defang any metacharacters that might have sneaked into your list of phrases. I've also precompiled the regex (`qr()`) which may save some time if you're doing lots of matches. -- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg	[reply] [d/l]
Re: Matching a long list of phrases by Fletch (Bishop) on Aug 14, 2006 at 15:05 UTC
Maybe Regexp::Assemble or Regexp::Trie is what you're looking for?	[reply]
Re: Matching a long list of phrases by Sartak (Hermit) on Aug 14, 2006 at 16:55 UTC
As the Camel says, you should use `/Gandalf/ \|\| /Saruman/ \|\| /Radagast/` over `/Gandalf\|Saruman\|Radagast/` for efficiency. `my @phrases = map { qr/$_/ } # precompile regexes, may not be what you + want though ( 'phrase one', 'phrase two', 'a different phrase', 'yet another match', 'one more to match', 'single', 'word', 'matches', ); PHRASE: foreach my $phrase (@phrases) { last PHRASE if $incoming{'text'} =~ /($phrase)/; } if (defined $1) { # Matched phrase in $1. } else { # No match. }` [download]	[reply] [d/l] [select]
Re: Matching a long list of phrases by artist (Parson) on Aug 14, 2006 at 16:12 UTC
Using very fast Regexp::Trie pointed by fletch `use Regexp::Trie; my $rt = Regexp::Trie->new; open(IN, "phrases.txt"); @list = <IN>; for (@list){ chomp; $rt->add($_); } my $regexp = $rt->regexp; print "$regexp\n"; if ($incoming{text} =~ qr(\b$regexp\b)){ print "Matched: $& \n"; }` [download] --Artist	[reply] [d/l]
Re: Matching a long list of phrases by Leviathan (Scribe) on Aug 14, 2006 at 15:12 UTC
How about: `my $regex = join "\|", map { qr{$_} } split "\n", <<PHRASES; phrase one phrase two phrase three phrase moo PHRASES if ($incoming{'text'} =~ /$regex/) { DO WHAT NEEDS TO BE DONE HERE }` [download] -- Leviathan.	[reply] [d/l]
Re: Matching a long list of phrases by demerphq (Chancellor) on Aug 15, 2006 at 08:22 UTC
Others have addressed the issue of maintainability so ill just quickly add a word or two about efficiency. In older perls (such as every production quality perl build so far released) alternations are performed in O(N) time. What this means is that there is a good chance that `for (@patterns) { if ( $incoming{'text'} =~ /$_/ ) { .... } }` [download] will actually perform better than `if ( $incoming{'text'} =~ /$Big_Regex/ ) { .... }` [download] There are workarounds, such as the modules pointed out elsewhere in this thread, that allow $Big_Regex to be much more efficient, but regardless this type of pattern isn't handled particularly well by the perl regex engine. Things change in Perl 5.9.x however, and by the time 5.9.4 is released they will have changed a lot. The result of this is that in 5.9.4 and later the raw $Big_Regex as formed by something like `my $Big_Regex=join "\|",@patterns; $Big_Regex=qr/$Big_Regex/;` [download] will be handled in a much much more efficient way. And it will execute probably at least twice as fast as the same pattern would be if preprocessed by something like Regexp::Trie or Regexp::Assemble. The hope is that modules like these will be trained to do the right thing on later perls so that if you do decide to use one of them that when you go to a later version the modules automatically adjust as appropriate. Also note that if the string being searched is long and the number of patterns small that the loop over the possibilities is likely to be the fastest approach. The reason being that it will internally turn into something like `for (@patterns) { if ( instr($incoming{'text'},$_) > -1 ) { .... } }` [download] Since instr() uses Fast Boyer Moore matching this formulation is likely to be extremely fast, regardless of the perl version being used. Its only when the number of patterns gets large, or the length of the string gets long that the cost of multiple FBM searches will outweigh the cost of a single regex search. (This is because FBM doesnt necessarily look at all the characters in the string being searched to find a match.) Aside: As a last point, I thought I'd shine a light on an interesting subtlety of how 5.9.4 and later will behave differently from how the "equivelent" pattern as produced by Regexp::Assemble or friends will behave. When perl does the trie optimisation it respects the order that a word appears in the alternation sequence. Regex patterns as produced by optimiser will not. Consider the sequence /a\|abc\|ab/, we expect this to try to match 'a' then 'abc' then 'ab'. The pattern produced by an optimiser will be /a(?:bc?)?/ which will match 'abc' then 'ab' then 'a'. So not only will 5.9.4 perform better, it will also match pretty much exactly as pre-5.9.4, with the one exception that dupes will only be matched once, so /abc\|bc\|abc\|cd\|abc/ will perform the same as /abc\|bc\|cd/. Note however that the order is preserved, just that following dupes are ignored. --- $world=~s/war/peace/g	[reply] [d/l] [select]
Re: Matching a long list of phrases by lima1 (Curate) on Aug 14, 2006 at 16:12 UTC
To your question about the efficency: Because you want to check for phrases, not only words, you have to do N (size of your phrases list) comparisons. So it does not scale very well. Alternatively, you could create a SQLite database with two tables: one with all words (generated by a script) and one with all phrases (maintained by you). the first contains "links" to all phrases containing the word. What you now do is a kind of seed and extend search strategy: you test all words in your text. If the word is part of multiple phrases, you test all these phrases. if not, you have a single-word-phrase match or - if the word is not found - no phrase match. Note that this is only fast in practice if the phrase list is huge in comparison to the text size and the majority of phrases consist of few words only.	[reply]
Re: Matching a long list of phrases by Bro. Doug (Monk) on Aug 14, 2006 at 17:03 UTC
I'd just use grep. For a simple example, this runs: </bf> `my @phrases = qw{ one two three } ; my $in = 'three' ; if ( grep ( /$in/, @phrases ) ){ ... }` [download] Using grep, you can store to a list with qw and match to it. This eliminates the need for the messy \| and other such delimiters. If I find myself using delimiters, I like to use non-keyboard delimeters ( chr(1); #for example ). Have fun. Bro. Doug The Inert.	[reply] [d/l]
Re: Matching a long list of phrases by bangers (Pilgrim) on Aug 14, 2006 at 17:20 UTC
A hash or constant would be eficient and reasonably easy to maintain `use constant PHRASE => { 'phrase one' => 1, 'phrase two' => 1, 'a different phrase' => 1, 'yet another match' => 1, 'one more to match' => 1, 'single' => 1, 'word' => 1, }; if ( PHRASE->{$incoming{'text'}} ) { #DO WHAT NEEDS TO BE DONE HERE }` [download]	[reply] [d/l]
Re^2: Matching a long list of phrases by Sartak (Hermit) on Aug 14, 2006 at 17:40 UTC
The problem with this approach is that the incoming text may have extra text. Say one of the phrases was "Perl hacker". The OP wants to match "Just another Perl hacker," with the phrase "Perl hacker"	[reply]
Re: Matching a long list of phrases by ryantate (Friar) on Aug 17, 2006 at 21:13 UTC
Most people (with the exception of artist) have forgotten to include "\b" tags on each side of the phrase within the match regex. Without these anchors, the examples of "single" and "word" in the original question will match things like "singleton" and "sword". Which does not seem to be intended, since tho goal is to "phrase match."	[reply]