newbio has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

My question is about a faster way to do phrase matching in a sentence. The usual way would be:

    $sentence = "Jack and Jill went up the hill to fetch a pail of water.";
    $sentence =~ s/pail of water/\#pail of water\#/g;   # tag the phrase 'pail of water' in the sentence

However, this method becomes too slow when I have a long list of phrases and many sentences. If I had single words instead of phrases, I could use hash lookups on the 'split' sentence words, such as:

    my @temp = split(' ', $sentence);
    my %hash = ("pail of water" => 1);
    foreach my $i (@temp) {
        if ($hash{$i}) {
            # ---
        }
    }

However, this method cannot be applied to phrases. Can something similar/faster be used for phrase tagging of sentences?

On a related note, how can I club the words of the marked phrase in the sentence into a single unit (i.e. the phrase appears in a single array element when the sentence is split on the space character)?

Thanks.

Replies are listed 'Best First'.
Re: phrase marking
by kyle (Abbot) on Sep 09, 2008 at 16:42 UTC

    Please see Writeup Formatting Tips (You should not have <code> tags around your whole node.)

    With many phrases, I'd probably do something like this:

        my @phrases = ( 'pail of water', 'pale horse' );
        my $phrases_re = join '|', map { quotemeta } @phrases;
        foreach my $sentence ( @sentence_source ) {
            $sentence =~ s/($phrases_re)/\#$1\#/g;
        }

    Thanks to JadeNB for pointing out that I'd swapped "pale" and "pail".

      Nice solution, Monks. Thanks a lot. Hi Kyle, I am trying to understand the procedure of your method. I observe that the phrases that get marked in the sentence depend on their order in the array 'phrases'. The sentence is scanned from left to right and wherever it finds a match 'first' in the array order of phrases it selects that phrase. So, in case of an overlap, the phrase that appears first in the 'phrases' array is given the priority. Please correct me if I am wrong. Thanks.

        Yes, I think that's right. If you have overlapping phrases to mark, you'd have to figure out how you want to deal with those, and then you'd probably have to use a different solution. If you have some phrases that are preferred over others, you can order them before building the expression.
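
        As one illustration of ordering the phrases before building the expression, here is a minimal sketch that assumes the preference is "longer phrase wins". The helper name build_phrase_re and the sample phrases are hypothetical, not from the posts above; any other sort criterion could be substituted.

```perl
use strict;
use warnings;

# Hypothetical helper: build one alternation with longer phrases first, so that
# when two phrases could match at the same position the longer one is tried first.
sub build_phrase_re {
    my @phrases     = @_;
    my @ordered     = sort { length($b) <=> length($a) } @phrases;
    my $alternation = join '|', map { quotemeta } @ordered;
    return qr/($alternation)/;    # compiled once, reusable for every sentence
}

# 'pail' is listed first, but 'pail of water' is tried first after sorting.
my $re = build_phrase_re( 'pail', 'pail of water' );
( my $tagged = 'to fetch a pail of water.' ) =~ s/$re/\#$1\#/g;
print "$tagged\n";
```

        Without the sort, the shorter 'pail' would be tried first at that position and only that word would be tagged.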

Re: phrase marking
by JadeNB (Chaplain) on Sep 09, 2008 at 17:05 UTC
    For your first question, you could avoid the regex match entirely by using substr and index, writing
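        pos($sentence) = 0;    # start scanning at the beginning
        CHUNK: while ( ( my $pos = pos $sentence ) < length $sentence ) {
            for my $phrase ( @phrases ) {
                # index returns -1 if $phrase does not occur at or after $pos
                if ( ( my $index = index( $sentence, $phrase, $pos ) ) >= 0 ) {
                    my $length = length $phrase;
                    substr( $sentence, $index + $length, 0, '#' );    # closing marker first, so $index stays valid
                    substr( $sentence, $index, 0, '#' );              # then the opening marker
                    pos($sentence) = $index + $length + 2;            # skip the phrase and both markers
                    next CHUNK;
                }
            }
            last CHUNK;    # no phrase found after $pos
        }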
    For the second question, split has no way of knowing what you want the stuff between the separators to be; but you could make the phrases themselves the 'separators', and capture them:
        my @split = split /($phrase1|$phrase2)/, $sentence;
        @phrases = @split[map { 2 * $_ + 1 } 0 .. ($#split - 1)/2];
    This works because leading empty fields are preserved, so the separators are in the odd-indexed fields. I'm sure there's a better way to get just the odd-indexed fields, though.
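
    One possible way to pick out just the odd-indexed fields is to grep over the indices. This is only a sketch: the sample sentence and phrase list are taken from elsewhere in the thread for illustration, and the variable names are mine.

```perl
use strict;
use warnings;

my $sentence = 'Jack and Jill went up the hill to fetch a pail of water.';
my @phrases  = ( 'up the hill', 'pail of water' );    # sample data for illustration

# The capturing parentheses make split return the separators as well.
my $alternation = join '|', map { quotemeta } @phrases;
my @fields      = split /($alternation)/, $sentence;

# Keep only the odd-indexed fields, i.e. the captured phrases.
my @found = @fields[ grep { $_ % 2 } 0 .. $#fields ];

print join( ', ', @found ), "\n";
```

    When no phrase matches, @fields holds a single element and @found is simply empty.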

    UPDATE: Fixed my code to put # after as well as before. Also, I misunderstood your original question to mean that the regex matching itself was too slow, rather than that making up a bunch of regexes was too slow. kyle's solution is probably faster.

Re: phrase marking
by toolic (Bishop) on Sep 09, 2008 at 17:19 UTC
    how to club the words of the marked phrase in the sentence as a single unit (i.e. the phrase appears in a single array element when the split is done on the sentence on the space character).
    If you don't mind temporarily mangling your marked sentence (and don't already have quotes in your sentence), you could use Text::ParseWords.
        use strict;
        use warnings;
        use Text::ParseWords;
        use Data::Dumper;

        my $marked = 'Jack and Jill went #up the hill# to fetch a #pail of water#.';
        $marked =~ s/#/"/g;
        my @words = shellwords($marked);
        print Dumper(\@words);

        __END__
        $VAR1 = [
                  'Jack',
                  'and',
                  'Jill',
                  'went',
                  'up the hill',
                  'to',
                  'fetch',
                  'a',
                  'pail of water.'
                ];
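
    If the sentences may already contain quotes, a single global match is a possible alternative to Text::ParseWords. The sketch below is not toolic's method; the helper name split_marked is hypothetical, and it assumes '#' never appears in a sentence except as a phrase marker.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words of a '#'-marked sentence, keeping each
# marked phrase together as one element.
sub split_marked {
    my ($marked) = @_;

    # Each match fills exactly one of the two capture groups:
    # group 1 is a marked phrase, group 2 is an ordinary word.
    # The other group comes back as undef, so filter those out.
    return grep { defined } $marked =~ /#([^#]*)#|([^\s#]+)/g;
}

my @words = split_marked('Jack and Jill went #up the hill# to fetch a #pail of water#.');
print join( '|', @words ), "\n";
```

    Unlike the shellwords version, punctuation that directly follows a closing marker (the final '.' here) ends up in its own element.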