TheDoc has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that is supposed to check a list of words and phrases against a sentence to see if any of the items match. It only accepts the first match. My problem is that if I break the sentence up into individual words, the phrases will never match, but the way it is now it finds things like "hi" inside the word "whiny". Is there any way to make it catch something like "aww yeah" in the sentence "we all went aww yeah to that" but not the "hi" in "whiny"?

Re: word matching
by jonadab (Parson) on Oct 03, 2003 at 20:21 UTC

    Yeah, use word boundary assertions: put \b wherever a word boundary should occur.
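Since the original script was not posted, the following is only a sketch of how that advice might be applied to a list of words and phrases. The sub name `first_match` and the sample data are made up for illustration, and "first match" is assumed to mean the first item in list order. Each phrase is wrapped in \b on both sides, and \Q...\E escapes any regex metacharacters in it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first entry of @$phrases that occurs in
# $sentence as a whole word or phrase, or undef if none does.
sub first_match {
    my ($sentence, $phrases) = @_;
    for my $phrase (@$phrases) {
        # \Q...\E quotes metacharacters in the phrase;
        # \b on each side keeps "hi" from matching inside "whiny".
        return $phrase if $sentence =~ /\b\Q$phrase\E\b/;
    }
    return;
}

my @phrases = ('hi', 'aww yeah');
for my $sentence ('we all went aww yeah to that', 'do not be so whiny') {
    my $hit = first_match($sentence, \@phrases);
    print "$sentence => ", (defined $hit ? $hit : 'no match'), "\n";
}
```

Note that \b only behaves this way when the phrase starts and ends with a word character; an item such as "yeah!" would need different handling at its trailing edge.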


Re: word matching
by dragonchild (Archbishop) on Oct 03, 2003 at 20:34 UTC
    Posting the offending snippet is often very helpful. That way, we can point to specific issues and offer specific advice. Often, monks will post snippets back to you that you can cut'n'paste into your code.


Re: word matching
by davido (Cardinal) on Oct 04, 2003 at 06:16 UTC
    To elaborate on the "word boundary" discussion:

    Your problem seems to be that "to" matches "to" and "totally" and "tonight", etc., when you really only want it to match "to".

    The key is to tell the regexp engine that you want to match "to" only when it occurs on a word boundary. In fact, you want a word boundary on both sides of it.

    Word boundaries are zero-width assertions. They don't 'match' any characters themselves, but they allow a match to succeed only if the assertion's criterion is met at that position.

    The criterion for the '\b' assertion is that there be a transition between a \w (word) character and a \W (nonword) character. The start and end of the string count as the nonword side.
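One way to see where those zero-width positions fall is to mark each of them with a visible character. This is just an illustrative sketch; the sub name `mark_boundaries` is made up here.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $text with "|" inserted at every
# position where \b matches. Nothing is consumed, because \b is zero-width.
sub mark_boundaries {
    my ($text) = @_;
    (my $marked = $text) =~ s/\b/|/g;
    return $marked;
}

my $marked = mark_boundaries("hi, whiny");
print "$marked\n";    # prints: |hi|, |whiny|
```

There is a boundary on each side of the standalone "hi", but none around the "hi" inside "whiny", which is why /\bhi\b/ skips it.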

    Here is an example:

    my $string = "Tonight too many to's were used to explain.";
    # Capture every "to" that has a word boundary on both sides.
    my @matches = $string =~ /\b(to)\b/g;
    print "$_\n" foreach @matches;
    With the \b assertions you match only twice: once on the "to" that is part of "to's" (the apostrophe is a \W character) and once on the standalone word "to".

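    Without the \b assertions, you would match three times: "to"o, "to"'s, and "to". The "To" in "Tonight" is skipped only because the match is case-sensitive; with the /i modifier it would match as well, for a total of four.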
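The counts can be checked directly with a small sketch like the one below. The helper name `count_matches` is made up for illustration. Note that a case-sensitive /to/ does not match the capitalized "To" in "Tonight".

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times $regex matches in $string.
sub count_matches {
    my ($string, $regex) = @_;
    # Assigning the match list to () in scalar context yields its length.
    my $count = () = $string =~ /$regex/g;
    return $count;
}

my $string = "Tonight too many to's were used to explain.";
printf "with \\b:        %d\n", count_matches($string, qr/\bto\b/);   # 2
printf "without \\b:     %d\n", count_matches($string, qr/to/);       # 3
printf "without \\b, /i: %d\n", count_matches($string, qr/to/i);      # 4
```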


    Dave


      Thanks! The word boundary stuff did the trick. Sorry about not posting the original code; I'll remember next time I need help.