Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

If I have a pattern \*test\* it matches the string *test*. If the pattern is \btest\b it matches 'a *test* b'. But the pattern \b\*test\*\b does not match 'a *test* b'. I would have thought it would match. Shouldn't the \b\* match the leading ' *', test match test, and the trailing \*\b match the trailing '* '?

Re: regex question
by davido (Cardinal) on Sep 24, 2003 at 18:40 UTC
    The issue is white space. If the string is as follows:

    my $string = "a *test* b";

    With the regexp /\b\*test\*\b/ you get a failure because the junction between a space character and an asterisk (\*) is not a word boundary: neither the space nor the * is a word character (\w). You need to rearrange the regexp in one of a couple of ways, depending on what you're trying to accomplish:

    m/\b\s\*test\*\s\b/; or m/\*\btest\b\*/;
    However, the second example doesn't need the \b, because it is implicit in the fact that you're already specifying non-word characters (*) that must precede and follow the literal word characters of 'test'.

    The first example (probably what you really should be using) says, match where a word has just ended, followed by a space, followed by *test*, followed by a space, followed by a word beginning. That's what the \b is doing for you. Of course \b is a zero width assertion, so you haven't actually matched what comes before the first \b or what comes after the second \b, you're just asserting that there must be a word there.
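    To see the difference between the patterns discussed so far, here is a minimal sketch that runs each of them against the sample string. The helper name try_match and the labels are my own, purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the matched text, or undef if $re does not match $s.
sub try_match {
    my ( $re, $s ) = @_;
    return $s =~ $re ? $& : undef;
}

my $string = "a *test* b";

# Label => pattern pairs; the labels are illustrative only.
my @patterns = (
    [ 'original', qr/\b\*test\*\b/     ],
    [ 'with \s',  qr/\b\s\*test\*\s\b/ ],
    [ 'inner \b', qr/\*\btest\b\*/     ],
    [ 'no \b',    qr/\*test\*/         ],
);

for my $p (@patterns) {
    my ( $label, $re ) = @$p;
    my $hit = try_match( $re, $string );
    print "$label: ", ( defined $hit ? "matched '$hit'" : "no match" ), "\n";
}
```

    The original pattern reports no match. The \s version matches ' *test* ' (spaces included, since \s consumes them), and the last two both match '*test*', which is why the inner \b is redundant.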

    Note, I used \s to indicate space, but \s matches any whitespace character (tabs and newlines too). If you want only an actual space (ASCII 32), put a literal space in your regexp, or spell it out with its ASCII value (\x20). I used \s so that the whitespace would be clearly visible when reading the regexp.
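    A small sketch of the \s versus literal-space distinction, assuming a second sample string that uses tabs instead of spaces (that string and the helper name is_match are mine, not from the original question):

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $re matches $s, 0 otherwise.
sub is_match {
    my ( $re, $s ) = @_;
    return $s =~ $re ? 1 : 0;
}

my %samples = (
    space => "a *test* b",
    tab   => "a\t*test*\tb",    # assumed variant, for illustration
);

my $any_ws  = qr/\b\s\*test\*\s\b/;        # any whitespace character
my $literal = qr/\b \*test\* \b/;          # a literal space only
my $hex     = qr/\b\x20\*test\*\x20\b/;    # space spelled as ASCII 32

for my $name ( sort keys %samples ) {
    my $s = $samples{$name};
    printf "%-5s  \\s=%d  literal=%d  \\x20=%d\n",
        $name, is_match( $any_ws, $s ), is_match( $literal, $s ), is_match( $hex, $s );
}
```

    All three patterns match the space-separated string, but only the \s version matches the tab-separated one.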

    Hope this helps...

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      Thanks. The problem is that I really want to match *test* when it is on a word boundary. So, I want to match /*test*/ for example, or --*test--. That's why I thought \b\*test\*\b would do the trick. I thought that \b should stop matching if it finds the end of a word boundary or a character that is explicitly specified. So, since * is part of a word boundary, but is also the next exact match in the regex, I thought \b would stop matching. I guess \b sucks in anything that matches, even if it is explicitly specified as something to be matched.
        Now you've lost me. You earlier said that you wanted your regexp to match the literal string '*test*'. And you provided a regexp with /\b\*test\*\b/, thus spelling out the absolute need for an asterisk to precede and follow the word test, in order for the match to occur.

        But now you've said that you want to match both '*test*' and '--*test--'. What made you think that '--*test--' would match against a regexp that specifies '\*test\*'?

        Also, \b is a zero width assertion that specifies that there must be a word boundary at that particular position. A word boundary is the point where 'word characters' and 'non-word characters' meet. There is no word character on either side of ' *test* ' at the positions where your original regexp places its boundary assertions, and that's why your regexp fails. In your original question, you went looking for a word boundary at the junction between a space character and an asterisk. That's not a word boundary. A word boundary is, again, a "zero width assertion". '*' is not part of a word boundary. '*' is, if next to a word character, the non-word character that creates a word boundary in the zero-width space between the word character and the asterisk. But word boundaries don't have parts; they don't consume a character. \b doesn't suck anything in.
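        One way to make this concrete is to list every offset in the string at which \b succeeds. This is only a sketch; boundary_offsets is a helper name I made up, and it tests each offset separately rather than doing anything clever:

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets in $s at which \b holds.
# For each offset $i, skip exactly $i characters from the start of
# the string and then assert a word boundary at that position.
sub boundary_offsets {
    my ($s) = @_;
    my @offsets;
    for my $i ( 0 .. length($s) ) {
        push @offsets, $i if $s =~ /^.{$i}\b/s;
    }
    return @offsets;
}

my $string  = "a *test* b";
my @offsets = boundary_offsets($string);
print "\\b holds at offsets: @offsets\n";    # 0 1 3 7 9 10
```

        Offsets 2 (between the space and the first *) and 8 (between the second * and the space) are missing from the list, and those are exactly the two places where /\b\*test\*\b/ would need a boundary in order to match.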

        Perhaps what you are saying is that you want 'test' to match as long as it is surrounded by word boundaries. That's easy. However, the following example will also match 'test' at the beginning of the string even if nothing comes before it, because the beginning of the string can be a word boundary too:

        $string = '--*test--';
        if ( $string =~ /\btest\b/ ) {
            print "$string matched.\n";
        }
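        As a rough check of what /\btest\b/ accepts and rejects, here is a sketch over a few sample strings. Apart from '--*test--' the samples are my own, and has_word_test is a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if 'test' appears as a whole word in $s, else 0.
sub has_word_test {
    my ($s) = @_;
    return $s =~ /\btest\b/ ? 1 : 0;
}

# '--*test--' is from the thread; the rest are assumed examples.
for my $s ( '--*test--', 'test--', 'test', 'testing', 'attest' ) {
    printf "%-10s %s\n", $s, has_word_test($s) ? 'matched' : 'no match';
}
```

        The first three match, including the two where 'test' starts the string. 'testing' and 'attest' do not, because a word character sits directly next to 'test' and so there is no boundary on that side.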

        If you want to match both the word test and the non-word characters that are required to precede and follow it, that's also easy:

        my $string = "--*test--"; if ( $string =~ /\W+test\W+/ ) .....

        Here there's really no need for the \b, because a word boundary is implicit in the fact that you've said that one or more non-word characters must precede and follow the word 'test'.
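        A short sketch comparing /\W+test\W+/ with and without an explicit \b. The helper name matched_text and the 'test--' sample are assumptions of mine, added only to show the case where no non-word character comes first:

```perl
use strict;
use warnings;

# Hypothetical helper: return the matched text, or undef on failure.
sub matched_text {
    my ( $re, $s ) = @_;
    return $s =~ $re ? $& : undef;
}

my $plain  = qr/\W+test\W+/;        # as in the example above
my $with_b = qr/\W+\btest\b\W+/;    # same pattern with redundant \b

for my $s ( '--*test--', 'a *test* b', 'test--' ) {
    my $p = matched_text( $plain,  $s );
    my $b = matched_text( $with_b, $s );
    printf "%-12s plain=%s  with_b=%s\n",
        "'$s'",
        ( defined $p ? "'$p'" : 'no match' ),
        ( defined $b ? "'$b'" : 'no match' );
}
```

        Both patterns give the same result on every string, which is the sense in which \b is implicit. Note that, unlike /\btest\b/, this pattern fails on 'test--', because \W+ insists on at least one real non-word character before 'test'.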

        I'm still a little foggy on what you're saying in your followup question; it redefines the problem to a degree, and actually has unresolvable conflicts within its own assertions.

        I really think that you would benefit from having a look at the appropriate perldocs: perlrequick, perlretut, perlre, and the FAQ on Regular Expressions, perlfaq6. If you have Perl, you have those documents. I know it looks like a lot of reading, but the time I've taken in trying to compose a conscientious answer to your question is about equal to the time it would take you to read a couple of those documents in their entirety yourself. You can appreciate my frustration when, after I put together a thorough and complete answer yesterday, your followup question today changes everything and is still ambiguous, conflicted, and vague. Why did I bother in the first place if you're not going to do a little homework yourself?

        Dave

        "If I had my life to do over again, I'd be a plumber." -- Albert Einstein