bluethundr has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am in the Llama chapter 8 and I am learning about quantifiers. I am a novice. I am trying to answer the extra credit exercise #4 in chapter 8.

The question states:

Modify the program from the previous exercise so that immediately following the word ending in 'a' it will capture up to five characters (if there are that many characters) in a separate memory variable. Update the code to display both memory variables. For example, if the input string says "I saw Wilma yesterday", the up to five characters are _yest. If the input is "I, Wilma!", the extra memory should have one character. Does your pattern still match just plain wilma?

Here is my answer:

--------------

#!/usr/bin/perl -w
use strict;

while (<>) {
    chomp;
    if (/(\w+a\b)(\w{0,5})/) {
        print "Matched: |$`<'$1'>'$2'\n";
    } else {
        print "No match: |$_|\n";
    }
}
-------------

Here is my sample text that I am working with:

This is a line that has the name fred in it.
This is a line that does not.
This is a line that has the name barney in it.
This is fred.
This is notfred.
Hello Fred Flintstone!
Hello Fred!
Allo allo, Alfred!
Cheerio Mr. frederick!
Hello Mr. Slate!
Hello FRED!
Hello Fred!
Hello fred!
This is some wilma text with some words in it this is fred.
This is wilmafred.
This is some more wilma text fred.
Mrs. Wilma Flintstone
wilma&fred
wilma but not barney


What I expect to be in the memory variable $2 always turns up empty. Help a brutha out?

Replies are listed 'Best First'.
Re: Contents of $2 empty
by ww (Archbishop) on Dec 22, 2008 at 14:51 UTC
    Here are a few general guidelines which you may find useful (followed by code which incorporates them):
    • KISS! Be specific when you can, to avoid false positives. In this case, using a character class to find "Wilma" | "wilma" lends itself to all manner of false positives... for example "Selma."
    • Be wary of \b and friends: they won't do what you expect unless you are very clear about the meanings of \w and \W. ikegami explains your specific problem in another reply in this thread.
    • Do learn to use regex modifiers such as /.../i to help you simplify your regexen.

    So, using your data (and adding the "... saw Wilma yesterday" part of the exercise statement):

    while (<DATA>) {
        chomp;
        if ( /(Wilma)([\w\W]{5})/i ) {
            print "Matched: |<$1> <$2>\n";
        } else {
            print "No match: |$_|\n";
        }
    }
    __DATA__
    This is a line that has the name fred in it.
    This is a line that does not.
    This is a line that has the name barney in it.
    This is fred.
    This is notfred.
    Hello Fred Flintstone!
    Hello Fred!
    Allo allo, Alfred!
    Cheerio Mr. frederick!
    Hello Mr. Slate!
    Hello FRED!
    Hello Fred!
    Hello fred!
    This is some wilma text with some words in it this is fred.
    This is wilmafred.
    This is some more wilma text fred.
    Mrs. Wilma Flintstone
    wilma&fred
    wilma but not barney
    I saw Wilma yesterday.

    Output:

    perl 732070.pl
    No match: |This is a line that has the name fred in it.|
    No match: |This is a line that does not.|
    No match: |This is a line that has the name barney in it.|
    No match: |This is fred.|
    No match: |This is notfred.|
    No match: |Hello Fred Flintstone!|
    No match: |Hello Fred!|
    No match: |Allo allo, Alfred!|
    No match: |Cheerio Mr. frederick!|
    No match: |Hello Mr. Slate!|
    No match: |Hello FRED!|
    No match: |Hello Fred!|
    No match: |Hello fred!|
    Matched: |<wilma> < text>
    Matched: |<wilma> <fred.>
    Matched: |<wilma> < text>
    Matched: |<Wilma> < Flin>
    Matched: |<wilma> <&fred>
    Matched: |<wilma> < but >
    Matched: |<Wilma> < yest>

    Caveat: Neither my Llama nor my Camel edition has the exercise you describe, so this may miss a requirement that [Ww]ilma be followed by a space or by a punctuation mark or symbol, which would substantially change the regex required.

    Update: ...and, oh yes, I've simplified your print "Matched.... as well, again in the pursuit of KISS.
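
One point worth checking against the exercise wording: {5} asks for exactly five characters, while the exercise says "up to five". A minimal sketch of the difference, assuming the tail is wanted even when fewer than five characters follow the name (tail_after_wilma is a hypothetical helper, not part of the code above):

```perl
use strict;
use warnings;

# Hypothetical helper: return the text captured after "wilma",
# or undef when the pattern does not match at all.
sub tail_after_wilma {
    my ($re, $text) = @_;
    if ($text =~ $re) {
        return $2;
    }
    return undef;
}

my $exactly_five = qr/(Wilma)([\w\W]{5})/i;     # the quantifier used above
my $up_to_five   = qr/(Wilma)([\w\W]{0,5})/i;   # assumed reading of "up to five"

for my $text ('I saw Wilma yesterday', 'I, Wilma!', 'wilma') {
    for my $case ([ '{5}', $exactly_five ], [ '{0,5}', $up_to_five ]) {
        my ($label, $re) = @$case;
        my $tail = tail_after_wilma($re, $text);
        printf "%-6s %-24s %s\n", $label, "'$text'",
            defined $tail ? "<$tail>" : 'no match';
    }
}
```

With {5} the two shorter inputs should report no match, because fewer than five characters follow the name; with {0,5} they match with a one-character and an empty tail respectively. On the sample data above, both quantifiers capture the same text.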

Re: Contents of $2 empty
by ikegami (Patriarch) on Dec 22, 2008 at 14:06 UTC
    \b matches the spot between \w\W, \W\w, \A\w or \w\z. The only thing a\b\w{0,5} can possibly match is "a".

    the up to five characters are _yest

    You mean " yest"? The space doesn't match \w. You probably want "." instead of "\w".
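
A minimal sketch of the point above, run against the exercise's own example strings. The captures helper is hypothetical and only exists to make the two patterns easy to compare:

```perl
use strict;
use warnings;

# Hypothetical helper: return the list of captures ($1, $2),
# or an empty list when the pattern does not match.
sub captures {
    my ($re, $text) = @_;
    my @caps = $text =~ $re;
    return @caps;
}

my $original = qr/(\w+a\b)(\w{0,5})/;   # pattern from the question
my $with_dot = qr/(\w+a\b)(.{0,5})/;    # "." in place of "\w"

for my $text ('I saw Wilma yesterday', 'I, Wilma!', 'wilma') {
    for my $case ([ 'original', $original ], [ 'with dot', $with_dot ]) {
        my ($label, $re) = @$case;
        my ($word, $tail) = captures($re, $text);
        # After \b the next character (if any) is a non-word character,
        # so \w{0,5} can only ever match zero characters there.
        printf "%-9s %-24s <%s> <%s>\n", $label, "'$text'", $word, $tail;
    }
}
```

With the original pattern the second capture is empty for every input; with "." it picks up " yest" and "!" for the first two strings and stays empty for plain "wilma", which still matches.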

Re: Contents of $2 empty
by jeanluca (Deacon) on Dec 22, 2008 at 14:19 UTC
    Just one more little thing. You're not really interested in capturing the 'a' so you could change your regexp into
    /(?:a\b)(.{0,5})/
    /(?<=a\b)(.{0,5})/

    Cheers
    LuCa
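
A short sketch of how those two variants behave, assuming only the trailing characters are wanted. Note that the tail then lands in $1 rather than $2, and the word itself is no longer remembered, so a solution that must display both memory variables would still need the capturing group. The tail_only helper is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: return the single capture, or undef on no match.
sub tail_only {
    my ($re, $text) = @_;
    my ($tail) = $text =~ $re;
    return $tail;
}

# Non-capturing group: the "a" is part of the overall match but is not remembered.
my $non_capturing = qr/(?:a\b)(.{0,5})/;

# Lookbehind: the "a" is only asserted and is not part of the overall match.
my $lookbehind = qr/(?<=a\b)(.{0,5})/;

for my $text ('I saw Wilma yesterday', 'I, Wilma!') {
    for my $case ([ 'non-capturing', $non_capturing ], [ 'lookbehind', $lookbehind ]) {
        my ($label, $re) = @$case;
        my $tail = tail_only($re, $text);
        printf "%-14s %-24s %s\n", $label, "'$text'",
            defined $tail ? "<$tail>" : 'no match';
    }
}
```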