Intrepid has asked for the wisdom of the Perl Monks concerning the following question:

Y.A.R.M. is "Yet Another Regex Mystery"

Over the years of reading PerlMonks postings I have seen that nearly every day someone posts a regex question, and now it's finally my turn :-).

The source of my trouble with this could just as easily be heatstroke or childhood pesticide exposure, as any inherent obscurity to the problem, but, whatever the cause is, I cannot at the present moment figure this out without asking for some assistance.

The problem is this (in verbal description -- I've seen so many badly-asked regex questions, I hope I do better!): a string of arbitrary length comprising multiple sentences with (possible) line breaks (\n) has (possibly) some rudimentary mark-up in the form often used for various sorts of emphasis in e-mail and USENET postings:

That *doggone foolish Mabel* has toasted the _bread too long_ again.

The content within the * and _ characters is multiple words, and I need to somehow tokenize the span of text inside so that I can make *each* *word* surrounded by the appropriate character:

That *doggone* *foolish* *Mabel* has toasted the _bread_ _too_ _long_ again.

Now to the Mystery part: the regex I have come up with only matches when the "markup" character used is "_" (underscore, which I'll note is not a Perl-type regex metacharacter, but instead a simple word character matched by \w), not when it is "*"! WHY? This one-liner illustrates the problem and contains my regex:

perl -e '$gh = join qq[],(<STDIN>); if ($gh =~ m@(\b(\*|_)\S+\b)(.+?)(\b\S+\2\b)@s) {print join q[ ],$1,$3,$4,q[ ];}'
Happy _puppy life good_ yeah.
(any line breaks inside the perl command above must be removed for testing as a "one-liner", obv.; the second line is the input typed on STDIN.)

The output I get is this:

_puppy   life   good_

But if I use "*" instead, I get no output.

What is going on here? (I am testing in bash on Cygwin, the UNI* emulation environment for Win32.)

Thanks.     Soren

Updated:

12 Jan 2004 - just removed old crufty markup I used to use in PM posts to adjust the font size.

Replies are listed 'Best First'.
Re: Y.A.R.M.
by John M. Dlugosz (Monsignor) on Aug 16, 2001 at 08:39 UTC
    It has to do with \w matching _ but not *.

    After all, \b will have a non-word character on the left (the space), so the next character (* or _) has to be a word character, or the \b assertion won't match.

    —John
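
The point about \b can be checked directly. What follows is a minimal sketch, not code from the thread: opens_at_boundary is a hypothetical helper name, and the two test strings are the question's input with each marker in turn.

```perl
use strict;
use warnings;

# Return 1 if some marker in $text sits right after whitespace AND has a
# \b word boundary in front of it, 0 otherwise.
# (opens_at_boundary is a made-up name used only for this illustration.)
sub opens_at_boundary {
    my ($text) = @_;
    return $text =~ /\s\b(\*|_)/ ? 1 : 0;
}

for my $text ('Happy _puppy life good_ yeah.', 'Happy *puppy life good* yeah.') {
    printf "%-32s %s\n", $text,
        opens_at_boundary($text) ? '\b matches' : '\b fails';
}
```

The underscore string reports a match and the asterisk string does not: between a space and "_" there is a non-word/word transition, while a space followed by "*" is two non-word characters in a row, so no boundary exists there.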

Re: Y.A.R.M.
by tachyon (Chancellor) on Aug 16, 2001 at 10:42 UTC

    \b is true at a word boundary whereas \B is true when not at a word boundary. Here is an example that does as you want and demonstrates the difference.

    $_ = '*that old time* music *kinda soothes* my soul';
    s{                          # sub
        (                       # capture in $1
            (\B[^\w\s]\b)       # capture token in $2 at boundary
            [^\2]+?             # 1 or more non tokens (minimum of)
            \b\2\B              # matching token at boundary
        )                       # close capture $1
     }                          # close the first part of the regex
                                # now generate the substitution pattern
     {
        $fnd = $1;              # store what we have found
        $tok = $2;              # this is our token to add
        $fnd =~ s/              # now modify $fnd
                    (\w)        # find a letter -> $1
                    (\s+)       # space(s) -> $2
                    (\w)        # another letter-> $3
                                # now add the tokens we want
                 /$1$tok$2$tok$3/xg;
        $fnd                    # $fnd will now get substituted into our string
     }gex;
    print;

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
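
For comparison, here is a shorter sketch of the same "wrap each word" substitution. It is an alternative under stated assumptions, not the code posted above: wrap_each_word is a hypothetical name, the markers are limited to * and _, and a span is assumed to start after whitespace (or at the start of the string) and to have no whitespace just inside its markers. It uses a lazy .*? rather than a negated class, because inside a character class \2 is an octal escape, not a backreference.

```perl
use strict;
use warnings;

# Wrap every word inside a *...* or _..._ span with the same marker.
# (wrap_each_word is a made-up helper name for this illustration.)
sub wrap_each_word {
    my ($text) = @_;
    $text =~ s{
        (?<!\S)          # marker starts a token: start of string or after whitespace
        ([*_])           # opening marker, capture group 1
        (\S(?:.*?\S)?)   # span content without outer whitespace, capture group 2
        \1               # the same marker closes the span
        (?!\w)           # closing marker does not run into a word character
    }{
        my ($mark, $span) = ($1, $2);
        join ' ', map "$mark$_$mark", split ' ', $span;
    }gexs;
    return $text;
}

print wrap_each_word('That *doggone foolish Mabel* has toasted the _bread too long_ again.'), "\n";
```

Splitting the captured span on whitespace and re-joining sidesteps the overlapping-match question that arises when pairs of neighbouring letters are rewritten in place, at the cost of collapsing any run of whitespace inside a span to a single space.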

Re: Y.A.R.M.
by htoug (Deacon) on Aug 16, 2001 at 10:14 UTC
    The problem lies in the "\b" at the start and end of your regex.

    See:

    $_ = "ab_"; print "\\b matches\n" if m!ab_\b!;
    $_ = "ab*"; print "\\b matches\n" if m!ab\*\b!;

    Only the first matches.

    So if you change your regex to:

    m@((\*|_)\S+\b)(.+?)(\b\S+\2)@s
    it seems to work.
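
A quick way to try the modified regex against both markers is sketched below. match_span is a hypothetical helper name; it returns the same three captures ($1, $3, $4) that the one-liner in the question prints.

```perl
use strict;
use warnings;

# Apply the regex with the outer \b assertions removed and return the three
# pieces the question's one-liner printed, or an empty list if nothing matches.
# (match_span is a made-up name used only for this illustration.)
sub match_span {
    my ($text) = @_;
    return $text =~ m@((\*|_)\S+\b)(.+?)(\b\S+\2)@s ? ($1, $3, $4) : ();
}

for my $text ("Happy _puppy life good_ yeah.\n", "Happy *puppy life good* yeah.\n") {
    my @parts = match_span($text);
    print @parts ? join('|', @parts) : 'no match', "\n";
}
```

Both inputs now yield three pieces, with the separators showing that the middle capture keeps its surrounding spaces.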
Re: Y.A.R.M.
by physi (Friar) on Aug 16, 2001 at 10:31 UTC
    Because I don't know anything about \b in a regexp, I came up with this:
    perl -e '$_="*I do not know* any _perl_ish sentence";while (/(\*|_)(.*?)(\1)/sg){print join q[ ],$1,$2,$3}'
    -----------------------------------
    --the good, the bad and the physi--
    -----------------------------------
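
The boundary-free approach above can also be written as a small function that collects every marked span. This is only a sketch: marked_spans is a hypothetical name, and it assumes, as the one-liner does, that a span runs from a marker to the next occurrence of the same marker.

```perl
use strict;
use warnings;

# Collect [marker, content] pairs for every *...* or _..._ span in $text.
# (marked_spans is a made-up name used only for this illustration.)
sub marked_spans {
    my ($text) = @_;
    my @spans;
    # In scalar context with /g, each loop iteration resumes after the previous match.
    while ($text =~ /(\*|_)(.*?)\1/sg) {
        push @spans, [$1, $2];
    }
    return @spans;
}

for my $span (marked_spans('*I do not know* any _perl_ish sentence')) {
    my ($mark, $content) = @$span;
    print "$mark: $content\n";
}
```

Without any boundary check, the underscore inside "_perl_ish" is taken as a closing marker, so mid-word underscores such as those in identifiers would need extra handling.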