Y.A.R.M. is "Yet Another Regex Mystery"

Over the years of reading PerlMonks postings I have seen that nearly every day someone posts a regex question, and now it's finally my turn :-).

The source of my trouble with this could just as easily be heatstroke or childhood pesticide exposure as any inherent obscurity in the problem, but whatever the cause, I cannot at the present moment figure this out without asking for some assistance.

The problem is this (in verbal description -- I've seen so many badly-asked regex questions, I hope I do better!): a string of arbitrary length comprising multiple sentences with (possible) line breaks (\n) has (possibly) some rudimentary mark-up in the form often used for various sorts of emphasis in e-mail and USENET postings:

That *doggone foolish Mabel* has toasted the _bread too long_ again.

The content within the * and _ characters is multiple words, and I need to tokenize the span of text inside so that I can surround *each* *word* with the appropriate character:

That *doggone* *foolish* *Mabel* has toasted the _bread_ _too_ _long_ again.
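
For concreteness, the transformation described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution to the general problem: the helper name `spread_markup` is hypothetical, a marker is assumed to sit directly against whitespace (or the start or end of the string) on its outer side, and any line break inside a marked span is collapsed to a single space.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every word of a marked span in its own marker pair.
sub spread_markup {
    my ($text) = @_;
    $text =~ s{
        (?<!\S)            # opening marker follows whitespace or start of string
        ([*_])             # capture 1: the marker character
        (\S(?:.*?\S)?)     # capture 2: span content, non-space at both ends
        \1                 # the same marker closes the span
        (?!\S)             # closing marker precedes whitespace or end of string
    }{
        # Copy the captures before split/map run, then rebuild word by word.
        my ($mark, $body) = ($1, $2);
        join ' ', map { "$mark$_$mark" } split ' ', $body;
    }gsex;
    return $text;
}

print spread_markup("That *doggone foolish Mabel* has toasted the _bread too long_ again."), "\n";
```

The lookarounds `(?<!\S)` and `(?!\S)` are used here instead of `\b` as a design choice, because they depend only on surrounding whitespace and not on whether the marker itself is a word character. A span whose closing marker is followed directly by punctuation would be left untouched by this sketch.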

Now to the Mystery part: the regex I have come up with only matches when the "markup" character used is "_" (underscore, which I'll note is not a Perl-type regex metacharacter, but instead a plain word character matched by \w), not when it is "*"! WHY? This one-liner illustrates the problem and contains my regex:

perl -e '$gh = join qq[],(<STDIN>); if ($gh =~ m@(\b(\*|_)\S+\b)(.+?)(\b\S+\2\b)@s) {print join q[ ],$1,$3,$4,q[ ];}'
Happy _puppy life good_ yeah.
(any line breaks inside the perl command above must be removed for testing as a "one-liner", obv.; the second line is the test input, typed on STDIN)

The output I get is this:

_puppy   life   good_
But if I use "*" instead, I get no output.
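
To take shell quoting and STDIN handling out of the picture, the same test can be run as a small script against both inputs. The sketch below reuses the regex from the one-liner unchanged; the sub names `probe` and `boundary_before` are hypothetical, and the second check is an addition for illustration, included on the assumption that the `\b` assertions are the part worth examining, since `_` is matched by `\w` and `*` is not.

```perl
use strict;
use warnings;

# The regex from the one-liner, unchanged.
my $re = qr@(\b(\*|_)\S+\b)(.+?)(\b\S+\2\b)@s;

# Hypothetical helper: report what the regex captures for one input string.
sub probe {
    my ($input) = @_;
    return $input =~ $re ? "matched: [$1] [$3] [$4]" : "no match";
}

# Hypothetical helper: does \b hold between a space and the given marker?
sub boundary_before {
    my ($mark) = @_;
    return " ${mark}x" =~ /\s\b\Q$mark\E/ ? 'yes' : 'no';
}

for my $input ("Happy _puppy life good_ yeah.\n", "Happy *puppy life good* yeah.\n") {
    chomp(my $shown = $input);
    print "$shown => ", probe($input), "\n";
}

for my $mark ('_', '*') {
    print "\\b between a space and '${mark}': ", boundary_before($mark), "\n";
}
```

If the two markers give different answers in the second loop, that would point at the word-boundary assertions rather than at the alternation `(\*|_)` or the backreference `\2`.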

What is going on here? (I am testing in bash on Cygwin, the UNI* emulation environment for Win32.)

Thanks.     Soren

Updated:

12 Jan 2004 - just removed old crufty markup I used to use in PM posts to adjust the font size.


In reply to Y.A.R.M. by Intrepid
