When parsing HTML, you are best off avoiding a regex, because HTML is harder to parse than it looks. For example:

<!-- > A really funky image. --> <img src="light.gif" alt=">>LIGHT<<" /> <!-- was: <img src="light.jpg" alt="<light>" /> --> This is some text.

Because > and < can appear in places other than as tag delimiters, HTML parsing is probably best left to HTML::TokeParser or HTML::Parser. For your case you might also want to look at HTML::TableExtract.
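As a rough sketch of what that might look like, here is HTML::TokeParser run over the sample markup above (line breaks added for readability). It assumes the HTML-Parser distribution is installed from CPAN; the variable names are my own, and collecting the alt attribute and the plain text is only an illustration, not necessarily what you need for your data.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN (assumed installed)

# The tricky sample from above, with line breaks added.
my $html = <<'END_HTML';
<!-- > A really funky image. -->
<img src="light.gif" alt=">>LIGHT<<" />
<!-- was: <img src="light.jpg" alt="<light>" /> -->
This is some text.
END_HTML

my $parser = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my @alts;
my $text = '';
while (my $token = $parser->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' && $token->[1] eq 'img') {
        # Start tag: $token->[2] is a hash ref of attributes.
        push @alts, $token->[2]{alt} if defined $token->[2]{alt};
    }
    elsif ($type eq 'T') {
        # Text outside of tags and comments.
        $text .= $token->[1];
    }
    # Comment tokens ('C') are simply skipped.
}

# Trim and collapse whitespace for display.
$text =~ s/^\s+|\s+$//g;
$text =~ s/\s+/ /g;

print "alt:  $_\n" for @alts;
print "text: $text\n";
```

The parser keeps track of quoting and comments itself, so the > and < inside the alt attribute and the commented-out tag do not confuse it.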

If you want to use your pattern, you can capture text using parentheses, which places the captured text in the $<digit> variables ($1, $2, ...), or returns it as the result of the match in list context.
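A minimal sketch of both ways of getting at the captured text. The input string and the pattern here are made up for illustration and are not your actual data:

```perl
use strict;
use warnings;

my $html = '<b>bold</b>';    # hypothetical input for illustration

# Boolean/scalar context: test the match, then read $1.
if ($html =~ />(\w*)</) {
    print "scalar context: $1\n";
}

# List context: the match returns the captured substrings directly.
my ($captured) = $html =~ />(\w*)</;
print "list context:   $captured\n";
```

Only use $1 after checking that the match succeeded; otherwise it may still hold a value from an earlier successful match.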

Note that your regex parses very differently from how you think it does. Here is the output of -MO=Deparse on it, modified to use m// instead of // so regexes stand out:

m/>\s+\w*</ | m/>\w*</ | m/>\w*</s + m//

I doubt this is the way you think it parses.

However, besides the fact that it does not compile with those delimiters, your regex needs work to match the way you document it as matching. A straightforward translation of your specification would be:

if (/>(\s*[[:alnum:]]*)</) {
    my $matched = $1;
    # ...
} else {
    # didn't match
}

(Note that \w does not match just alphanumerics (it includes _) so I did not use it there. I also suspect you defined what you want to match incorrectly. update: I also excluded the 0 or more spaces after the "<" because it will always find at least 0 spaces.)
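To make the \w point concrete, here is a small sketch comparing the two character classes. The input strings are hypothetical, chosen only to show the difference:

```perl
use strict;
use warnings;

my $cell = '<td>foo_bar</td>';    # hypothetical input containing an underscore

my ($word)  = $cell =~ />(\w*)</;             # \w accepts the underscore
my ($alnum) = $cell =~ />([[:alnum:]]*)</;    # [[:alnum:]] does not, so no match

print "\\w:          ", (defined $word  ? $word  : '(no match)'), "\n";
print "[[:alnum:]]: ", (defined $alnum ? $alnum : '(no match)'), "\n";

# The pattern above also keeps any leading whitespace in the capture.
my ($padded) = '<td>  abc123</td>' =~ />(\s*[[:alnum:]]*)</;
print "with leading whitespace: '$padded'\n";
```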

(update: minor rephrasing to make things make more sense.)


In reply to Re: Cropping the output of the pattern matcher by wog
in thread Cropping the output of the pattern matcher by jerrygarciuh
