First of all, if you're a novice you should put the following at the top of every script you write:

use strict; use warnings;

This catches many common mistakes, such as misspelled variable names, before they turn into hard-to-find bugs. Second, why aren't you using a module from CPAN to parse the HTML, such as HTML::TreeBuilder? You should never mess around with regular expressions on HTML. The original SGML specifications from which HTML is derived are pretty loose, which means that for every rule there are half a dozen exceptions (or more!) that most browsers will render even though they are a pain to parse. Using a real parser will not only make your code more robust, it will also make it much more intuitive to read, e.g.:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # parse() expects the HTML text itself, not a file name
    my $HTML_to_parse = shift(@ARGV);

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($HTML_to_parse);
    $tree->eof;

    my @paragraph_tags = $tree->look_down('_tag', 'p');
    foreach my $p (@paragraph_tags) {
        # note that this variable will "hide" the other
        # copy of @paragraph_tags and be garbage collected
        # as soon as it goes out of scope (the end of the
        # foreach loop)
        my @paragraph_tags = $p->look_down('_tag', 'p');

        # look_down includes $p itself, so exactly one match
        # means this paragraph contains no nested <p>
        if (scalar(@paragraph_tags) == 1) {
            my $tag = shift(@paragraph_tags);
            my @contents = $tag->content_list;
            my $content = "";
            foreach my $con (@contents) {
                # check that we have text and not an object
                $content .= $con unless (ref $con);
            }
            print $content;
        }
    }

    # free the tree once we are done with it
    $tree->delete;

Just to give you an idea of why using regular expressions to parse HTML is a bad idea, look at this:

<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>

Your original regular expressions make no allowance for the class="" attribute, so your code would break on any page whose tags carry attributes. HTML::TreeBuilder takes them in stride and lets you read the attributes if you ever need them, using: my %attr = $node->all_external_attr;. So again, don't reinvent the wheel if you don't have to.
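To make the point about attributes concrete, here is a minimal sketch. The sample markup and the naive pattern m{<p>(.*?)</p>} are my own illustrations, not your original data or regular expressions, and it assumes HTML::TreeBuilder is installed from CPAN. It runs a pattern that only knows about a bare <p> tag against markup where one paragraph has a class attribute, then reads the same markup through the parser.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input (an assumption, not the original poster's data):
# one paragraph with an attribute, one without.
my $html = '<p class="foo">First paragraph.</p><p>Second paragraph.</p>';

# A naive pattern that only recognises a bare <p> opening tag.
# It silently skips the paragraph that carries class="foo".
my @regex_hits = $html =~ m{<p>(.*?)</p>}g;
print "regex found ", scalar(@regex_hits), " paragraph(s)\n";

# The parser does not care whether a tag has attributes.
my $tree = HTML::TreeBuilder->new_from_content($html);

my @parsed;
foreach my $p ($tree->look_down('_tag', 'p')) {
    # all_external_attr returns the element's real HTML attributes,
    # leaving out the internal ones whose names start with "_"
    my %attr = $p->all_external_attr;
    push @parsed, { text => $p->as_text, class => $attr{class} };
}

foreach my $para (@parsed) {
    my $class = defined $para->{class} ? $para->{class} : '(none)';
    print "parser found: '$para->{text}' with class $class\n";
}

# free the tree once we are done with it
$tree->delete;
```

You could keep extending the pattern to allow for attributes, but then you'd have to handle quoting, whitespace and case as well, which is exactly the work the module has already done.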


In reply to Re: nested tag matching by Vautrin
in thread nested tag matching by murugu
