
First of all, if you're a novice you should have the following at the top of every script:

use strict;
use warnings;
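As a rough illustration of the kind of mistake strict rejects, here is a minimal sketch. The helper strict_error and the variable names $total and $totl are made up for this example; the snippet is compiled with a string eval so the error can be captured instead of killing the script.

```perl
use strict;
use warnings;

# Compile a small snippet under strict and warnings and return the
# compile error, or an empty string if it compiled cleanly.
# (Hypothetical helper, for illustration only.)
sub strict_error {
    my ($snippet) = @_;
    my $ok = eval "use strict; use warnings; $snippet; 1";
    return $ok ? '' : $@;
}

# $totl is a typo for $total; strict refuses the undeclared variable
my $err = strict_error('my $total = 1; $totl = 2');
print "strict caught: $err" if $err;
```

Without use strict, the misspelled $totl would silently become a new global variable and the bug would only show up later as a wrong value.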

These two pragmas catch many common mistakes, such as misspelled variable names, before they cause trouble. Second, why aren't you using a module from CPAN, e.g. HTML::TreeBuilder, to parse the HTML? You should never mess around with regular expressions on HTML. The original SGML specifications from which HTML is derived are pretty loose, which means that for every rule there are half a dozen exceptions (or more!) that most browsers will render even though they are a pain to parse. A parser module will not only make your code more robust, it will also make it much more intuitive to read, e.g.:

use strict;
use warnings;
use HTML::TreeBuilder;
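# parse() expects the markup itself; if the argument is a
# file name, call $tree->parse_file($file) instead
my $HTML_to_parse = shift (@ARGV);
my $tree = HTML::TreeBuilder->new;
$tree->parse($HTML_to_parse);
$tree->eof;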

my @paragraph_tags = $tree->look_down('_tag', 'p');
foreach my $p (@paragraph_tags) {
  # note that this variable will "hide" the other
  # copy of @paragraph_tags and be freed as soon as
  # it goes out of scope (the end of each pass
  # through the foreach loop)
  my @paragraph_tags = $p->look_down('_tag', 'p');
  # look_down includes $p itself, so a count of 1
  # means $p has no other <p> nested inside it
  if (scalar (@paragraph_tags) == 1) {
    my $tag = shift (@paragraph_tags);
    my @contents = $tag->content_list;
    my $content = "";
    foreach my $con (@contents) {
      # check that we have text and not an object
      $content .= $con unless (ref $con);
    }
    print $content;
  }
}

Just to give you an idea of why using regular expressions to parse HTML is a bad idea, look at this:

<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>

Your original regular expressions make no allowance for the class="..." attributes, so your code would break on any page whose tags carry attributes. HTML::TreeBuilder takes them in stride and, if you ever need them, lets you read the attributes with: my %attr = $node->all_external_attr; So again, don't reinvent the wheel if you don't have to.
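To make the contrast concrete, here is a minimal sketch. The pattern in naive_paragraphs is a hypothetical stand-in, not your actual regular expression, and the sample markup and the helper names are invented for the example. It assumes HTML::TreeBuilder is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical hand-rolled pattern: it only recognises a bare <p>
# tag, so any attribute on the tag makes the match fail.
sub naive_paragraphs {
    my ($html) = @_;
    return $html =~ m{<p>(.*?)</p>}gs;
}

# Parser-based version: one hash per <p> element, holding its text
# and its attributes (internal "_"-prefixed attributes are skipped).
sub parsed_paragraphs {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @found;
    foreach my $p ($tree->look_down('_tag', 'p')) {
        my %attr = $p->all_external_attr;
        push @found, { text => $p->as_text, attr => \%attr };
    }
    $tree->delete;    # free the tree once we are done with it
    return @found;
}

my $html   = '<p class="intro">Hello <b>world</b></p>';
my @naive  = naive_paragraphs($html);
my @parsed = parsed_paragraphs($html);

print "regex matches:  ", scalar(@naive),  "\n";
print "parser matches: ", scalar(@parsed), "\n";
print "class: $parsed[0]{attr}{class}\n" if @parsed;
```

You could keep patching the pattern to tolerate attributes, but every patch is one more case the parser already handles for you.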

In reply to Re: nested tag matching by Vautrin
in thread nested tag matching by murugu
