Don't try to do it with a regex. Nested structures are very hard to handle with REs, and nested HTML is probably as hard as it gets. Here is how to do it right using HTML::Parser. Yes, the Version 2 API takes a little getting used to, but it is very easy to use once you get your head around it. All we do is increment a counter when we find an opening table tag and decrement it when we find a closing table tag. HTML::Parser lowercases $tagname for us, so <tablE> and <TABLE> are counted too. If the counter is > 0 we are inside a table, so we don't add the original text to our data. If it is 0 we are outside all tables, so we add the origtext.
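use strict;
use warnings;

{
    package MyParser;
    use base 'HTML::Parser';

    # Opening tag: bump the depth on <table>, keep the tag only outside tables.
    sub start {
        my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
        $self->{table}++ if $tagname eq 'table';
        $self->{data} .= $origtext unless $self->{table};
    }

    # Closing tag: keep it only outside tables, then drop the depth on </table>.
    # The depth is never taken below 0, so a stray </table> cannot hide later text.
    sub end {
        my ($self, $tagname, $origtext) = @_;
        $self->{data} .= "</$tagname>" unless $self->{table};
        $self->{table}-- if $tagname eq 'table' and $self->{table};
    }

    # Text: keep it only when we are outside every table.
    sub text {
        my ($self, $origtext, $is_cdata) = @_;
        $self->{data} .= $origtext unless $self->{table};
    }

    # Comments are dropped; uncomment (and define $want_comments) to keep them.
    sub comment {
        my ($self, $origtext) = @_;
        #$self->{data} .= $origtext if $want_comments
    }
}

my $p = MyParser->new;
$p->parse_file(*DATA);
my $data = $p->{data};
print $data;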
__DATA__
<html>
<head>
<title></title>
</head>
<body>
<p>Hello
<tablE>
.....
</taBle>
<p>World
<table>
<tr><TABLE> Nested
<table> Nested some more </table>
</table>
</tr>
</table >
<p>REs can be useful, but HTML parser rocks!
</body>
</html>
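
If you would rather use the handler-based Version 3 API than subclass, the same counter idea can be sketched roughly as below. Treat it as a sketch under assumptions, not as part of the code above: the names $depth and $out and the short inline test string are made up for illustration, and everything that is not a start or end tag (text, comments, declarations) goes through the default handler.

```perl
use strict;
use warnings;
use HTML::Parser;

my $depth = 0;     # hypothetical name: how many <table> elements are open
my $out   = '';    # hypothetical name: everything found outside of tables

my $p = HTML::Parser->new(
    api_version => 3,

    # Opening tag: count <table>, keep the original text only outside tables.
    start_h => [ sub {
        my ($tagname, $text) = @_;
        $depth++ if $tagname eq 'table';
        $out .= $text unless $depth;
    }, 'tagname, text' ],

    # Closing tag: keep it only outside tables, then count </table> down,
    # never below zero.
    end_h => [ sub {
        my ($tagname, $text) = @_;
        $out .= $text unless $depth;
        $depth-- if $tagname eq 'table' and $depth > 0;
    }, 'tagname, text' ],

    # Everything else (text, comments, ...) is kept only outside tables.
    default_h => [ sub {
        my ($text) = @_;
        $out .= $text if defined $text and !$depth;
    }, 'text' ],
);

# Assumed sample input, much shorter than the __DATA__ section above.
$p->parse('<p>Hello<tablE><tr><TABLE>Nested</TABLE></tr></taBle><p>World');
$p->eof;

print $out, "\n";
```

Unlike the subclass version, this one appends the end tag's original text rather than rebuilding it from $tagname, and it keeps comments that sit outside tables. Drop the default handler in favour of separate text_h and comment_h handlers if you want to treat those differently.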
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print