Re: Regular Expression

The problem with your approach is that it will fail if someone changes the case of the tags or if your tags have attributes. Say you have a table nested inside another table, and the outer start tag has attributes (or is spread over more than one line). Your routine might then match from the second table start tag down to the last table end tag, leaving you with unbalanced tags.
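To make that failure concrete, here is a minimal sketch. The pattern and the sub name strip_tables_naively are made up for illustration (they are not your code), and the sample HTML is hypothetical, but a literal match on the start tag behaves the same way:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: the outer start tag carries an attribute,
# the inner one does not.
my $html = <<'END_HTML';
<table border="1">
  <tr><td>
    <table>
      <tr><td>inner</td></tr>
    </table>
  </td></tr>
</table>
END_HTML

# Hypothetical naive pattern (an assumption, not the original regex):
# a literal "<table>" through the last "</table>".
sub strip_tables_naively {
    my ($text) = @_;
    $text =~ s{<table>.*</table>}{}s;
    return $text;
}

my $result = strip_tables_naively($html);

# Count the table tags that survived the substitution.
my $starts = () = $result =~ /<table\b/gi;
my $ends   = () = $result =~ m{</table>}gi;
print "start tags: $starts, end tags: $ends\n";
```

The literal pattern skips the outer start tag because of its attribute, begins matching at the inner start tag, and swallows both end tags, so one start tag is left with nothing to close it.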

For this, use a parser. In this case, since it appears that you want to remove tables inside of tables, you'll also want to count the table start and end tags to make sure they stay balanced. Here's a quick hack for you. Note that it drops everything between the outermost table tags, so with the sample data below only the two paragraphs and an empty table are printed.

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple 1.4;

my $parser = HTML::TokeParser::Simple->new( *DATA );

my $html = '';
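my $depth = 0;   # number of table elements currently open

while ( my $token = $parser->get_token ) {
  # Outside any table, copy the token through unchanged.
  # This also emits the outermost table start tag.
  $html .= $token->as_is unless $depth;
  if ( $token->is_tag('table') ) {
    $depth += $token->is_start_tag ? 1 : -1;
    # The end tag that closes the outermost table was skipped above,
    # so emit it now that the depth is back to zero.
    $html .= $token->as_is if $token->is_end_tag and ! $depth;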
  }
}
print $html;

__DATA__
<p>one</p>
<table>
  <tr>
    <td>
      <table>
        <tr>
          <td>test</td>
        </tr>
      </table>
    </td>
  </tr>
</table>
<p>two</p>

Cheers,
Ovid

