Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi there! I am having a problem trying to match everything inside the TABLE tags and replace all of its content with nothing. The code I am using seems to match only if the string has no newline; as soon as a newline is found, the regular expression no longer matches. Any help on how I could fix this?
Here is the part of the text file I am trying to work with:
<tr>
<td colspan=3>
<table border="0" cellspacing="0" cellpadding="2" bgcolor="#ffffff">
<tr>
<td width="100%" align="left"><strong>
<a href='http://www.webtesting.com/cgi-bin/test.cgi>
<!--&user=default --><!--a href='test.pl?&user=default&theme=default' -->
<font size="2"><small>Click here to make this your default!</small></font></a>
</strong></td>
</tr>
</table>
<form>
<input type="button" name="Submit" onClick="top.opener.location.reload(1); top.close()" value="GO!" style="font:9pt arial; background:silver"></p>
</form>
</td>
</tr>
</table>
</body>
</html>
So I am trying to find everything that is inside the table tags and get rid of it.
Here is the Perl code:
open(FILE, "$text_file") || print "Can't open output file1: $text_file\n";
while(<FILE>)
{
    if ($_=~/<table>(.*)<\/table>/sg) {$_=~s/<table>(.*)<\/table>//sg;}
    $save=$save.$_;
}
my $file2 = "search_test2.txt";
open(DATA_OUT, ">$file2") || print "Can't open output file1: $file2\n";
print DATA_OUT $save;
close FILE;
print $save;
Thanks!!!

Re: Regular Expression
by Ovid (Cardinal) on Jan 02, 2003 at 16:49 UTC

    The problem with your approach is that it will fail if someone changes the case of the tags or your tags have attributes. Let's say you have a table inside of another table and the first start table tag has attributes (or the tag is spread over more than one line), then your routine might match at the second table tag down to the last table tag, thus resulting in imbalanced tags.

    For this, use a parser. In this case, since it appears that you want to remove tables inside of tables, you'll also want to keep track of the number of table start and end tags to ensure that you are properly balancing them. Here's a quick hack for you.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple 1.4;

    my $parser = HTML::TokeParser::Simple->new( *DATA );
    my $html = '';
    my $not_balanced = 0;
    while ( my $token = $parser->get_token ) {
        $html .= $token->as_is unless $not_balanced;
        if ( $token->is_tag('table') ) {
            $not_balanced += $token->is_start_tag ? 1 : -1;
            # ugh, I don't like the double negative
            $html .= $token->as_is if $token->is_end_tag and ! $not_balanced;
        }
    }
    print $html;
    __DATA__
    <p>one</p>
    <table>
      <tr>
        <td>
          <table>
            <tr>
              <td>test</td>
            </tr>
          </table>
        </td>
      </tr>
    </table>
    <p>two</p>

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

Re: Regular Expression
by davorg (Chancellor) on Jan 02, 2003 at 16:45 UTC

    Please don't try to use regular expressions to parse HTML. It will generally end in tears.

    CPAN has a number of modules that you can use to parse HTML properly. In this case HTML::TreeBuilder looks like it might be your best bet.
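
    A minimal sketch of what the HTML::TreeBuilder route could look like, assuming the module is installed from CPAN. The sample markup and the variable names ($tree, @outer, $result) are made up for illustration and are not from the original post. It empties every outermost table and leaves the table tags themselves in place:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # from CPAN, not part of core Perl

# Hypothetical sample input: a table nested inside another table.
my $html = <<'HTML';
<html><body>
<p>one</p>
<table border="0"><tr><td>
<table><tr><td>nested cell</td></tr></table>
</td></tr></table>
<p>two</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down in list context returns every <table> element, nested ones too.
# Keep only tables that have no <table> ancestor, i.e. the outermost ones.
my @outer = grep { !$_->parent->look_up( _tag => 'table' ) }
            $tree->look_down( _tag => 'table' );

# Emptying an outer table also discards any tables nested inside it.
$_->delete_content for @outer;

# The empty hashref asks as_HTML not to omit any optional end tags.
my $result = $tree->as_HTML( undef, '  ', {} );
$tree->delete;    # free the tree explicitly
print $result, "\n";
```

    Because the parser builds a tree, attributes, tag case and tags spread over several lines make no difference. Note that as_HTML re-serializes the whole document, so whitespace and quoting in the output may differ from the input.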

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

    s/Treebuilder/TreeBuilder/ - dvergin 2003-01-02

(jeffa) Re: Regular Expression
by jeffa (Bishop) on Jan 02, 2003 at 16:47 UTC
    You might simplify your task by slurping the whole file instead of parsing it one line at a time:
    my $html = do {local $/;<FILE>};
    Then you could use a regex like:
    $html =~ s/<table[^>]*>.+<\/table>//s;
    And that will work ... for a little while - nested tables will throw a monkey wrench into that regex. Your best bet is to RTFM and use a real HTML Parser.
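
    To make the slurp idea and its limits concrete, here is a small sketch using only core Perl. The sample strings and variable names ($source, $stripped, $nested, $broken) are assumptions for illustration; an in-memory filehandle stands in for the real file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: a table whose tags sit on different lines.
my $source = <<'HTML';
<p>one</p>
<table border="0">
<tr><td>cell</td></tr>
</table>
<p>two</p>
HTML

# An in-memory filehandle keeps the sketch independent of any file on disk.
open my $fh, '<', \$source or die "Can't open in-memory file: $!";
my $html = do { local $/; <$fh> };    # with $/ undefined, <$fh> returns the whole file
close $fh;

# /s lets . match newlines, /i ignores tag case,
# and [^>]* accepts a table tag with or without attributes.
( my $stripped = $html ) =~ s/<table\b[^>]*>.*<\/table>//si;
print $stripped;

# Nested tables: a non-greedy .*? stops at the FIRST </table>,
# so the closing tags of the outer table are left dangling.
my $nested = '<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>';
( my $broken = $nested ) =~ s/<table\b[^>]*>.*?<\/table>//si;
print "$broken\n";    # prints "</td></tr></table>"
```

    The greedy version has the opposite problem: with two separate tables on a page it removes everything from the first start tag to the last end tag, including whatever sits between the tables. Neither form can count nesting depth, which is why a parser is the safer choice.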

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)