root has asked for the wisdom of the Perl Monks concerning the following question:


RE: another regex Question
by Russ (Deacon) on May 24, 2000 at 09:04 UTC
    Just a quick response for you...

    If you add a ? after the *, it will make the expression non-greedy. As long as you don't have any nested tables, this will work. In greedy mode (the default), the regex will grab as much as it can into that .*, matching everything from the first correct table tag, through the very last closing table tag.

    my $thispage = q{
    <TABLE border=0 cellPadding=2 cellSpacing=0>junk</TABLE>
    <TABLE border=0 bogusTag="Don't match this">more junk</TABLE>
    <TABLE border=0 cellPadding=2 cellSpacing=0>most junk</TABLE>
    };
    # Non-greedy .*? stops at the first closing tag after each match
    $thispage =~ s/<TABLE border=0 cellPadding=2 cellSpacing=0>.*?<\/TABLE>//g;
    print $thispage;

    This leaves only the table that was not supposed to match:

    <TABLE border=0 bogusTag="Don't match this">more junk</TABLE>
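
    To see the two modes side by side, here is a minimal sketch. The sample string and the variable names ($html, $greedy, $lazy) are made up for illustration and are not from the original question:

```perl
use strict;
use warnings;

# Hypothetical sample: two tables on one line with text between them.
my $html = '<TABLE>one</TABLE> keep me <TABLE>two</TABLE>';

# Greedy: .* runs on to the LAST </TABLE>, so "keep me" is swallowed too.
(my $greedy = $html) =~ s/<TABLE>.*<\/TABLE>//g;

# Non-greedy: .*? stops at the FIRST </TABLE> after each opening tag.
(my $lazy = $html) =~ s/<TABLE>.*?<\/TABLE>//g;

print "greedy: [$greedy]\n";   # greedy: []
print "lazy:   [$lazy]\n";     # lazy:   [ keep me ]
```

    If a table spans several lines, . will not cross the newlines unless the /s modifier is added.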

    There are better answers to this problem, as others have posted, but I think this "fixes" your regex... :-)


Re: another regex Question
by athomason (Curate) on May 24, 2000 at 08:15 UTC
    First, read this FAQ. In a nutshell, HTML parsing, especially something like analyzing arbitrary tables, is pretty difficult. There are modules designed especially for this, though, so check out HTML::Parser and HTML::TokeParser. Also see answers to a similar question here.
      There's a ready-made subclass of HTML::Parser which should help.
      Check out HTML::TableExtract
      Yes, you need to be very careful with HTML, because attribute names are case-insensitive: cellpadding, cellPadDing, and ceLLpaddinG are all the same.

      I actually saw a table-extract module which may be of use to you. Personally, I would not advise doing any HTML parsing yourself. Use modules -- their authors know their stuff!

RE: another regex Question
by Michalis (Pilgrim) on May 24, 2000 at 14:14 UTC
    I think you should be able to do it like this:

    # /i because HTML tag and attribute names are case-insensitive
    if (/<TABLE border=0 cellPadding=2 cellspacing=0>(.*?)<\/TABLE>/i) { print $1; }

Re: another regex Question
by johncoswell (Acolyte) on May 24, 2000 at 17:37 UTC
    Why not try a split? I process many HTML files with this command. Split the file on every occurrence of <TABLE and you won't have to worry about eating up too many tables.

    # /i so that <table and <TABLE are both split points
    my @parsedfile = split(/<TABLE/i, $file);

    This way, you can concentrate on only one table at a time and won't have to worry about greedy regexps.

    foreach my $line (@parsedfile) {
        # /i so that cellPadding=2 matches as well
        if ($line =~ /cellpadding=2/i) {
            # do whatever
        }
    }

    John Coswell -

      You will run into problems with nested tables, won't you?

      A more complex solution would involve keeping track of the number of open and close table tags, so you can be sure that they match up. For instance, each time you pass an open table tag, increment a counter; each time you pass a close table tag, decrement it. When the counter is 1 or more, you are inside a table; when it is back at 0, you are outside of any table. If it reaches 2 or more, you are inside a nested table.

      I don't know how feasible this is, but it might be useful.
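
      As a rough illustration of the counter idea, here is a minimal sketch that keeps top-level tables and drops anything at depth 2 or more. The helper name strip_nested_tables is hypothetical, and the sketch assumes reasonably tidy markup: balanced table tags, no ">" inside attribute values, and no table tags hidden in comments or scripts. For real-world HTML, a parser module remains the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove nested tables,
# keeping only top-level ones, by tracking the nesting depth.
sub strip_nested_tables {
    my ($html) = @_;
    my $depth = 0;    # 0 = outside any table, 1 = main table, 2+ = nested
    my $out   = '';

    # Split on table tags; the capture group keeps the tags in the list.
    for my $chunk (split /(<\/?table\b[^>]*>)/i, $html) {
        if ($chunk =~ /^<table\b/i) {
            $depth++;
            $out .= $chunk if $depth == 1;    # keep only the outer open tag
        }
        elsif ($chunk =~ /^<\/table\b/i) {
            $out .= $chunk if $depth == 1;    # keep only the outer close tag
            $depth-- if $depth > 0;           # never go below zero
        }
        else {
            $out .= $chunk if $depth <= 1;    # drop text inside nested tables
        }
    }
    return $out;
}

my $page = '<table><tr><td>a<table><tr><td>inner</td></tr></table>b</td></tr></table>';
print strip_nested_tables($page), "\n";
# prints: <table><tr><td>ab</td></tr></table>
```

      The same counter could instead be used to report unbalanced tags: a nonzero depth at the end of the file means an open tag was never closed.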

      J. J. Horner

      Linux, Perl, Apache, Stronghold, Unix

        Definitely true. 8^) I guess it depends on whether you use nested tables and need to keep track of the nesting for some purpose. If you have a table nested within a table, and you just want to delete the table definition, the information would be plopped into the outer table's cell without formatting, kind of like how you merge cells in PageMill.

Re: another regex Question
by KM (Priest) on May 24, 2000 at 18:08 UTC
    Stop steering him in the direction of reinventing the wheel. The advice to look at HTML::Parser, HTML::TokeParser, and the other HTML::* modules is the best answer. Don't try to reinvent what already works.


Re: a regex Question
by BigJoe (Curate) on May 25, 2000 at 00:46 UTC
    Actually, I am just trying to remove all the embedded tables that are in the HTML file but leave the one main table.