redmage has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to do something which should be relatively straight-forward, but I'm having a terrible time trying to do it.

I am trying to extract an HTML table from a string of HTML. However, I want the table with its "table", "tr" and "td" tags intact. As far as I can tell, the HTML::TableExtract module doesn't return these, even if I specify the keep_html flag in the constructor. How do I tell TableExtract to keep the table HTML tags? Or is there a better way to do this than using the TableExtract module?

Thanks
Redmage

Replies are listed 'Best First'.
Re: Question on extracting HTML tables with HTML::TableExtract
by bitingduck (Deacon) on Feb 26, 2012 at 19:53 UTC

    Look down to the bottom of the synopsis in the HTML::TableExtract docs for everything you need. I just ran this:

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use HTML::TableExtract qw(tree);
    use Data::Dumper;

    my $table_string = <<EOF;
    <table width="100%" bgcolor="#ffffff">
    <tr> <td>Larry &amp; Gloria</td> <td>Mountain View</td> <td>California</td> </tr>
    <tr> <td><b>Tom</b></td> <td>Boulder</td> <td>Colorado</td> </tr>
    <tr> <td>Nathan &amp; Jenine</td> <td>Fort Collins</td> <td>Colorado</td> </tr>
    </table>
    EOF

    my $te = HTML::TableExtract->new(keep_html=>1);
    $te->parse($table_string);
    my $table = $te->first_table_found;
    my $table_tree = $table->tree;
    my $table_html = $table_tree->as_HTML;
    my $table_text = $table_tree->as_text;
    print Dumper($table_html),"\n";
    print Dumper($table_text),"\n";

    I was too lazy to dig up a table earlier, but ran across one I could paste in while I was reading, so I went ahead and tested it. If you change keep_html=>1 to keep_html=>0, then as_HTML will strip all the markup except the table tags, and as_text will strip out the table tags, too.

Re: Question on extracting HTML tables with HTML::TableExtract
by ww (Archbishop) on Feb 26, 2012 at 19:02 UTC
    Sorry, I misread this: http://www.perl.com/pub/2003/09/17/perlcookbook.html   -- it's NOT relevant to retaining the html markup.

    So, back to HTML::TableExtract -- the documentation makes it very clear that keep_html SHOULD keep the markup.

    Thus, the question becomes, could there be an error in your code? We won't know the answer to that until you post it... and a bit of information about how you're acquiring the table in the first place.

      Reading the same docs, it looks like keep_html keeps any markup within the cell (e.g. text formatting) but not the tags that define the table. That it doesn't keep embedded tables within a cell makes me think it removes all table structure tags.

      The synopsis does give a different method that looks like the right thing, though:

      $table_html = $table_tree->as_HTML;

      I haven't tried either one; maybe later on today if the OP doesn't solve it.
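A minimal sketch of the reading described above, assuming plain (non-tree) mode: with keep_html set, the cells returned by rows() are expected to carry their inner markup (such as the b tag) but no table, tr or td tags. The sample table and the @cells variable are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;    # plain mode: no qw(tree) import

# Hypothetical sample input, not taken from the OP's data
my $html = <<'EOF';
<table>
<tr><td><b>Tom</b></td><td>Boulder</td></tr>
</table>
EOF

my $te = HTML::TableExtract->new( keep_html => 1 );
$te->parse($html);

# Collect every cell of every table found
my @cells;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        push @cells, map { defined $_ ? $_ : '' } @$row;
    }
}

# Each cell is printed on its own line so the retained markup is visible
print "cell: $_\n" for @cells;
```

If the cells come back with the b tag but without any td wrapper, that supports the view that keep_html alone will not give the OP the table structure.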

        We appear to read the 2 line para on keep_html differently. IMO, it's open to multiple readings, but the next para (re strip...) seems more to support your view than mine.

        So, I bet we can agree without dissent that the doc needs some improvement.

        That said, let's think about Parse::HTML. It won't do the whole job by itself for OP (but as noted above, we don't know how OP is doing whatever led to the SOPW). Still, with just a little additional code and a few tweaks to the code in the example entitled "The Identity Parser", it should be no great problem to achieve OP's objective.
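A rough sketch of that idea, assuming the module meant here is HTML::Parser and using its version 3 handler API. This is not the documented example itself, only an identity-style parser that passes source text through while inside a table element. The sample HTML and the $depth and $out variables are invented for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample input
my $html = '<p>intro</p><table><tr><td>A</td><td><b>B</b></td></tr></table><p>outro</p>';

my $depth = 0;     # how many <table> elements we are currently inside
my $out   = '';    # accumulated source text of the table(s)

my $p = HTML::Parser->new(
    api_version => 3,
    # Start tags: enter table mode on <table>, then copy the tag text
    start_h => [ sub {
        my ( $tag, $text ) = @_;
        $depth++ if $tag eq 'table';
        $out .= $text if $depth;
    }, 'tagname, text' ],
    # End tags: copy the tag text, then leave table mode on </table>
    end_h => [ sub {
        my ( $tag, $text ) = @_;
        $out .= $text if $depth;
        $depth-- if $tag eq 'table' && $depth;
    }, 'tagname, text' ],
    # Everything else (text, comments, ...) is copied only inside a table
    default_h => [ sub { $out .= $_[0] if $depth }, 'text' ],
);

$p->parse($html);
$p->eof;

print "$out\n";
```

Counting depth rather than using a simple flag keeps nested tables intact, at the cost of concatenating separate top-level tables into one string.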

Re: Question on extracting HTML tables with HTML::TableExtract
by tangent (Parson) on Feb 26, 2012 at 19:46 UTC
    Once you have the table, you can build a tree and then call $table_tree->as_HTML:
    use HTML::TableExtract qw(tree);
    my $te = HTML::TableExtract->new();
    $te->parse($string_of_html);
    my $table = $te->first_table_found;
    my $table_tree = $table->tree;
    my $table_html = $table_tree->as_HTML;
    print $table_html;
    You can also call tables() to get a list of all the tables and then put a loop around the above.

    You need to have the optional HTML::TreeBuilder and HTML::ElementTable installed for this to work.
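The loop over tables() mentioned above might look like the following sketch. It assumes tree mode and the same optional modules; the two-table sample document and the @table_html array are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract qw(tree);    # tree mode needs HTML::TreeBuilder and HTML::ElementTable

# Hypothetical sample input containing two tables
my $string_of_html = <<'EOF';
<html><body>
<table><tr><td>one</td></tr></table>
<p>text between tables</p>
<table><tr><td>two</td></tr></table>
</body></html>
EOF

my $te = HTML::TableExtract->new();
$te->parse($string_of_html);

# Serialize each extracted table, table/tr/td tags included
my @table_html;
for my $table ( $te->tables ) {
    push @table_html, $table->tree->as_HTML;
}

print "$_\n" for @table_html;
```

Note that as_HTML re-serializes the parsed tree, so the output may differ in whitespace or attribute quoting from the original source text.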