Question on extracting HTML tables with HTML::TableExtract

redmage has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Question on extracting HTML tables with HTML::TableExtract by bitingduck (Deacon) on Feb 26, 2012 at 19:53 UTC
Look down to the bottom of the synopsis in the HTML::TableExtract docs for everything you need. I just ran this:

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use HTML::TableExtract qw(tree);
    use Data::Dumper;

    my $table_string = <<EOF;
    <table width="100%" bgcolor="#ffffff">
    <tr> <td>Larry & Gloria</td> <td>Mountain View</td> <td>California</td> </tr>
    <tr> <td><b>Tom</b></td> <td>Boulder</td> <td>Colorado</td> </tr>
    <tr> <td>Nathan & Jenine</td> <td>Fort Collins</td> <td>Colorado</td> </tr>
    </table>
    EOF

    my $te = HTML::TableExtract->new(keep_html=>1);
    $te->parse($table_string);
    my $table = $te->first_table_found;
    my $table_tree = $table->tree;
    my $table_html = $table_tree->as_HTML;
    my $table_text = $table_tree->as_text;
    print Dumper($table_html),"\n";
    print Dumper($table_text),"\n";

I was too lazy to dig up a table earlier, but ran across one I could paste in while I was reading, so I went ahead and tested it. If you change `keep_html=>1` to `keep_html=>0`, then as_HTML will strip all the markup except the table tags, and as_text will strip out all the table tags, too.
Re: Question on extracting HTML tables with HTML::TableExtract by ww (Archbishop) on Feb 26, 2012 at 19:02 UTC
Sorry, I misread this: ~~http://www.perl.com/pub/2003/09/17/perlcookbook.html~~ -- it's NOT relevant to retaining the HTML markup. So, back to HTML::TableExtract -- the documentation makes it very clear that `keep_html` SHOULD keep the markup. Thus, the question becomes, could there be an error in your code? We won't know the answer to that until you post it... and a bit of information about how you're acquiring the table in the first place.
Re^2: Question on extracting HTML tables with HTML::TableExtract by bitingduck (Deacon) on Feb 26, 2012 at 19:22 UTC
Reading the same docs, it looks like keep_html keeps any markup within the cell (e.g. text formatting) but not the tags that define the table. That it doesn't keep embedded tables within a cell makes me think it removes all table structure tags. The synopsis does give a different method that looks like the right thing, though: `$table_html = $table_tree->as_HTML;` I haven't tried either one; maybe later on today if the OP doesn't solve it.
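To make that reading of keep_html concrete, here is a minimal sketch in plain (non-tree) mode. The sample table is made up for illustration (trimmed from the one used elsewhere in this thread), and the expectation that in-cell tags such as `<b>` survive while the table tags do not is the reading of the docs described above, not something verified here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;    # plain mode: no qw(tree) import

# Hypothetical sample input, not from the original question.
my $html = <<'EOF';
<table>
<tr><td><b>Tom</b></td><td>Boulder</td></tr>
<tr><td>Nathan</td><td>Fort Collins</td></tr>
</table>
EOF

# keep_html => 1 asks the extractor to retain markup found inside cells.
my $te = HTML::TableExtract->new(keep_html => 1);
$te->parse($html);

# rows() returns one array reference per table row.
my @rows = $te->first_table_found->rows;
for my $row (@rows) {
    # Cells can be undef, so substitute an empty string before joining.
    print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
}
```

Switching to `keep_html => 0` and comparing the printed cells is a quick way to see which tags the option actually affects.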
Re^3: Question on extracting HTML tables with HTML::TableExtract by ww (Archbishop) on Feb 26, 2012 at 19:48 UTC
We appear to read the 2 line para on keep_html differently. IMO, it's open to multiple readings, but the next para (re strip...) seems more to support your view than mine. So, I bet we can agree without dissent that the doc needs some improvement. That said, let's think about Parse::HTML. It won't do the whole job, by itself, for OP (but as noted above, we don't know how OP is doing whatever led to the SOPW), but with just a little additional code, and a few tweaks to the code in the example entitled "The Identity Parser", it should be no great problem to achieve OP's objective.
Re^4: Question on extracting HTML tables with HTML::TableExtract by bitingduck (Deacon) on Feb 26, 2012 at 20:07 UTC
Re: Question on extracting HTML tables with HTML::TableExtract by tangent (Parson) on Feb 26, 2012 at 19:46 UTC
Once you have the table, you can build a tree and then call `$table_tree->as_HTML`:

    use HTML::TableExtract qw(tree);
    my $te = HTML::TableExtract->new();
    $te->parse($string_of_html);
    my $table = $te->first_table_found;
    my $table_tree = $table->tree;
    my $table_html = $table_tree->as_HTML;
    print $table_html;

You can also call `tables()` to get a list of all the tables and then put a loop around the above. You need to have the optional HTML::TreeBuilder and HTML::ElementTable modules installed for this to work.
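The loop over `tables()` might look roughly like the sketch below. The two-table input string and the `@tables_html` array are made up for illustration, and it assumes HTML::TreeBuilder and HTML::ElementTable are installed, as noted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract qw(tree);    # tree mode, as in the snippet above

# Hypothetical input containing two sibling tables.
my $string_of_html = <<'EOF';
<html><body>
<table><tr><td>Larry</td><td>California</td></tr></table>
<table><tr><td><b>Tom</b></td><td>Colorado</td></tr></table>
</body></html>
EOF

my $te = HTML::TableExtract->new();
$te->parse($string_of_html);

# Collect the HTML of every table the extractor matched.
my @tables_html;
for my $table ($te->tables) {
    my $table_tree = $table->tree;
    push @tables_html, $table_tree->as_HTML;
}

print "$_\n" for @tables_html;
```

With no `headers`, `depth`, or `count` constraints passed to `new()`, every table in the document should match, so the loop visits each one in turn.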