
Reading the same docs, it looks like keep_html keeps any markup within the cell (e.g. text formatting) but not the tags that define the table. That it doesn't keep embedded tables within a cell makes me think it removes all table structure tags.

The synopsis does give a different method that looks like the right thing, though:

$table_html = $table_tree->as_HTML;
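
For context, here is a minimal sketch of how that line fits into a whole script, following the tree-mode part of the HTML::TableExtract synopsis. It assumes HTML::TreeBuilder and HTML::ElementTable are installed (tree mode needs both); the sample document, the `Item`/`Cost` headers and the temp file are made up for illustration and have nothing to do with the OP's page.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
# Importing 'tree' switches HTML::TableExtract into tree mode.
use HTML::TableExtract qw(tree);

# Hypothetical sample document, for illustration only.
my $html = <<'HTML';
<html><body>
<p>Intro text</p>
<table id="prices"><tr><th>Item</th><th>Cost</th></tr>
<tr><td><b>Widget</b></td><td>4.00</td></tr></table>
</body></html>
HTML

# The synopsis uses parse_file, so write the sample to a temp file first.
my ($fh, $html_file) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} $html;
close $fh or die "close failed: $!";

# Pick the table by its header text (assumed headers).
my $te = HTML::TableExtract->new(headers => [ 'Item', 'Cost' ]);
$te->parse_file($html_file);

my $table = $te->first_table_found
    or die "no matching table found\n";

# In tree mode the table is an element tree, so the table tags
# themselves survive when it is serialized back to HTML.
my $table_tree = $table->tree;
my $table_html = $table_tree->as_HTML;
print $table_html, "\n";
```

If this behaves the way the synopsis suggests, `$table_html` holds the `<table>` element with its rows and cell markup intact, which is the part keep_html appears to drop.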

I haven't tried either one; maybe later today if the OP doesn't solve it.

Re^3: Question on extracting HTML tables with HTML::TableExtract
by ww (Archbishop) on Feb 26, 2012 at 19:48 UTC
    We appear to read the two-line para on keep_html differently. IMO, it's open to multiple readings, but the next para (re strip...) seems to support your view more than mine.

    So, I bet we can agree without dissent that the doc needs some improvement.

    That said, let's think about HTML::Parser. It won't do the whole job by itself for OP (and, as noted above, we don't know how OP is doing whatever led to the SOPW). But with just a little additional code and a few tweaks to the example entitled "The Identity Parser", it should be no great problem to achieve OP's objective.
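
One way those tweaks might look, as a rough sketch rather than a finished answer: start from the Identity Parser example in the HTML::Parser docs, which echoes every event's original text, and add a depth counter so that only text seen inside a `<table>` is kept. The sample document and the `$depth` and `$out` variables are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample document, for illustration only.
my $html = <<'HTML';
<html><body>
<p>Intro text</p>
<table id="prices"><tr><th>Item</th><th>Cost</th></tr>
<tr><td><b>Widget</b></td><td>4.00</td></tr></table>
<p>Footer text</p>
</body></html>
HTML

my $depth = 0;     # how many <table> elements we are currently inside
my $out   = '';    # collected table markup

my $p = HTML::Parser->new(
    api_version => 3,
    # Start tags: entering a table raises the depth before we copy.
    start_h => [ sub {
        my ($tag, $text) = @_;
        $depth++ if $tag eq 'table';
        $out .= $text if $depth;
    }, 'tagname, text' ],
    # End tags: copy first, then leave the table.
    end_h => [ sub {
        my ($tag, $text) = @_;
        $out .= $text if $depth;
        if ($tag eq 'table' && $depth) {
            $depth--;
            $out .= "\n" unless $depth;   # separate top-level tables
        }
    }, 'tagname, text' ],
    # Everything else (text, comments, ...) passes through unchanged,
    # as in the identity parser, but only while inside a table.
    default_h => [ sub { $out .= shift if $depth }, 'text' ],
);

$p->parse($html);
$p->eof;

print $out;
```

Because the counter tracks nesting, a table embedded in a cell stays inside its parent's markup instead of being split out. Picking the one table OP wants (by attribute, position or nearby text) would still need a further condition in the start handler.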

      Clarity in module docs has always been the thing that bugged me the most about learning Perl, so I end up doing quite a bit of experimental programming. That said, I use Perl mostly because of CPAN: anything I want to do, someone else has mostly solved already. I liked this problem because I did it myself a long time ago for a scraper (it's been running a few times a week for a few years). I did it the brute-force way, with a regex that identifies the text around the table to tell me it's the one I want. The table then goes into HTML::TreeBuilder to get the data I want. I was going to suggest using TreeBuilder, until I read the docs, which had the example almost written already.
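
To give an idea of the brute-force approach, here is a simplified sketch, not the actual scraper code. It assumes the wanted table is the first one after some known marker text (`Price list` here, made up for the example) and that the table contains no nested tables, since the non-greedy match stops at the first `</table>`.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page, for illustration only.
my $html = <<'HTML';
<html><body>
<table><tr><td>navigation junk</td></tr></table>
<h2>Price list</h2>
<table><tr><th>Item</th><th>Cost</th></tr>
<tr><td><b>Widget</b></td><td>4.00</td></tr></table>
</body></html>
HTML

# Brute force: find the marker text, then grab the first table after it.
my ($table_html) = $html =~ m{Price list.*?(<table\b.*?</table>)}is
    or die "table not found after marker\n";

# Hand just that fragment to HTML::TreeBuilder to pull out the data.
my $tree = HTML::TreeBuilder->new_from_content($table_html);

my @rows;
for my $tr ($tree->look_down(_tag => 'tr')) {
    # Collect the text of every header or data cell in this row.
    push @rows, [ map { $_->as_text } $tr->look_down(_tag => qr/^t[dh]$/) ];
}
$tree->delete;    # free the tree explicitly

print join("\t", @$_), "\n" for @rows;
```

The regex keeps the table tags, so `$table_html` is also usable as-is if the markup is what's wanted; the TreeBuilder step is only needed to get at the cell data.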