in reply to Question on extracting HTML tables with HTML::TableExtract

Sorry, I misread this: http://www.perl.com/pub/2003/09/17/perlcookbook.html   -- it's NOT relevant to retaining the html markup.

So, back to HTML::TableExtract -- the documentation makes it very clear that keep_html SHOULD keep the markup.

Thus, the question becomes, could there be an error in your code? We won't know the answer to that until you post it... and a bit of information about how you're acquiring the table in the first place.

Re^2: Question on extracting HTML tables with HTML::TableExtract
by bitingduck (Deacon) on Feb 26, 2012 at 19:22 UTC

    Reading the same docs, it looks like keep_html keeps any markup within the cell (e.g. text formatting) but not the tags that define the table. That it doesn't keep embedded tables within a cell makes me think it removes all table structure tags.

    The synopsis does give a different method that looks like the right thing, though:

    $table_html = $table_tree->as_HTML;

    I haven't tried either one; maybe later on today if the OP doesn't solve it.
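
    For anyone following along, the tree-mode route from the synopsis might look roughly like the sketch below. It is untested here and assumes HTML::TreeBuilder and HTML::ElementTable are installed, which tree mode requires. The helper name first_table_html is made up for illustration; the constructor is called with no constraints, so it matches every table and returns the first one found.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Importing 'tree' switches HTML::TableExtract into tree mode
# (this needs HTML::TreeBuilder and HTML::ElementTable installed).
use HTML::TableExtract qw(tree);

# Hypothetical helper: return the first table in $file as HTML,
# table tags included, or an empty return if no table is found.
sub first_table_html {
    my ($file) = @_;
    my $te = HTML::TableExtract->new;    # no constraints: match every table
    $te->parse_file($file);
    my $table = $te->first_table_found
        or return;                       # no table in the document
    return $table->tree->as_HTML;        # serialize the table's element tree
}

# Command-line use: perl script.pl page.html
if (@ARGV) {
    my $html = first_table_html($ARGV[0]);
    print defined $html ? "$html\n" : "no table found\n";
}
```

    In practice you would probably pass headers, depth or count to new() to pick out the one table you care about, rather than taking the first.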

      We appear to read the 2 line para on keep_html differently. IMO, it's open to multiple readings, but the next para (re strip...) seems more to support your view than mine.

      So, I bet we can agree without dissent that the doc needs some improvement.

      That said, let's think about Parse::HTML. It won't do the whole job by itself for OP (though, as noted above, we don't know how OP is doing whatever led to the SOPW), but with just a little additional code and a few tweaks to the example entitled "The Identity Parser", it should be no great problem to achieve OP's objective.

        Clarity in module docs has always been the thing that bugged me the most about learning Perl, so I end up doing quite a bit of experimental programming. That said, I use Perl mostly because of CPAN: anything I want to do, someone else has mostly solved already. I liked this problem because when I did it myself a long time ago for a scraper (that's been running a few times a week for a few years) I did it the brute-force way, with a regex and by identifying the text around it that tells me it's the table I want. Then it goes into HTML::TreeBuilder to get the data I want. I was going to suggest using TreeBuilder, until I read the docs, which had the example almost written already.
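
        The brute-force approach described above could be sketched along these lines. This is not the actual scraper code; the helper name table_after_marker and the "Results:" marker text are invented for illustration. It finds some known text that precedes the wanted table, then grabs the next table block with a non-greedy regex. It will cut nested tables short, which is one reason a real parser is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first <table>...</table> block that
# follows the marker text $marker, or undef if there is none.
sub table_after_marker {
    my ($html, $marker) = @_;

    # Locate the text that identifies the table we want.
    my $pos = index($html, $marker);
    return if $pos < 0;

    # Only search the part of the page after the marker.
    my $rest = substr($html, $pos + length $marker);

    # Non-greedy, case-insensitive, dot matches newline.
    # Stops at the first closing tag, so nested tables are truncated.
    return $rest =~ m{(<table\b.*?</table\s*>)}is ? $1 : undef;
}

# Example: the second table is the one that follows the marker.
my $page = 'Header <table><tr><td>nav</td></tr></table> '
         . 'Results: <table id="t"><tr><td><b>42</b></td></tr></table>';
my $wanted = table_after_marker($page, 'Results:');
print "$wanted\n" if defined $wanted;
```

        The captured string keeps the table tags and any markup inside the cells, so it can be handed on to a tree builder for the actual data extraction.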