Extracting from HTML tables

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm using HTML::TableExtract to scrape a website.

The module either works in html mode or it doesn't, as set by the keep_html option in the constructor.

Trouble is, I want to get at some columns as text and others as HTML (to get at some productIDs in URLs).

The workaround goes like:

make the first TableExtract object with keep_html off, go through table rows creating an AoH with the text values.
make a second TableExtract object with keep_html on, go through table rows again updating the AoH with the values from HTML.
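
For reference, the two-pass workaround described above might look roughly like the sketch below. The sample table, the 'Product' and 'Price' headers, the productID URL layout and the hash keys are all made up for illustration; only the keep_html and headers options, parse, first_table_found and rows come from HTML::TableExtract itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample page; the real headers and URL layout will differ.
my $html = <<'END_HTML';
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td><a href="/item?productID=101">Widget</a></td><td>9.99</td></tr>
  <tr><td><a href="/item?productID=202">Gadget</a></td><td>19.50</td></tr>
</table>
END_HTML

my @headers = ('Product', 'Price');

# Pass 1: keep_html off (the default), so cells come back as plain text.
my $te_text = HTML::TableExtract->new( headers => \@headers );
$te_text->parse($html);
$te_text->eof;

my @products;    # the AoH
foreach my $row ( $te_text->first_table_found->rows ) {
    push @products, { name => $row->[0], price => $row->[1] };
}

# Pass 2: keep_html on, so the same cells come back with their markup.
my $te_html = HTML::TableExtract->new( headers => \@headers, keep_html => 1 );
$te_html->parse($html);
$te_html->eof;

my $i = 0;
foreach my $row ( $te_html->first_table_found->rows ) {
    # Relies on both passes returning the rows in the same order.
    my ($id) = ( $row->[0] // '' ) =~ /productID=(\d+)/;
    $products[ $i++ ]{id} = $id;
}

printf "%s %s %s\n", $_->{id} // '?', $_->{name}, $_->{price} for @products;
```

The weak point is the index-based merge in the second loop: it only holds while both parsers see exactly the same rows.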

Obviously I'd be in big trouble if the two parsers didn't find the same data, but that's not a problem.

Is there a smarter way to do this or another table module which would help?

TIA

($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print


Re: Extracting from HTML tables by tempest (Sexton) on Apr 17, 2006 at 09:18 UTC
You can use HTML::TagFilter to get rid of unwanted markup once you get it... best I can think of.
Re: Extracting from HTML tables by mojotoad (Monsignor) on Apr 17, 2006 at 21:47 UTC
Hi Cody,

If you extract in 'tree' mode then the returned structure is actually a full-fledged HTML::ElementTable object. Example usage similar to what you seem to want:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # load in 'tree' mode for working with
    # HTML::Element structures. note that in
    # this case, subtables are not decoupled
    # from one another.
    use HTML::TableExtract 'tree';

    my $te = HTML::TableExtract->new(
        # extraction parameters here...note that
        # in tree mode, keep_html is irrelevant
    );

    $te->parse_file("./myfile.html");

    my $t = $te->first_table_found or die "oops, no tables.\n";

    # at this point we can work with $t->rows and the
    # cells within, but rather than text or html, the
    # content is now individual element objects/trees

    # for html...
    print "H::TE as html:\n";
    foreach my $row ($t->rows) {
        print join(':', map { $_->as_HTML } @$row), "\n";
    }

    # for text...
    print "H::TE as text:\n";
    foreach my $row ($t->rows) {
        print join(':', map { $_->as_text } @$row), "\n";
    }

    # Alternatively, you could switch entirely over
    # to the HTML::ElementTable structure
    my $et = $t->tree;

    # as html
    print "H::ET as html:\n";
    print $et->as_HTML, "\n";

    # as text
    print "H::ET as text:\n";
    print $et->as_text, "\n";

Cheers,
Matt
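
Applied to the original question (some columns as text, others mined for a productID), a single tree-mode pass could be sketched as follows. This is only an illustration under assumptions: the sample table, the column positions and the productID URL pattern are invented, and it assumes every cell returned by rows is an HTML::Element-style object, as in the example above. look_down and attr are standard HTML::Element methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract 'tree';

# Hypothetical sample page; the real columns and URL layout will differ.
my $html = <<'END_HTML';
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td><a href="/item?productID=101">Widget</a></td><td>9.99</td></tr>
  <tr><td><a href="/item?productID=202">Gadget</a></td><td>19.50</td></tr>
</table>
END_HTML

my $te = HTML::TableExtract->new();
$te->parse($html);
$te->eof;

my $t = $te->first_table_found or die "oops, no tables.\n";

my @products;
foreach my $row ( $t->rows ) {
    my ( $product, $price ) = @$row;

    # Skip anything that is not a pair of element objects.
    next unless ref $product && ref $price;

    # Column 0 is inspected as markup: find the link, read its href.
    # Rows without a link (such as the header row) are skipped.
    my $link = $product->look_down( _tag => 'a' ) or next;
    my ($id) = ( $link->attr('href') // '' ) =~ /productID=(\d+)/;

    # The same cell, plus column 1, taken as plain text.
    push @products, {
        id    => $id,
        name  => $product->as_text,
        price => $price->as_text,
    };
}

printf "%s %s %s\n", $_->{id} // '?', $_->{name}, $_->{price} for @products;
```

Because each record is built from one row in one pass, there is no second parser whose rows have to line up with the first.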
Re^2: Extracting from HTML tables by Cody Pendant (Prior) on Apr 18, 2006 at 00:24 UTC
Very sensible, thanks a lot. That's much more efficient than going over two copies of the same data with two different agents.

($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print