http://qs1969.pair.com?node_id=69112


in reply to using the headers method of HTML::TableExtract to find an image

I should have known better than to create a root node that contained only one line of actual Perl code. Let's try this again, this time fueled by a bit more sleep.

My goal is to extract the data from a table (for this example we'll use this one), where I know only the headers for the fields. Thanks to HTML::TableExtract's headers method, this is quite simple:

    use strict;
    use HTML::TableExtract;

    # I'm using LWP in the real code, but this is a minimalistic attempt at
    # a working example
    my $html_doc_name = '/tmp/symbols.html';
    my $html_doc_string;
    my $te = new HTML::TableExtract( headers => ['Character', 'Entity'] );
    my $ts;
    my $row;

    undef $/; # the absence of this one little line always causes me so much trouble
    open(HTML, $html_doc_name) or die "Couldn't open html file: $!\n";
    $html_doc_string = <HTML>;
    close(HTML) or die "Couldn't close html file: $!\n";

    $te->parse($html_doc_string);

    # Examine all matching tables
    foreach $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach $row ($ts->rows) {
            print join("\t\t", @$row), "\n";
        }
    }


This gives me the data I'm looking for. However, if the header I'm looking for is an image (usually of stylized text stating what the columns represent), this ceases to work. Say that, rather than being labeled 'Character' and 'Entity', those columns were headed by <img src="http://www.htmlhelp.com/images/Character.jpeg"> and <img src="http://www.htmlhelp.com/images/Entity.jpeg">, respectively. With this one seemingly minor change to the headers, the code stops working, even if I make the appropriate modifications to the header criteria.

As stated above, my suspicion is that the image URLs are now HTML::Parser objects rather than plain text, so HTML::TableExtract skips over them and looks only in the plaintext portion of the HTML.

My question is this: is there a way to make TableExtract look in the image tags for my selection criteria? If I can't do that directly, can I tell HTML::Parser itself to treat image tags as plain text (presumably making TableExtract work as it does with plaintext headers)? Is there perhaps some other method entirely which I should be using?
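To make the "treat image tags as plain text" idea concrete, here is a rough sketch of the effect I'm after, done as a preprocessing step on the HTML string before it ever reaches the parser. The helper name img_tags_to_text is my own invention (it is not part of HTML::TableExtract or HTML::Parser), and a regex like this is only a stand-in for real HTML parsing: it assumes a quoted src attribute and nothing exotic inside the tag.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): replace each <img ...> tag
# with the text of its src attribute, so the URL becomes ordinary cell text.
# Assumes the src value is quoted; tags without a quoted src are left alone.
sub img_tags_to_text {
    my ($html) = @_;
    $html =~ s{<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1[^>]*>}{$2}gis;
    return $html;
}

# Sample header row modeled on the example URLs above
my $sample = '<tr><th><img src="http://www.htmlhelp.com/images/Character.jpeg"></th>'
           . '<th><img src="http://www.htmlhelp.com/images/Entity.jpeg"></th></tr>';

print img_tags_to_text($sample), "\n";
```

The rewritten string would then go to $te->parse in place of the original, with header criteria such as 'Character\.jpeg' and 'Entity\.jpeg'. That last part assumes the headers are matched as patterns against the visible cell text, which is my reading of the module rather than something I've confirmed.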

Hopefully this time my question is clear enough to warrant something other than upvotes for effort. :)

And no, I don't own 27 pairs of sweatpants.