PerlMonks  

I should have known better than to create a root node that contained only one line of actual Perl code. Let's try this again, this time fueled by a bit more sleep.

My goal is to extract the data from a table (for this example we'll use this one), where I know only the headers for the fields. Thanks to HTML::TableExtract's headers method, this is quite simple:

use strict;
use HTML::TableExtract;

# I'm using LWP in the real code, but this is a minimalistic attempt at a working example
my $html_doc_name = '/tmp/symbols.html';
my $html_doc_string;
my $te = new HTML::TableExtract( headers => ['Character', 'Entity'] );
my $ts;
my $row;

undef $/; # the absence of this one little line always causes me so much trouble
open(HTML, $html_doc_name) or die "Couldn't open html file: $!\n";
$html_doc_string = <HTML>;
close(HTML) or die "Couldn't close html file: $!\n";

$te->parse($html_doc_string);

# Examine all matching tables
foreach $ts ($te->table_states) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach $row ($ts->rows) {
        print join("\t\t", @$row), "\n";
    }
}


This gives me the data I'm looking for. However, if the header I'm looking for is an image (usually of stylized text stating what the columns represent), this ceases to work. Say that, rather than being labeled 'Character' and 'Entity', those columns were headed by <img src="http://www.htmlhelp.com/images/Character.jpeg"> and <img src="http://www.htmlhelp.com/images/Entity.jpeg">, respectively. With this one seemingly minor change to the headers, the code stops working, even if I make the appropriate modifications to the header criteria. As stated above, my suspicion is that, because the image URLs are now handled by HTML::Parser as tags rather than as plain text, HTML::TableExtract skips over them and looks only at the plaintext portion of the HTML.

My question is this: is there a way to make TableExtract look in the image tags for my selection criteria? If I can't do that directly, can I tell HTML::Parser itself that I'd like it to treat image tags as plain text (presumably making TableExtract work as it does with plaintext headers)? Is there perhaps some other method entirely that I should be using?
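To make the "treat image tags as plain text" idea concrete, here is a minimal sketch of one possible workaround: rewrite the HTML before parsing so that each <img> tag is replaced by the value of its src attribute. The helper name img_tags_to_text is hypothetical (it is not part of HTML::TableExtract or HTML::Parser), the sample table row is made up for illustration, and the regex is a rough approximation that assumes quoted src attributes; it is not a robust HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not provided by any module: replace every <img>
# tag that has a quoted src attribute with the bare src value, so the
# image URL becomes ordinary text inside the cell.
sub img_tags_to_text {
    my ($html) = @_;
    $html =~ s{<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1[^>]*>}{$2}gis;
    return $html;
}

# Sample input: header cells use the image URLs from the question;
# the data row is invented purely for illustration.
my $html = <<'HTML';
<table>
<tr><th><img src="http://www.htmlhelp.com/images/Character.jpeg"></th>
    <th><img src="http://www.htmlhelp.com/images/Entity.jpeg"></th></tr>
<tr><td>non-breaking space</td><td>&amp;nbsp;</td></tr>
</table>
HTML

my $text_only = img_tags_to_text($html);
print $text_only;
```

The rewritten string could then be passed to $te->parse() in place of $html_doc_string. As far as I know, HTML::TableExtract matches header strings as case-insensitive patterns, so headers => ['Character', 'Entity'] would be expected to match the URL text, but that part is an assumption and untested against the real page.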

Hopefully this time my question is clear enough to warrant something other than upvotes for effort. :)

And no, I don't own 27 pairs of sweatpants.

In reply to Re: using the headers method of HTML::TableExtract to find an image by brainpan
in thread using the headers method of HTML::TableExtract to find an image by brainpan
