PerlMonks  

Grabbing Embedded Tables from HTML

by Anonymous Monk
on Jan 29, 2004 at 06:59 UTC ( [id://324889]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I would like to grab information from tables such as the one located at http://www.tvguide.com/listings/index.asp or any table like that. I know that it isn't a regular HTML table, so how do I grab the data? I have tried everything that I can think of and cannot get the information from the TV listing table. Any suggestions? Thanks, J

Replies are listed 'Best First'.
Re: Grabbing Embedded Tables from HTML
by davido (Cardinal) on Jan 29, 2004 at 09:05 UTC
    Since I'm not anxious to register on some website just so I can see what on earth the tables you're talking about look like, how about trying to give a semi-technical description for us? You said they're not regular HTML tables, so what are they? Are they graphic images? Are they JavaScript entities? Are they PDF files?

    And since you've already tried everything you can think of, can you tell us what you thought of thus far, so that we know where you've already invested time? While you're at it, you might also let us know in what way your attempts fell short of meeting the need.

    Since I don't know better, I'll suggest that most websites worth their salt will also be lynx-friendly. That being the case, perhaps the easiest way to get at the data in the tables in question is to parse the all-text output from lynx, which is easy to capture. Of course, this assumes you're on a Linux/Unix-type system. In this way, you can use the robustness of lynx -- a full-fledged text-based browser capable of handling cookies and all sorts of curve-balls -- to intelligently dump the site to text.
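A minimal sketch of the lynx approach, assuming lynx is installed and on the PATH. The `-dump` and `-nolist` options are standard lynx flags. The helper names `dump_page` and `listing_lines`, and the guess that a listing row contains a clock time such as 8:00, are assumptions for illustration only; nothing here is known about the layout of the actual page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run lynx and return the rendered page as plain text.
# The list form of open avoids passing the URL through a shell.
sub dump_page {
    my ($url) = @_;
    open(my $lynx, '-|', 'lynx', '-dump', '-nolist', $url)
        or die "Cannot run lynx: $!";
    local $/;                       # slurp the whole output at once
    my $text = <$lynx>;
    close($lynx) or warn "lynx exited with status $?";
    return defined $text ? $text : '';
}

# Hypothetical filter: keep only lines that contain a clock time.
sub listing_lines {
    my ($text) = @_;
    return grep { /\b\d{1,2}:\d{2}\b/ } split /\n/, $text;
}

# Only touch the network when run as a script with a URL argument.
if (!caller() && @ARGV) {
    print "$_\n" for listing_lines(dump_page($ARGV[0]));
}
```

Run it as `perl lynxdump.pl URL` (the script name is arbitrary). The filter would almost certainly need adjusting once you can see what the dumped text really looks like.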


    Dave

Re: Grabbing Embedded Tables from HTML
by Corion (Patriarch) on Jan 29, 2004 at 09:22 UTC

    davido already has the problems with your post down, but for the general task of extracting data out of possibly nested tables, I can recommend mojotoad's HTML::TableExtract. It turns HTML tables into easily accessible arrays.
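A short sketch of how HTML::TableExtract might be used on nested tables. The sample markup and the helper name `extract_tables` are made up for illustration; the real listings page will differ. With no constraints passed to `new`, every table matches, and each one is identified by its depth (nesting level) and count (position within that depth).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample: a layout table wrapping the table we actually want.
my $SAMPLE = <<'HTML';
<table>
<tr><td>Layout cell
  <table>
  <tr><th>Time</th><th>Show</th></tr>
  <tr><td>8:00</td><td>News</td></tr>
  </table>
</td></tr>
</table>
HTML

# Return one entry per table found: [depth, count, [rows]].
sub extract_tables {
    my ($html) = @_;
    my $te = HTML::TableExtract->new;    # no constraints: match every table
    $te->parse($html);
    my @found;
    for my $ts ($te->tables) {
        my ($depth, $count) = $ts->coords;
        push @found, [ $depth, $count, [ $ts->rows ] ];
    }
    return @found;
}

if (!caller()) {
    for my $t (extract_tables($SAMPLE)) {
        my ($depth, $count, $rows) = @$t;
        print "Table at depth $depth, count $count:\n";
        for my $row (@$rows) {
            print '  ', join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
        }
    }
}
```

Once you know which depth and count hold the data, passing `depth => ...` and `count => ...` (or `headers => [...]`) to `new` narrows the match to just that table.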

    Another solution might be Template::Extract, if you already understand the Template::Toolkit syntax and want to convert an HTML page back into a Perl structure.
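A rough sketch of the Template::Extract idea, under the assumption that the rows follow a fixed, predictable markup pattern. The template, the sample document, and the names `row`, `time`, `show`, and `extract_rows` are all invented for illustration. The template has to mirror the page's markup closely, whitespace included, so treat this as a starting point rather than something that will work unmodified on a real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Template::Extract;

# Template Toolkit-style pattern describing the (hypothetical) markup.
my $TEMPLATE = '<table>[% FOREACH row %]<tr><td>[% time %]</td>'
             . '<td>[% show %]</td></tr>[% END %]</table>';

# Invented document that matches the template exactly.
my $DOCUMENT = '<table><tr><td>8:00</td><td>News</td></tr>'
             . '<tr><td>9:00</td><td>Movie</td></tr></table>';

# Return a list of hashrefs, one per matched row (empty list on no match).
sub extract_rows {
    my ($html) = @_;
    my $data = Template::Extract->new->extract($TEMPLATE, $html);
    return $data ? @{ $data->{row} || [] } : ();
}

if (!caller()) {
    for my $row (extract_rows($DOCUMENT)) {
        print "$row->{time}: $row->{show}\n";
    }
}
```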

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web

Re: Grabbing Embedded Tables from HTML
by l3nz (Friar) on Jan 29, 2004 at 11:09 UTC
    Unfortunately, in the case given here it's impossible to say more without having a clue as to how the table is made. If they use an image, for instance, it will be very hard to extract data from it using Perl. What is certain is that you'll have to automate logging in to the site with your credentials before accessing the listings page; therefore I'd play with the LWP family of modules.
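A hedged sketch of the login-then-fetch flow with LWP::UserAgent. The login URL and the form field names `username` and `password` are placeholders, not taken from the site; the real values have to be read from the site's own login form. The helper name `fetch_listings` is likewise hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder login URL (an assumption); the listings URL is from the question.
my $LOGIN_URL    = 'http://www.example.com/login';
my $LISTINGS_URL = 'http://www.tvguide.com/listings/index.asp';

# Log in, then fetch the listings page with the same session cookies.
sub fetch_listings {
    my ($user, $pass) = @_;

    # An in-memory cookie jar keeps the session cookie between requests.
    my $ua = LWP::UserAgent->new(cookie_jar => {});

    # Field names here are hypothetical; check the real form.
    my $login = $ua->post($LOGIN_URL,
                          { username => $user, password => $pass });
    die 'Login failed: ' . $login->status_line
        unless $login->is_success || $login->is_redirect;

    my $page = $ua->get($LISTINGS_URL);
    die 'Fetch failed: ' . $page->status_line unless $page->is_success;
    return $page->decoded_content;
}

# Only touch the network when run as: perl script.pl USER PASS
if (!caller() && @ARGV == 2) {
    print fetch_listings(@ARGV);
}
```

The returned HTML can then be handed to whichever table parser you settle on. If the login form relies on hidden fields or JavaScript, WWW::Mechanize (built on LWP) may be easier to work with.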
