HTML:TableExtract...?

thoth has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML:TableExtract...? by sauoq (Abbot) on Dec 30, 2002 at 20:42 UTC
Hmmm, where to begin... How about with reading the documentation? The examples make it pretty clear that the `parse` method parses the tables out into something the author calls "table states", and that those are available via `$te->table_states()`, which returns a list of them. Then you have to use yet another method, `rows`, to get at the rows in that "table state." There is a shortcut whereby you can invoke the `rows` method on the original TableExtract object (after calling the `parse` method) and it will return a list of the rows in the first table in the document. I've never used the module. On first inspection, I think its interface could be improved.

-sauoq "My two cents aren't worth a dime.";
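To make the two access paths described above concrete, here is a minimal sketch rather than anything taken from the module's documentation. The HTML snippet and its values are made up for illustration. It also assumes a reasonably recent HTML::TableExtract, where the accessor is documented as `tables`; as far as I can tell, the `table_states` spelling used in this thread is the older name for the same thing, so substitute it if you are on an old release.

```perl
use strict;
use warnings;
use HTML::TableExtract;   # CPAN module, not part of core Perl

# Made-up sample page; the file names, versions and dates are illustrative only.
my $html = <<'HTML';
<table>
  <tr><th>File</th><th>Version</th><th>Date</th></tr>
  <tr><td>DAT</td><td>4240</td><td>2002-12-25</td></tr>
  <tr><td>Engine</td><td>4160</td><td>2002-03-01</td></tr>
</table>
HTML

# Select tables by their column headers, then parse the document.
my $te = HTML::TableExtract->new( headers => [qw(File Version Date)] );
$te->parse($html);

# Long form: walk every matched table, then every row within it.
my @long_form;
foreach my $ts ( $te->tables ) {
    print "Table found at ", join( ',', $ts->coords ), ":\n";
    foreach my $row ( $ts->rows ) {
        push @long_form, [@$row];
        print "  ", join( ',', @$row ), "\n";
    }
}

# Shortcut: rows() on the extractor itself uses the first matched table.
my @short_form = $te->rows;
```

For a page with a single table of interest, the shortcut and the long form should yield the same rows; the long form only earns its keep when several tables match.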
Re: Re: HTML:TableExtract...? by Trimbach (Curate) on Dec 30, 2002 at 22:37 UTC
The "table states" are used to deal with situations where you have tables within cells of other tables within cells of other tables (...etc.). The primary aim of the module is to "extract" data from heavily formatted web pages, not from tables used to store plain data. Yeah, it's overkill if you want to suck data out of a data table with no embedded subtables, but that's why there are shortcut methods. It's actually a pretty useful module for anyone who has tried to zero in on some piece of data on a webpage and has pulled their hair out trying to get a home-rolled regex-based or HTML::Parser solution to work. It's quite a spiffy module. Lets you get on and worry about other things than deciphering a page full of td tags. :-)

Gary Blackburn
Trained Killer
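The nested-table case is where the module pays off. A minimal sketch, assuming the `depth` and `count` constructor options of HTML::TableExtract and a made-up layout page (the table contents are hypothetical):

```perl
use strict;
use warnings;
use HTML::TableExtract;   # CPAN module, not part of core Perl

# Made-up layout: an outer table used for page layout, with the
# data table nested inside one of its cells.
my $html = <<'HTML';
<table>
  <tr>
    <td>Navigation</td>
    <td>
      <table>
        <tr><td>Product</td><td>Price</td></tr>
        <tr><td>Widget</td><td>9.99</td></tr>
      </table>
    </td>
  </tr>
</table>
HTML

# depth => 0 is the outermost level; depth => 1 means "inside a cell of a
# top-level table". count => 0 picks the first table found at that depth.
my $te = HTML::TableExtract->new( depth => 1, count => 0 );
$te->parse($html);

my @inner_rows;
foreach my $ts ( $te->tables ) {
    print "Table found at ", join( ',', $ts->coords ), ":\n";
    foreach my $row ( $ts->rows ) {
        push @inner_rows, [@$row];
        print "  ", join( ' | ', @$row ), "\n";
    }
}
```

Selecting by `depth` and `count` is brittle if the page layout changes; matching on `headers` is usually the sturdier choice when the data table has recognizable column titles.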
Re: HTML:TableExtract...? by Marza (Vicar) on Dec 30, 2002 at 22:20 UTC
It might be how you are loading the page for TableExtract. Here is an old little sub I used to get versions from the McAfee site. It does not have link headers, but the info I needed to locate were links. See if it helps illuminate your situation.

    sub GetCurrentVersion {
        my $key;
        my $dat;
        my $superdat;
        my $engine;
        my $html_code;
        my $row;
        my $ts;

        #
        # Lets access the Network Associates download page for the Virus Info.
        my $ua = new LWP::UserAgent;
        my $url = 'http://www.mcafeeb2b.com/naicommon/download/dats/find.asp';
        my $request = new HTTP::Request('GET',$url);

        #
        # Do we need to login?
        #$request->authorization_basic('login', 'password');
        $ua->timeout(10);
        my $response = $ua->request($request);
        my $responsecode = $response->code();

        #
        # Now we need to gather the information.
        if ($responsecode != 200) {
            print "Failed to Access the Mcafee site!: $responsecode\n";
        } else {
            #
            # Load the HTML junk into a var (reuse the response we already have
            # instead of requesting the page a second time).
            my @array = (split "\n", $response->as_string);
            foreach (@array) {
                $html_code .= $_ . "\n";
            }
        }

        #
        # It's time to use TableExtract. We use the File Version and Date for
        # header info to locate our tables. Once found, we look for certain
        # names and assign the version info to our needed vars.
        my $te = new HTML::TableExtract( headers => [qw(File Version Date)] );
        $te->parse($html_code);
        foreach $ts ($te->table_states) {
            #print "Table found at ", join(',', $ts->coords), ":\n";
            foreach $row ($ts->rows) {
                #
                # DAT Version
                if ($row->[0] =~ /DAT File for weekly v4x $DAT Only$/) {
                    #print "DAT = $row->[1]\n";
                    $dat = $row->[1];
                }

                #
                # SuperDAT version which has both Engine and DAT.
                if ($row->[0] =~ /SuperDat File for v4x $DAT \+ Engine$/) {
                    #print "Superdat = $row->[1]\n";
                    $superdat = $row->[1];
                }

                #
                # Engine only Version.
                if ($row->[0] =~ /Superdat File for v4x $Intel Engine only$/) {
                    $engine = $row->[1];
                    $engine = int $engine;
                }
                #print " ", join(' , ', @$row), "\n";
            }
        }
        return($dat, $superdat, $engine);
    }
Re: Re: HTML:TableExtract...? by thoth (Novice) on Dec 31, 2002 at 18:38 UTC
Thanks for all of the help that you have provided. I had an error in the part that handles the printing of the items. I had cut and pasted it from a site, and it had the error, which forced me to try another method that didn't work either. This is the correct code: `foreach $row ($te->rows) { print join(',', @$row), "\n"; }` handles it nicely.

Thoth