Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to extract data from an HTML page with a lot of tables. The problem is that the tables don't use the HTML header tags for their headers. The data I'm trying to retrieve is in a table that is within a table, so I don't know the 'depth' or the 'count'. I tried just using the 'Headers' function to get the data, but that didn't work. Is there a way to do this? I looked at the HTML::TableExtract module, but the description didn't help me. Specifically, I am trying to extract the defensive statistics of a football game. Here is a sample webpage that I'm trying to extract the data from:
http://scores.nfl.com/scores/2001/week3/gamecenter/NFL_20010930_SEA@OAK.htm

Replies are listed 'Best First'.
Re: Extract data from table
by pjf (Curate) on Oct 02, 2001 at 03:56 UTC
    Gack, talk about squishing a lot of data onto a single page. Were you just after the scoring summary in the left hand column? I hope so.

    For extracting information from webpages covered in tables, there are a couple of methods I find myself using over and over again. (a) Reduce the amount you're dealing with, usually by using regexps to extract relevant blocks, or discard irrelevant sections; (b) Use HTML::TableExtract or some other such tool to get the final information. These work really well in unison.

    Now, if you're after the scoring summary, you've got a great piece of static text to look for, that being the "SCORING SUMMARY" heading. In fact, looking at the HTML, there's a very distinctive span tag sitting there for us as well: <span id=linescore>. Let's use that.

    So, assuming we've grabbed the page (using LWP::UserAgent, or whatever you prefer to use), and it's sitting in $source, we can do this:

    ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;
    That will grab us everything from the span tag we identified, up until the end of the table that follows it.
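    As a minimal sketch of that step, here is the same regexp run against a small hand-written stand-in for the page. The HTML in $source is an assumption made up for illustration; only the span tag and the regexp come from the post above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the downloaded page; the real markup will differ.
my $source = <<'HTML';
<table><tr><td>navigation junk</td></tr></table>
<span id=linescore>
<table>
<tr><td></td><td>1</td><td>2</td><td>T</td></tr>
<tr><td>Seattle</td><td>0</td><td>7</td><td>7</td></tr>
</table>
</span>
<table><tr><td>more junk</td></tr></table>
HTML

# /s lets "." match newlines; the non-greedy ".*?" stops at the FIRST
# </table> after the span tag rather than the last one on the page.
my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

die "linescore block not found\n" unless defined $scoretable;

print $scoretable, "\n";
```

    Note that stopping at the first </table> only works when the table after the span tag has no tables nested inside it; with nesting, the capture would end at the inner table's closing tag.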

    Now we just need to run HTML::TableExtract over it.

    Taking an example almost directly from the HTML::TableExtract man page:

    my $te = HTML::TableExtract->new;
    $te->parse($scoretable);
    foreach my $ts ($te->table_states) {
        print "Table (", join(",",$ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            print join(',', @$row),"\n";
        }
    }
    Which displays this:
    Table (0,0):
    ,1,2,3,4,OT,T
    Seattle ,0,0,7,7, ,14
    Oakland ,7,17,14,0, ,38
    So, we have one table, with three rows. The first is just headings for the quarters, the second is the scores for Seattle, the third is for Oakland. From here, it's easy:
    my $parsed_table = ($te->table_states)[0];
    my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];
    Ta da! $team1_scores and $team2_scores are arrayrefs containing the score information we want. Now we never need to watch the football again. ;)
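    Once the rows are in hand, it may be handier to look scores up by team and period than by array index. The following is only a sketch: @rows is typed in by hand to mirror the output shown above (an assumption, not the result of a live parse), and the %score layout is just one possible choice.

```perl
use strict;
use warnings;

# Rows shaped like the parsed output shown above (assumed sample data).
my @rows = (
    [ '',         1, 2,  3,  4, 'OT', 'T' ],
    [ 'Seattle ', 0, 0,  7,  7, ' ',  14  ],
    [ 'Oakland ', 7, 17, 14, 0, ' ',  38  ],
);

# First row holds the period labels; the rest are one row per team.
my ($header, @teams) = @rows;

my %score;
for my $row (@teams) {
    my ($name, @cells) = @$row;
    $name =~ s/^\s+|\s+$//g;    # trim padding around the team name

    # Hash slice: pair each period label with the cell in the same column.
    my %by_period;
    @by_period{ @{$header}[1 .. $#$header] } = @cells;

    $score{$name} = \%by_period;
}

# Print each team's final total (the "T" column).
print "$_: $score{$_}{T}\n" for sort keys %score;
```

    With this shape, $score{Oakland}{2} is Oakland's second-quarter score and $score{Seattle}{T} is Seattle's total.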

    I've listed the entire program below, for your convenience. You'll need to save the source of the website and give it as a command line argument or on STDIN. Of course, your real program would no doubt fetch it by itself.

    Cheers,
    Paul

    #!/usr/bin/perl -w

    # Example of how to parse football scores.
    # Written by Paul Fenwick <pjf@cpan.org>, October 2001
    # Used as an example on PerlMonks.

    use strict;
    use HTML::TableExtract;
    use Data::Dumper;

    # Normally we'd use LWP::UserAgent to fetch the source,
    # here we just load it from STDIN/ARGV.
    local $/ = undef;
    my $source = <>;

    # Extract our relevant table.
    my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

    # Parse the table.
    my $te = HTML::TableExtract->new;
    $te->parse($scoretable);

    # Show the structure of what we've parsed.
    foreach my $ts ($te->table_states) {
        print "Table (", join(",",$ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            print join(',', @$row),"\n";
        }
    }

    # Extract the interesting bits we want.
    my $parsed_table = ($te->table_states)[0];
    my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];

    # Use Data::Dumper to show that we've succeeded.
    print Dumper($team1_scores,$team2_scores);

    __END__
Re: Extract data from table
by George_Sherston (Vicar) on Oct 02, 2001 at 04:24 UTC
    Monks who have spent more time than I roaming the caverns of CPAN with a waxen taper may be able to point you to a module that'll do this, but it does look a bit specialist. However it is perfectly possible to do it by hand, as it were, and that might be a fun exercise (only in perlmonks is this thought to be fun, however).

    The approach I'd take would be to read the html file into a $scalar_variable and then run it through carefully devised regular expressions to boil it down to the info you want. You might prefer to read it into an @array line by line, but then you'd have to be sure they break the lines in a regular way, so I'd stick with the scalar.

    Helpfully, they seem to give you dandy little section identifiers like <!--RECEIVINGSTATS--> which you could use to break it into chunks for working on. Then you would probably want to do some progressive matching first to get each row of the table into an array element, and then to break each row down into cells. For example,
    push @output, $1 while $data =~ s/htm>(.*)<\/td><\/tr>//;
    I'm intentionally not insulting your intelligence by spelling it out too much, but if you ran that on the right chunk of data you'd only have to hack out the </a> tags and then split each element of @output on <td></td> and you'll have an array of arrays of... incomprehensible info (I speak as an Englishman who, insofar as he takes any interest in sport, follows cricket).
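    For the curious, the chunk-then-rows-then-cells idea might be sketched as follows. The HTML fragment, the marker names other than RECEIVINGSTATS, and the player names are all made up for illustration, and the sketch swaps the destructive s/// loop above for non-destructive m//g matching, so treat it as one possible variation rather than the method described.

```perl
use strict;
use warnings;

# Hypothetical fragment; the row layout and extra markers are assumptions.
my $html = <<'HTML';
<!--RUSHINGSTATS-->
<table><tr><td>Player X</td><td>10</td><td>52</td></tr></table>
<!--RECEIVINGSTATS-->
<table>
<tr><td><a href=p1.htm>Player A</a></td><td>5</td><td>68</td></tr>
<tr><td><a href=p2.htm>Player B</a></td><td>3</td><td>41</td></tr>
</table>
<!--DEFENSIVESTATS-->
<table><tr><td>ignored</td></tr></table>
HTML

# 1. Cut out the chunk between one marker and the next (or end of input).
my ($chunk) = $html =~ /<!--RECEIVINGSTATS-->(.*?)(?:<!--|\z)/s
    or die "section not found\n";

# 2. Walk the chunk one row at a time; 3. split each row into cells.
my @table;
while ($chunk =~ m{<tr>(.*?)</tr>}gs) {
    my $row   = $1;
    my @cells = $row =~ m{<td>(.*?)</td>}gs;
    s/<[^>]+>//g for @cells;    # strip leftover tags such as <a ...> and </a>
    push @table, \@cells;
}

print join(' | ', @$_), "\n" for @table;
```

    The result is an array of arrays, one inner array per table row, which is the shape described above.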

    I suggest you give it a whirl and if you get it to work let us know; or if you get some way and get stuck, post your code, with as many thoughts as you have about where you go wrong, and you'll find people glad to help you get the rest of the way.

    Also, I suggest you check back here in a day, when you may find that someone has told you my way of doing it is a waste of your time and the easy way is...

    § George Sherston