in reply to Extract data from table

Gack, talk about squishing a lot of data onto a single page. Were you just after the scoring summary in the left hand column? I hope so.

For extracting information from webpages covered in tables, there are a couple of methods I find myself using over and over again. (a) Reduce the amount you're dealing with, usually by using regexps to extract relevant blocks, or discard irrelevant sections; (b) Use HTML::TableExtract or some other such tool to get the final information. These work really well in unison.

Now, if you're after the scoring summary, you've got a great piece of static text to look for: the "SCORING SUMMARY" heading. In fact, looking at the HTML, there's a very distinctive span tag sitting there for us as well: <span id=linescore>. Let's use that.

So, assuming we've grabbed the page (using LWP::UserAgent, or whatever you prefer to use), and it's sitting in $source, we can do this:

($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;
That will grab us everything from the span tag we identified, up until the end of the table that follows it.
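
If you want to convince yourself of what that regexp does, here is a minimal sketch run against a made-up scrap of HTML. The sample markup is purely hypothetical and only stands in for the real page; the pattern is the same one as above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real page source, for illustration only.
my $source = <<'HTML';
<table><tr><td>Navigation junk</td></tr></table>
<span id=linescore>
<table>
<tr><td></td><td>1</td><td>2</td></tr>
<tr><td>Seattle</td><td>0</td><td>0</td></tr>
</table>
</span>
<table><tr><td>More junk</td></tr></table>
HTML

# /s lets . match newlines; the non-greedy .*? stops at the first
# </table> that follows the span tag.
my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

# In list context a failed match returns an empty list, leaving
# $scoretable undefined, so it is worth checking before parsing.
die "Couldn't find the linescore block\n" unless defined $scoretable;

print $scoretable, "\n";
```

The tables before and after the span never make it into $scoretable, which is the whole point of trimming first.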

Now we just need to run HTML::TableExtract over it.

Taking an example almost directly from the HTML::TableExtract man page:

my $te = HTML::TableExtract->new;
$te->parse($scoretable);

foreach my $ts ($te->table_states) {
    print "Table (", join(",",$ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row),"\n";
    }
}
Which displays this:
Table (0,0):
,1,2,3,4,OT,T
Seattle ,0,0,7,7, ,14
Oakland ,7,17,14,0, ,38
So, we have one table, with three rows. The first is just headings for the quarters, the second is the scores for Seattle, the third is for Oakland. From here, it's easy:
my $parsed_table = ($te->table_states)[0];
my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];
Ta da! $team1_scores and $team2_scores are arrayrefs containing the score information we want. Now we never need to watch the football again. ;)
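
As a rough sketch of what you might do with those arrayrefs, the snippet below pulls the team name and final total out of each row. The rows are typed in by hand from the output shown earlier, and the exact cell contents (trailing space on the name, a blank overtime cell) are my assumption about what the parser hands back. summarise() is a hypothetical helper, not part of HTML::TableExtract.

```perl
use strict;
use warnings;

# Rows copied by hand from the output above, standing in for what
# HTML::TableExtract returned. Exact whitespace is an assumption.
my $team1_scores = ['Seattle ', 0, 0, 7, 7, ' ', 14];
my $team2_scores = ['Oakland ', 7, 17, 14, 0, ' ', 38];

# Hypothetical helper: returns the trimmed team name and the final total.
sub summarise {
    my ($row) = @_;
    my ($name, @cells) = @$row;   # first cell is the team name
    $name =~ s/^\s+|\s+$//g;      # strip stray whitespace around the name
    my $total = pop @cells;       # last cell is the "T" (total) column
    return ($name, $total);
}

for my $row ($team1_scores, $team2_scores) {
    my ($name, $total) = summarise($row);
    print "$name: $total\n";
}
```

With the rows above, that prints "Seattle: 14" and "Oakland: 38".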

I've listed the entire program below, for your convenience. You'll need to save the source of the website and give it as a command line argument or on STDIN. Of course, your real program would no doubt fetch it by itself.
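
For the fetching part, something along these lines should do. It's only a sketch: fetch_source() is a name I've made up, the timeout is arbitrary, and the URL in the comment is a placeholder, since you know where your scores page lives and I don't.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch a page and return its HTML, or die trying.
sub fetch_source {
    my ($url) = @_;

    my $ua       = LWP::UserAgent->new(timeout => 30);
    my $response = $ua->get($url);

    # status_line gives something like "404 Not Found" for the error message.
    die "Couldn't fetch $url: ", $response->status_line, "\n"
        unless $response->is_success;

    return $response->decoded_content;
}

# Placeholder URL, substitute the real scores page:
# my $source = fetch_source('http://www.example.com/boxscore.html');
```

You'd then use the returned string as $source in place of the slurp from STDIN/ARGV.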

Cheers,
Paul

#!/usr/bin/perl -w

# Example of how to parse football scores.
# Written by Paul Fenwick <pjf@cpan.org>, October 2001
# Used as an example on PerlMonks.

use strict;
use HTML::TableExtract;
use Data::Dumper;

# Normally we'd use LWP::UserAgent to fetch the source,
# here we just load it from STDIN/ARGV.

local $/ = undef;
my $source = <>;

# Extract our relevant table.
my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

# Parse the table.
my $te = HTML::TableExtract->new;
$te->parse($scoretable);

# Show the structure of what we've parsed.
foreach my $ts ($te->table_states) {
    print "Table (", join(",",$ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row),"\n";
    }
}

# Extract the interesting bits we want.
my $parsed_table = ($te->table_states)[0];
my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];

# Use Data::Dumper to show that we've succeeded.
print Dumper($team1_scores,$team2_scores);

__END__