Re: Extract data from table

Gack, talk about squishing a lot of data onto a single page. Were you just after the scoring summary in the left hand column? I hope so.

For extracting information from webpages covered in tables, there are a couple of methods I find myself using over and over again: (a) reduce the amount of HTML you're dealing with, usually by using regexps to extract the relevant blocks or discard irrelevant sections; (b) use HTML::TableExtract or some other such tool to get the final information. These work really well in unison.

Now, if you're after the scoring summary, you've got a great piece of static text to look for, that being the "SCORING SUMMARY" heading. In fact, looking at the HTML, there's a very distinctive span tag sitting there for us as well: <span id=linescore>. Let's use that.

So, assuming we've grabbed the page (using LWP::UserAgent, or whatever you prefer to use), and it's sitting in $source, we can do this:

my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

That will grab us everything from the span tag we identified, up until the end of the table that follows it.
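If you want to convince yourself of what that regexp captures, here's a minimal sketch you can run without the real page. The HTML in $source is a made-up, heavily trimmed stand-in (the surrounding "junk" tables are my invention, not the actual site markup), so treat it as an illustration only:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real page source: one table of junk
# before the span, the table we want, and another table of junk after.
my $source = <<'HTML';
<table><tr><td>navigation junk</td></tr></table>
<span id=linescore>SCORING SUMMARY</span>
<table>
<tr><td>Seattle</td><td>14</td></tr>
<tr><td>Oakland</td><td>38</td></tr>
</table>
<table><tr><td>more junk</td></tr></table>
HTML

# /s lets . match newlines; .*? is non-greedy, so the capture stops at
# the first </table> after the span rather than the last one on the page.
my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

# In list context a failed match returns an empty list, so check first.
die "Couldn't find the linescore block\n" unless defined $scoretable;

print $scoretable, "\n";
```

The non-greedy .*? is the important bit here. With a plain .* the match would run all the way to the last </table> on the page, and we'd be back to wading through everything.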

Now we just need to run HTML::TableExtract over it.

Taking an example almost directly from the HTML::TableExtract man page:

my $te = HTML::TableExtract->new;

$te->parse($scoretable);

foreach my $ts ($te->table_states) {
        print "Table (", join(",",$ts->coords), "):\n";
        foreach my $row ($ts->rows) {
                print join(',', @$row),"\n";
        }
}

Which displays this:

Table (0,0):
 ,1,2,3,4,OT,T
Seattle ,0,0,7,7, ,14
Oakland ,7,17,14,0, ,38

So, we have one table, with three rows. The first is just headings for the quarters, the second is the scores for Seattle, the third is for Oakland. From here, it's easy:

my $parsed_table = ($te->table_states)[0];
my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];

Ta da! $team1_scores and $team2_scores are arrayrefs containing the score information we want. Now we never need to watch the football again. ;)
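As a rough sketch of what you might do with those arrayrefs next, here's one way to pull out each team's name and final score. The rows are hard-coded from the output shown earlier so this runs on its own, and name_and_total is a hypothetical helper of mine, not anything HTML::TableExtract provides. I'm also assuming the first cell is always the team name and the last cell is always the total, which holds for this table but may not for others:

```perl
use strict;
use warnings;

# Sample rows copied from the output shown earlier; in the real program
# these would come from ($parsed_table->rows)[1,2].
my $team1_scores = ['Seattle ', 0, 0, 7, 7, ' ', 14];
my $team2_scores = ['Oakland ', 7, 17, 14, 0, ' ', 38];

# Hypothetical helper: returns (name, total) for one row, assuming the
# first cell is the team name and the last cell is the final score.
sub name_and_total {
    my ($row) = @_;
    my $name  = $row->[0];
    my $total = $row->[-1];
    $name =~ s/^\s+|\s+$//g;    # cells may carry stray whitespace
    return ($name, $total);
}

for my $row ($team1_scores, $team2_scores) {
    my ($name, $total) = name_and_total($row);
    print "$name scored $total\n";
}
```

Trimming the name is worth doing because table cells often come back with trailing whitespace, as the "Seattle ," in the output suggests.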

I've listed the entire program below, for your convenience. You'll need to save the source of the website and give it as a command line argument or on STDIN. Of course, your real program would no doubt fetch it by itself.

Cheers,
Paul

#!/usr/bin/perl -w

# Example of how to parse football scores.
# Written by Paul Fenwick <pjf@cpan.org>, October 2001
# Used as an example on PerlMonks.

use strict;
use HTML::TableExtract;
use Data::Dumper;

# Normally we'd use LWP::UserAgent to fetch the source,
# here we just load it from STDIN/ARGV.

local $/ = undef;
my $source = <>;

# Extract our relevant table.

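my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

# Bail out early if the page layout isn't what we expected.
die "Couldn't find the linescore table in the input\n"
        unless defined $scoretable;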

# Parse the table.
my $te = HTML::TableExtract->new;
$te->parse($scoretable);

# Show the structure of what we've parsed.
foreach my $ts ($te->table_states) {
        print "Table (", join(",",$ts->coords), "):\n";
        foreach my $row ($ts->rows) {
                print join(',', @$row),"\n";
        }
}

# Extract the interesting bits we want.
my $parsed_table = ($te->table_states)[0];
my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];

# Use Data::Dumper to show that we've succeeded.
print Dumper($team1_scores,$team2_scores);

__END__
