in reply to Extract data from table
For extracting information from webpages covered in tables, there are a couple of methods I find myself using over and over again. (a) Reduce the amount you're dealing with, usually by using regexps to extract relevant blocks, or discard irrelevant sections; (b) Use HTML::TableExtract or some other such tool to get the final information. These work really well in unison.
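As a rough sketch of step (a), assuming the page is already sitting in a string, a couple of substitutions can throw away blocks you know you don't care about before anything else looks at the text. The sample HTML and the variable names below are made up for illustration; they're not taken from the real page.

```perl
use strict;
use warnings;

# Made-up sample page; a real one would come from the web.
my $html = <<'HTML';
<html><head><script>var x = "<table>";</script></head>
<body><!-- ad banner <table><tr><td>junk</td></tr></table> -->
<table><tr><td>real data</td></tr></table>
</body></html>
HTML

# Discard script blocks and comments so later matching can't trip over them.
(my $reduced = $html) =~ s#<script\b.*?</script>##gis;
$reduced =~ s#<!--.*?-->##gs;

# Only the table we actually care about should be left.
my @tables = $reduced =~ m#(<table\b.*?</table>)#gis;
print scalar(@tables), " table(s) left\n";
```

Whether discarding or extracting is the better first move depends on the page; for a page with one distinctive landmark, extracting is usually less work.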
Now, if you're after the scoring summary, you've got a great piece of static text to look for, that being the "SCORING SUMMARY" heading. In fact, looking at the HTML, there's a very distinctive span tag sitting there for us as well: <span id=linescore>. Let's use that.
So, assuming we've grabbed the page (using LWP::UserAgent, or whatever you prefer to use), and it's sitting in $source, we can do this:

    ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

That will grab us everything from the span tag we identified, up until the end of the table that follows it.
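To see what the non-greedy `.*?` and the `s` modifier are doing, here's a small self-contained check. The HTML is a made-up stand-in for the real page (an assumption on my part), and the `die` is an extra safety net that isn't in the one-liner.

```perl
use strict;
use warnings;

# Made-up stand-in for the real page source.
my $source = <<'HTML';
<table><tr><td>navigation</td></tr></table>
<span id=linescore>SCORING SUMMARY</span>
<table>
<tr><td>Seattle</td><td>14</td></tr>
</table>
<table><tr><td>footer</td></tr></table>
HTML

# /s lets . match newlines; .*? stops at the first </table> after the span.
my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

# If the page layout changes, the match fails and $scoretable is undef.
die "Couldn't find the linescore table\n" unless defined $scoretable;

print $scoretable, "\n";
```

With a greedy `.*` the capture would run on to the last `</table>` on the page and drag the footer table along with it.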
Now we just need to run HTML::TableExtract over it.
Taking an example almost directly from the HTML::TableExtract man page:

    my $te = HTML::TableExtract->new;
    $te->parse($scoretable);

    foreach my $ts ($te->table_states) {
        print "Table (", join(",", $ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            print join(',', @$row), "\n";
        }
    }

Which displays this:

    Table (0,0):
    ,1,2,3,4,OT,T
    Seattle ,0,0,7,7, ,14
    Oakland ,7,17,14,0, ,38

So, we have one table, with three rows. The first is just headings for the quarters, the second is the scores for Seattle, the third is for Oakland. From here, it's easy:

    my $parsed_table = ($te->table_states)[0];
    my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];

Ta da! $team1_scores and $team2_scores are arrayrefs containing the score information we want. Now we never need to watch the football again. ;)
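As a sketch of what you might do with those arrayrefs next, here's one way to tidy them into a hash of final scores. The rows are typed in by hand from the output shown above rather than parsed, and the `%final` hash and the sanity check are my own additions. I'm also assuming the empty OT cell is plain whitespace; on the real page it could just as easily be something else.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Rows as they appear in the output above: name, four quarters, OT, total.
my $team1_scores = ['Seattle ', 0, 0, 7, 7, ' ', 14];
my $team2_scores = ['Oakland ', 7, 17, 14, 0, ' ', 38];

my %final;
foreach my $row ($team1_scores, $team2_scores) {
    my ($name, @cells) = @$row;
    $name =~ s/\s+$//;                      # cell text has trailing whitespace
    my $total   = pop @cells;               # last column is the final score
    my @periods = grep { /^\d+$/ } @cells;  # skip the blank OT cell
    warn "$name: periods don't add up\n" unless sum(@periods) == $total;
    $final{$name} = $total;
}

# Winner first.
print "$_: $final{$_}\n" for sort { $final{$b} <=> $final{$a} } keys %final;
```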
I've listed the entire program below, for your convenience. You'll need to save the source of the website and give it as a command line argument or on STDIN. Of course, your real program would no doubt fetch it by itself.
Cheers,
Paul

    #!/usr/bin/perl -w

    # Example of how to parse football scores.
    # Written by Paul Fenwick <pjf@cpan.org>, October 2001
    # Used as an example on PerlMonks.

    use strict;
    use HTML::TableExtract;
    use Data::Dumper;

    # Normally we'd use LWP::UserAgent to fetch the source,
    # here we just load it from STDIN/ARGV.
    local $/ = undef;
    my $source = <>;

    # Extract our relevant table.
    my ($scoretable) = $source =~ m#<span id=linescore>(.*?</table>)#s;

    # Parse the table.
    my $te = HTML::TableExtract->new;
    $te->parse($scoretable);

    # Show the structure of what we've parsed.
    foreach my $ts ($te->table_states) {
        print "Table (", join(",", $ts->coords), "):\n";
        foreach my $row ($ts->rows) {
            print join(',', @$row), "\n";
        }
    }

    # Extract the interesting bits we want.
    my $parsed_table = ($te->table_states)[0];
    my ($team1_scores, $team2_scores) = ($parsed_table->rows)[1,2];

    # Use Data::Dumper to show that we've succeeded.
    print Dumper($team1_scores, $team2_scores);

    __END__