Putting HTML::Element content_list into a hash

TsuDohNihm has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I am trying to scrape the TV schedule from the C-SPAN website at http://inside.c-spanarchives.org:8080/cspan/schedule.csp. I am a C-SPAN junkie (which may seem weird to all the CPAN junkies here). Here is what I have so far:

#!/usr/bin/perl -w
use warnings;
use strict;
use HTML::TreeBuilder;

my $html = <<'EOHTML';
<table cellspacing="0" cellpadding="2" border="0" width="100%">
    <tr>
      <td width="20%" align="right" valign="top" bgcolor="#CCCCCC">
      <font face="arial, helvetica" size="2"><b>02:44 AM
      EDT</b></font><br>
      <font face="arial, helvetica" size="1">0:42
      (est.)</font><br></td>

      <td valign="top"><font face="arial, helvetica" size=
      "1">Speech</font><br>
      <font face="arial, helvetica" size="2"><a href=
      "/cspan/cspan.csp?command=dprogram&amp;record=142524675">U.S.-Japan
      Relations</a><br>
      Asia Society, Washington Center<br></font> <font face=
      "arial, helvetica" size="2" color="#CC0000">Ryozo Kato</font>
      <font face="arial, helvetica" size="1">, Japan</font></td>
    </tr>
  </table>
EOHTML


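# new_from_content already parses $html, so no separate parse_content call is needed.
my $tree = HTML::TreeBuilder->new_from_content($html);
my $c = $tree->look_down(
        "_tag", "table",
        "width", "100%"
        );
# Child elements are reduced to trimmed text; plain text nodes pass through unchanged.
my @trimmed_text = map { ref($_) ? $_->as_trimmed_text : $_ } $c->content_list;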

print "@trimmed_text\n";

#

This prints:
02:44 AM EDT0:42 (est.)SpeechU.S.-Japan Relations Asia Society, Washington Center Ryozo Kato , Japan

But I am looking to put the data above into a hash of this form:

 %h_cspan = (
    time   => '02:44 AM EDT',
    length => '0:42 (est.)',
    type   => 'Speech',
    title  => 'U.S.-Japan Relations',
    org    => 'Asia Society, Washington Center',
 );

Is this even possible with the HTML::Element look_down method? Any help is appreciated.

Re: Putting HTML::Element content_list into a hash by wfsp (Abbot) on Apr 22, 2006 at 09:32 UTC
Does it have to be HTML::TreeBuilder? If not, this might be useful.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TokeParser::Simple;

my $html = <<'EOHTML';
<table cellspacing="0" cellpadding="2" border="0" width="100%">
    <tr>
      <td width="20%" align="right" valign="top" bgcolor="#CCCCCC">
      <font face="arial, helvetica" size="2"><b>02:44 AM
      EDT</b></font><br>
      <font face="arial, helvetica" size="1">0:42
      (est.)</font><br></td>

      <td valign="top"><font face="arial, helvetica" size=
      "1">Speech</font><br>
      <font face="arial, helvetica" size="2"><a href=
      "/cspan/cspan.csp?command=dprogram&record=142524675">U.S.-Japan
      Relations</a><br>
      Asia Society, Washington Center<br></font> <font face=
      "arial, helvetica" size="2" color="#CC0000">Ryozo Kato</font>
      <font face="arial, helvetica" size="1">, Japan</font></td>
    </tr>
  </table>
EOHTML

my $tp = HTML::TokeParser::Simple->new(\$html)
  or die "Couldn't parse string: $!";

my $start;
my @scraped;
while (my $t = $tp->get_token) {
  $start++, next if $t->is_start_tag('table');
  next unless $start;
  my $text = $tp->get_trimmed_text('br');
  push @scraped, $text;
}

my @keys = qw(time length type title org1 org2);
my %h_cpan = map {$keys[$_] => $scraped[$_]} (0..$#keys);

#for (0..$#keys){
#  $h_cpan{$keys[$_]} = $scraped[$_];
#}

print Dumper \%h_cpan;

Output:

---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
$VAR1 = {
          'org2' => 'Ryozo Kato , Japan',
          'length' => '0:42 (est.)',
          'time' => '02:44 AM EDT',
          'org1' => 'Asia Society, Washington Center',
          'title' => 'U.S.-Japan Relations',
          'type' => 'Speech'
        };

> Terminated with exit code 0.

A couple of points: the org key is in two parts. Also, there is probably a more elegant way of loading the array into the hash. Hope that helps.

Update: Changed the for loop to map.
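On the "more elegant way of loading the array into the hash": one common Perl idiom is a hash slice, which assigns a list of values to a list of keys in a single statement. The sketch below is not part of wfsp's reply; it hard-codes @scraped with the values from the output above so that it runs on its own.

```perl
use strict;
use warnings;

# Values in the order the scraping loop produced them (copied from the output above).
my @scraped = (
    '02:44 AM EDT',
    '0:42 (est.)',
    'Speech',
    'U.S.-Japan Relations',
    'Asia Society, Washington Center',
    'Ryozo Kato , Japan',
);
my @keys = qw(time length type title org1 org2);

# Hash slice: pairs each key with the value at the same position.
my %h_cpan;
@h_cpan{@keys} = @scraped;

print "$_ => $h_cpan{$_}\n" for @keys;
```

If @scraped were shorter than @keys, the leftover keys would simply be set to undef, the same as with the map version.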
Re: Putting HTML::Element content_list into a hash by bobf (Monsignor) on Apr 22, 2006 at 21:26 UTC
Here is one way to do it, using HTML::TableExtract (written by our own mojotoad):

use strict;
use warnings;
use HTML::TableExtract;
use Data::Dumper;

my $html = <<'EOHTML';
<table cellspacing="0" cellpadding="2" border="0" width="100%">
    <tr>
      <td width="20%" align="right" valign="top" bgcolor="#CCCCCC">
      <font face="arial, helvetica" size="2"><b>02:44 AM
      EDT</b></font><br>
      <font face="arial, helvetica" size="1">0:42
      (est.)</font><br></td>

      <td valign="top"><font face="arial, helvetica" size=
      "1">Speech</font><br>
      <font face="arial, helvetica" size="2"><a href=
      "/cspan/cspan.csp?command=dprogram&record=142524675">U.S.-Japan
      Relations</a><br>
      Asia Society, Washington Center<br></font> <font face=
      "arial, helvetica" size="2" color="#CC0000">Ryozo Kato</font>
      <font face="arial, helvetica" size="1">, Japan</font></td>
    </tr>
  </table>
EOHTML

my $te = HTML::TableExtract->new();
$te->parse( $html );

my @datatypes = qw( time length type title org );

foreach my $ts ( $te->table_states )
{
    foreach my $row ( $ts->rows )
    {
        my @extracted = map { split( /\n\n/, $_ ) } @$row;
        my %data = map { $datatypes[$_] => clean_whitespace( $extracted[$_] ) }
                   ( 0 .. $#datatypes );
        print Dumper( \%data );
    }
}

sub clean_whitespace
{
    my ( $text ) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    return $text;
}

__OUTPUT__
$VAR1 = {
          'org' => 'Asia Society, Washington Center Ryozo Kato , Japan',
          'title' => 'U.S.-Japan Relations',
          'length' => '0:42 (est.)',
          'type' => 'Speech',
          'time' => '02:44 AM EDT'
        };
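Neither reply stays with HTML::TreeBuilder, so for completeness here is a minimal sketch of how the original look_down approach might be pushed to a hash. It is an illustration, not code from the thread. It assumes every field is terminated by a <br>, as in the sample row, and that the marker character "\x1F" never appears in the page text. The HTML is a shortened, hypothetical version of the sample with the font tags removed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened stand-in for the sample row (assumption: same <br> layout).
my $html = <<'EOHTML';
<table width="100%">
  <tr>
    <td><b>02:44 AM EDT</b><br>0:42 (est.)<br></td>
    <td>Speech<br><a href="#">U.S.-Japan Relations</a><br>
      Asia Society, Washington Center<br>Ryozo Kato, Japan</td>
  </tr>
</table>
EOHTML

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $table = $tree->look_down( _tag => 'table', width => '100%' )
    or die "No matching table found\n";

# Replace each <br> element with a marker string so that the field
# boundaries survive the flattening done by as_text.
my $marker = "\x1F";
$_->replace_with($marker) for $table->look_down( _tag => 'br' );

# Split the flattened text on the marker and tidy up each piece.
my @fields;
for my $field ( split /\Q$marker\E/, $table->as_text ) {
    $field =~ s/\s+/ /g;     # collapse runs of whitespace
    $field =~ s/^ | $//g;    # trim the ends
    push @fields, $field if length $field;
}

# Load the first five fields into the hash with a hash slice.
my @keys = qw(time length type title org);
my %h_cspan;
@h_cspan{@keys} = @fields[ 0 .. $#keys ];

print "$_ => $h_cspan{$_}\n" for @keys;
$tree->delete;
```

Any text after the last <br> (the speaker line here) becomes a sixth field, which the slice ignores. On the real schedule page the table would hold many rows, so the same steps would presumably be applied to each tr in turn rather than to the whole table.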