Hello Perl Monks, I have to parse an HTML page to find the speakers listed on it and count the total number of sessions or tutorials each speaker has. This is a homework assignment, so I'm not looking for someone to 'do' it for me, just to point me in the right direction.
The following is the rough format of the page. I've removed the actual links and data but kept the structure of the data I need to parse. Below that is the code I have written so far.
<h2>Speakers</h2>
<p>Overview of what to expect</p>
<p>Our speaker list is growing. Please check back regularly to see who we have lined up for you.</p>
<span style="font-weight: bold; font-size: 1.0em;"><a href ="http://linktosite">A Speaker</a></span>
<br />
Speaker's background
<ul>
<li><b>Tutorial: <a href="link to info about the tutorial">Description</a></b></li>
</ul>
<p style="clear:both;">
<a href ="http://link to more info about speaker">Click here</a> for more info.
</p>
<span style="font-weight: bold; font-size: 1.0em;"><a href ="http://linktosite about speaker">Another Speaker</a></span>
<br />
Information about this speaker.
<P>
He is the author of xxxx
<a href="http://link to book/"><i>Book name</i></a>, contributes
<a href="http://link to article">articles</a> to more info.
<ul>
<li><b>Session: <a href="http://link to session">Session description</a></b></li>
</ul>
<p style="clear:both;">
<a href ="http://link to more info about speaker">Click here</a> for more info.
</p>
My code so far: I've managed to pull the speakers and place them into a hash, but I have been unable to work out a way to capture the Session or Tutorial elements for each speaker. There are other elements of the same format that I don't want to catch. Some of the code is 'dodgy' as a result of different attempts to get the proper links.
Note: This is homework so I'm looking for guidance on where I am going wrong or suggestions on where I should be looking.
#!/usr/local/bin/perl
use strict;
use warnings;
use lib "$ENV{HOME}/mylib/lib/perl5";
use HTML::TableParser;
use WWW::Mechanize;
use HTML::TreeBuilder;
use LWP::Simple;

# Define debugging variable - set to positive integer to enable
my $DEBUG_FLAG = 1;

# Define variable that will contain the URL we will parse
my $URL = 'Path to URL speakers.html';

# Define our tree using HTML::TreeBuilder and parse the document
my $tree = HTML::TreeBuilder->new;
$tree->parse( get( $URL ) );

# Define our hash that will contain speaker names and their count
my %speakers;

# Look for the elements (speakers) we are searching for based on the anchor "a" tag
my @elements = $tree->look_down( _tag => "a", \&find_speakers );

# Populate our speaker hash and initialize the value to 0
for my $element ( @elements ) {
    $speakers{$element->as_text} = 0;
}

# Print list of speakers if debug mode is enabled
if ( defined $DEBUG_FLAG ) {
    foreach (sort keys %speakers) {
        print "$_\n";
    }
}

# Loop through each speaker - the goal here is to eventually count all Session and Tutorial
# links for each speaker
foreach (keys %speakers) {
    #check_sessions($_);
    # my $element = $tree->look_down( _tag => "a",
    #     sub { shift->as_text eq $_ } );
    # print $element->as_text() . "\n";
    # my @rightlist = $element->right();
    # print "@rightlist\n";
    # my $count = 0;
    # while ($element->look_down( _tag => "li", \&count_sessions ) )
    # {
    #     $count++;
    # }
    # print "$_ = $count\n";
    #$element->dump();
}

sub check_sessions {
    #print "@_\n";
    my $speaker = shift;
    my $element = $tree->look_down( _tag => "li" );
    my $parent  = $element->look_up( _tag => "a",
        sub { shift->as_text eq $speaker } );
    if (defined $parent) {
        if ( $element->as_text() =~ /[Session:]|[Tutorial:]/ ) {
            print $element->as_text() . "\n";
            return 1;
        } else {
            return 0;
        }
    } else {
        return 0;
    }
}

# find_speakers subroutine finds the 'speakers' within the HTML being parsed
# based on the source being an anchor tag, its parent tag not being a list item and
# it is within a span tag
sub find_speakers {
    my $element = shift;
    my ($parent_tag) = $element->lineage_tag_names;
    # Our parent tag should NOT be a list item and the element should be inside a 'span' tag
    $parent_tag ne 'li' && $element->look_up( _tag => 'span' );
}

# count_sessions subroutine - this was one attempt at trying to get at the Session and tutorial links
sub count_sessions {
    my $element = shift;
    print "Got to count_sessions\n";
    my ($parent_tag) = $element->lineage_tag_names;
    $parent_tag eq 'ul' && ( $element->as_text eq "Session" || $element->as_text eq "Tutorial" );
}

# in_list subroutine - not presently used
sub in_list {
    my $element = shift;
    my ($parent_tag) = $element->lineage_tag_names;
    # Our parent tag should be a list item and the element should be inside a 'span' tag
    $parent_tag eq 'li' && $element->look_up( _tag => 'span' );
}

# find_top_speakers - placeholder code for subroutine that will find our top 3 speakers
sub find_top_speakers {
}
In reply to Parse HTML page for links and count by author by Anonymous Monk
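One detail in check_sessions is worth a closer look, separate from the tree-walking question. In a Perl regex, square brackets form a character class, so /[Session:]|[Tutorial:]/ matches any text that contains a single one of those characters, not the words themselves. The following is a minimal sketch of the difference, not a solution to the assignment. The helper names is_session_or_tutorial and original_pattern_matches are hypothetical, and the sample strings are taken from the placeholder HTML in the post.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the text begins with the literal
# label "Session:" or "Tutorial:" (alternation inside a non-capturing group).
sub is_session_or_tutorial {
    my ($text) = @_;
    return $text =~ /^\s*(?:Session|Tutorial):/ ? 1 : 0;
}

# The pattern as written in check_sessions, kept for comparison.
# Each [...] is a character class, so this is true for any text that
# contains at least one of the characters S, e, s, i, o, n, ":" and so on.
sub original_pattern_matches {
    my ($text) = @_;
    return $text =~ /[Session:]|[Tutorial:]/ ? 1 : 0;
}

# Sample strings based on the placeholder HTML in the post.
my @samples = (
    'Session: Session description',
    'Tutorial: Description',
    'Click here for more info.',
);

for my $text (@samples) {
    printf "%-30s alternation=%d char_class=%d\n",
        $text,
        is_session_or_tutorial($text),
        original_pattern_matches($text);
}
```

With these three samples, only the first two satisfy the alternation, while the character-class pattern also accepts the "Click here" text because it contains an "i".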