HTML::TokeParser help - parsing headlines

perleager has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I just decided to embark on learning everything about LWP :)

I bought the Perl & LWP book to start learning some parsing by extracting headlines from a news site. The chapter on using tokens to extract headlines uses the BBC News site as its example. That example no longer works because the HTML around each headline has changed, so I decided to use the Reuters news headlines (business section) instead. I'm having a bit of trouble with my code: it prints nothing, so I figure I'm not doing the token part right (I do have all the modules installed).

So the first thing I did was look for the headlines in the source. I found that the pattern is:

            <tr><td class="earlyHeadline"><a href="newsArticle.jhtml?type=businessNews&storyID=4511892&section=news">SEC Targets More Fortune 500 Names</a></td></tr>

...etc etc as each headline is displayed

Here's my code to extract the headlines using HTML::TokeParser:

#!/usr/bin/perl -w

use strict;
use HTML::TokeParser;
use LWP::Simple;

print "Content-type: text/html\n\n";

my $filename = 'temp.html';

open FH, ">$filename";
print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
close FH;

my $stream = HTML::TokeParser->new('$filename')
  || die "Couldn't read HTML file $filename: $!";

while(my $token = $stream->get_token) {

    if ($token->[0] eq 'S' and $token->[1] eq 'td' and
       ($token->[2]{'class'} || '') eq 'earlyHeadline') {

        my(@next) = ($stream->get_token);

        if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a' and defined $next[0][2]{'href'}) {
            # early headline found for business section/grab a href portion
            print URI->new_abs($next[0][2]{'href'}, $filename), "\n";
            next Token;
        }

    }

  }

The code looks for <td class="earlyHeadline">, then checks whether the next token is the "a href" start tag. But the line that should print the URL prints nothing =(. Can anyone point out what I'm doing wrong? Am I even on the right track?

Thanks,

Anthony


Replies are listed 'Best First'.
Re: HTML::TokeParser help - parsing headlines by Enlil (Parson) on Mar 07, 2004 at 01:41 UTC
I believe you are on the right track. The first thing I see that you are doing wrong is on the following line:

    my $stream = HTML::TokeParser->new('$filename')
      || die "Couldn't read HTML file $filename: $!";

Since `$filename` is enclosed in single quotes, it will not interpolate, so you are looking for a file literally called `$filename` instead of the just-created file called 'temp.html'.

Second, you have:

    print URI->new_abs($next[0][2]{'href'}, $filename), "\n";

but you don't have `use URI;` at the top of your file, so the package/method is missing when you call it.

Lastly, you have `next Token;`, but you don't have a label Token, and you realistically don't need the `next` there either, as the loop will immediately go on to its next iteration whether or not the `next` is there.

Upon making these three changes I believe the code will work as you intend.

-enlil
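The quoting point above is easy to check in isolation. The following is a minimal sketch using only core Perl; the variable names `$literal`, `$interpolated`, and `$plain` are made up for illustration and do not appear in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $filename = 'temp.html';

# Single quotes: no interpolation, so this is the literal text '$filename'.
my $literal = '$filename';

# Double quotes interpolate the variable's value.
my $interpolated = "$filename";

# For a lone variable, no quotes are needed at all.
my $plain = $filename;

print "single-quoted: $literal\n";
print "double-quoted: $interpolated\n";
print "unquoted:      $plain\n";
```

Passing `$filename` without any quotes, as in the last case, is the usual way to hand a file name to `HTML::TokeParser->new`.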
Re: HTML::TokeParser help - parsing headlines by Ovid (Cardinal) on Mar 07, 2004 at 04:21 UTC
If you switch to HTML::TokeParser::Simple, I think you'll be happy with how much clearer the logic is.

    use strict;
    use HTML::TokeParser::Simple;
    use LWP::Simple;
    use URI;

    my $url = 'http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews';

    my $stream = HTML::TokeParser::Simple->new(\get($url))
      || die "Couldn't read $url: $!";

    while(my $token = $stream->get_token) {
        next unless $token->is_start_tag('td')
            and ($token->return_attr('class') || '') eq 'earlyHeadline';
        my $next = $stream->get_token;
        if ($next->is_start_tag('a')) {
            print URI->new_abs($next->return_attr('href'), $url), "\n";
        }
    }

Cheers,
Ovid
Re: Re: HTML::TokeParser help - parsing headlines by perleager (Pilgrim) on Mar 07, 2004 at 09:21 UTC
Hey,

I adjusted my code and it now extracts the URLs from the Reuters headlines. However, when I print out the URLs, they look like:

    http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512094§ion=news
    http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512054§ion=news
    http://www.reuters.com/newsArticle.jhtml;jsessionid=1GRGO0RUSCREMCRBAE0CFFA?type=businessNews&storyID=4512041§ion=news

If you copy and paste one of those URLs, it brings you to a blank Reuters template, partly because the end of the URL, where it has "§ion=news", should really be "&section=news". Somehow the "&section=news" is being translated into "§ion=news". Could it be because I'm using the MIME-Base32 module and not MIME-Base64? I'm on a Windows machine.

Adjusted code:

    #!/usr/bin/perl -w

    use strict;
    use HTML::TokeParser;
    use LWP::Simple;
    use URI;

    print "Content-type: text/html\n\n";

    my $filename = 'temp.html';

    open FH, ">$filename";
    print FH get("http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews");
    close FH;

    my $stream = HTML::TokeParser->new($filename)
      || die "Couldn't read HTML file $filename: $!";

    while(my $token = $stream->get_token) {

        if ($token->[0] eq 'S' and $token->[1] eq 'td' and
           ($token->[2]{'class'} || '') eq 'earlyHeadline') {

            my(@next) = ($stream->get_token);

            if ($next[0] and $next[0][0] eq 'S' and $next[0][1] eq 'a' and defined $next[0][2]{'href'}) {
                # early headline found for business section/grab a href portion
                print URI->new_abs($next[0][2]{'href'}, 'http://www.reuters.com/'), "\n";
            }

        }

    }

Thank you,

Anthony
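One likely explanation, offered as an assumption rather than something confirmed in this thread: the script prints a `Content-type: text/html` header, so a browser treats the output as HTML and may read the `&sect` at the start of `&section` as the entity for the § character. If so, the MIME modules are not involved. Below is a minimal sketch of escaping a URL before printing it into an HTML page. `escape_html` is a hypothetical helper written here for illustration, and the URL is a shortened version of the ones above; HTML::Entities provides `encode_entities` for the same job.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
# The ampersand is handled first so the entities added afterwards
# are not escaped a second time.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Shortened example URL (the jsessionid part is left out for readability).
my $url = 'http://www.reuters.com/newsArticle.jhtml?type=businessNews&storyID=4512094&section=news';

print "Content-type: text/html\n\n";
print escape_html($url), "<br>\n";
```

Sending `Content-type: text/plain` instead would sidestep HTML interpretation altogether, which may be simpler when the output is only a list of URLs.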
Re: HTML::TokeParser help - parsing headlines by Popcorn Dave (Abbot) on Mar 07, 2004 at 01:26 UTC
Take a look at my scratchpad. There's a Perl program there that lets you see exactly what you're getting from HTML::TokeParser. You'll quickly see what tokens are assigned where and what you need to look for in the web source.

I used it when I was doing something very similar to what you're doing, parsing headlines on multiple web sites, and it made the whole process quite easy.

Hope that helps!

Update: Thanks to suggestions from b10m and graff I'm including the code here so future monks can find it in a super search.

    #!/usr/bin/perl -w

    # HTML::TokeParser dumper
    #
    # quick & dirty code to print out TokeParser output

    use strict;
    use HTML::TokeParser;
    use LWP::Simple;

    print "Content-type: text/html\n\n";

    my $filename = 'temp.html';

    open FH, ">$filename";
    print FH get("http://www.buchanie.co.uk/news.asp");
    close FH;

    my $stream = HTML::TokeParser->new($filename)
      || die "Couldn't read HTML file $filename: $!";

    while(my $token = $stream->get_token) {
        if ($token->[0] eq "S"){
            print "Token:S 1:$token->[1]\n";
            foreach my $key(keys %{$token->[2]}){
                print "Key: $key Value: ${$token->[2]}{$key}\n";
            }
            print "3: @{$token->[3]}\n4: $token->[4]\n\n";
        }
        elsif ($token->[0] eq "E"){
            print "Token:E 1:$token->[1] 2: $token->[2]\n\n";
        }
        elsif ($token->[0] eq "T"){
            print "Token:T 1:$token->[1]\n\n";
        }
        elsif ($token->[0] eq "C"){
            print "Token:C 1:$token->[1]\n\n";
        }
        elsif ($token->[0] eq "D"){
            print "Token:D 1:$token->[1]\n\n";
        }
        else {print "Unknown token $token\n\n";}
    }

There is no emoticon for what I'm feeling now.
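The same kind of token inspection can be tried without fetching anything, since `HTML::TokeParser->new` also accepts a reference to a string. The following is a minimal sketch under stated assumptions: the HTML is a simplified copy of one headline row from the original question with the href shortened, and `@start_tags` and `$text` are illustration-only names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Simplified copy of one headline row; the href is shortened for illustration.
my $html = '<tr><td class="earlyHeadline"><a href="newsArticle.jhtml?storyID=4511892">'
         . 'SEC Targets More Fortune 500 Names</a></td></tr>';

# Passing a reference to a string parses it in memory instead of opening a file.
my $stream = HTML::TokeParser->new(\$html)
    or die "Couldn't parse the HTML string";

my @start_tags;
my $text = '';

while (my $token = $stream->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        push @start_tags, $token->[1];   # element 1 is the tag name
        print "S  <$token->[1]>\n";
    }
    elsif ($type eq 'E') {
        print "E  </$token->[1]>\n";
    }
    elsif ($type eq 'T') {
        $text .= $token->[1];            # element 1 is the text itself
        print "T  $token->[1]\n";
    }
}
```

Working from a string like this makes it easier to experiment with the matching logic before pointing the script at a live page.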
Re: Re: HTML::TokeParser help - parsing headlines by graff (Chancellor) on Mar 07, 2004 at 05:36 UTC
Rather than providing a link to your scratchpad, why not post that code in some more stable wing of the Monastery (or include it in your reply), to make it a stable reference? People are likely to find this thread in a search for tips on HTML parsing at any time over the coming months or years, and you're likely to have put something else on your scratch pad by then...
Re: Re: Re: HTML::TokeParser help - parsing headlines by Popcorn Dave (Abbot) on Mar 07, 2004 at 20:17 UTC
Actually b10m suggested the same thing so I'm taking the advice of both of you and updating my node. :)

There is no emoticon for what I'm feeling now.
Re: HTML::TokeParser help - parsing headlines by sheep (Chaplain) on Mar 07, 2004 at 01:50 UTC
Hello,

One additional thing to what Enlil said: in `URI->new_abs($next[0][2]{'href'}, $filename)`, you are calling it with `$filename` as the base, but the base for your URL is "http://www.reuters.com/", not your temporary file name.

-Sheep
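The effect of the base argument can be seen in a small standalone script. This is a minimal sketch that assumes the URI module is installed; the base is the page URL from the original question, the relative link is the one from the sample headline row, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Base: the page the links were scraped from (URL taken from the original post).
my $base = 'http://www.reuters.com/newsEarlierArticles.jhtml?type=businessNews';

# Relative link as it appears in the page source.
my $href = 'newsArticle.jhtml?type=businessNews&storyID=4511892&section=news';

# new_abs resolves the relative reference against the base URL.
my $absolute = URI->new_abs($href, $base);

print "$absolute\n";
# prints http://www.reuters.com/newsArticle.jhtml?type=businessNews&storyID=4511892&section=news
```

Using the fetched page's own URL as the base, rather than a hard-coded site root, should also keep links correct when the page lives in a subdirectory.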