Thanks for that!

I'll be looking at that tomorrow for certain, but I do have one question. My program takes headlines off newspaper sites. At the moment I'm using LWP::Simple's get(URL), dumping the result into an array, reading through to a certain predetermined point, and then using a regex to get the info I want.

Is HTML::TokeParser going to allow me to do that type of thing or will I have to write new "rules" to determine what is a headline and what is just a link on the page?

Thanks again!

Some people fall from grace. I prefer a running start...

Re: Using HTTP::LinkExtor to get URL and description info
by bjr (Novice) on Aug 08, 2002 at 17:45 UTC
    I would suggest the CPAN module HTML::Parser. It's pretty straightforward:
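    use HTML::Parser;

    my $in_a = 0;    # true while inside an <a> element

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [\&start, "tagname"],
        end_h       => [\&end,   "tagname"],
        # text_h sees only text; default_h would also get comments etc.
        text_h      => [\&text,  "dtext"],
    );

    # Parse a string...
    $p->parse($some_html);
    $p->eof;
    # ...or a file or open filehandle instead:
    # $p->parse_file(\*SOME_FH);

    sub start { my ($tagname) = @_; $in_a = 1 if $tagname eq 'a'; }
    sub end   { my ($tagname) = @_; $in_a = 0 if $tagname eq 'a'; }
    sub text  {
        my ($text) = @_;
        return unless $in_a;
        # do something with the link text here
    }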
    HTH. Off the top of my head. Check the HTML::Parser PoD for absolute correctness.
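
    On the HTML::TokeParser question: it comes from the same distribution as HTML::Parser, and you would still need a rule of your own to tell a headline from an ordinary link. Here is a rough sketch, not a definitive recipe. The sample markup, the extract_headlines helper, and the rule that a headline is a link inside an <h2> are all assumptions made up for illustration; each site will need its own rule.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; a real page would come from LWP::Simple's get().
my $html = <<'HTML';
<div id="nav"><a href="/sports">Sports</a></div>
<h2><a href="/story/1">Council approves new budget</a></h2>
<p>See also <a href="/about">about us</a>.</p>
<h2><a href="/story/2">Storm closes   coastal road</a></h2>
HTML

# Assumed rule: a headline is the text of a link found inside an <h2>.
sub extract_headlines {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot parse document";
    my @headlines;
    my $in_h2 = 0;

    while (my $token = $p->get_token) {
        my ($type, $tag) = @$token;
        if ($type eq 'S' && $tag eq 'h2') {
            $in_h2 = 1;
        }
        elsif ($type eq 'E' && $tag eq 'h2') {
            $in_h2 = 0;
        }
        elsif ($type eq 'S' && $tag eq 'a' && $in_h2) {
            my $attr = $token->[2];                   # hash ref of tag attributes
            my $text = $p->get_trimmed_text('/a');    # link text, whitespace collapsed
            push @headlines, { url => $attr->{href}, title => $text };
        }
    }
    return @headlines;
}

printf "%s => %s\n", $_->{url}, $_->{title} for extract_headlines($html);
```

    The navigation link and the "about us" link are skipped because they sit outside an <h2>, which replaces the "read to a pre-determined point, then regex" step with a structural test.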