Re: Re: Using HTTP::LinkExtor to get URL and description info

Thanks for that!

I'll be looking at that tomorrow for certain, but I do have one question. My program takes headlines off newspaper sites. At the moment I fetch each page with LWP::Simple's get(URL), dump it into an array, read through to a certain pre-determined point, and then use a regex to get the info I want.
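To make that concrete, here is a rough sketch of the approach. The sample page, the HEADLINES START marker, and the variable names are made up for illustration; the real script gets the page from LWP::Simple's get() rather than from an inline string.

```perl
use strict;
use warnings;

# Stand-in for the page that get($url) would return (made-up sample).
my $page = <<'HTML';
<a href="/about.html">About us</a>
<!-- HEADLINES START -->
<a href="/news/1.html">Council approves budget</a>
<a href="/news/2.html">Local team wins final</a>
HTML

# Dump the page into an array, one line per element.
my @lines = split /\n/, $page;

# Skip ahead to the pre-determined point (hypothetical marker).
shift @lines while @lines && $lines[0] !~ /HEADLINES START/;

# Pull the URL and headline text out of each remaining line with a regex.
my @headlines;
for my $line (@lines) {
    if ($line =~ m{<a\s+href="([^"]+)"[^>]*>([^<]+)</a>}i) {
        push @headlines, { url => $1, text => $2 };
    }
}

print "$_->{url}\t$_->{text}\n" for @headlines;
```

This only works while each link sits on one line and the marker stays put, which is why I'm asking about a real parser.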

Is HTML::TokeParser going to allow me to do that type of thing or will I have to write new "rules" to determine what is a headline and what is just a link on the page?

Thanks again!

Some people fall from grace. I prefer a running start...

Replies are listed 'Best First'.
Re: Using HTTP::LinkExtor to get URL and description info by bjr (Novice) on Aug 08, 2002 at 17:45 UTC
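I would suggest the CPAN module HTML::Parser. It's pretty straightforward:

use HTML::Parser;

my $in_a = 0;    # true while the parser is inside an <a>...</a> element

my $p = HTML::Parser->new(start_h   => [\&start,   "tagname"],
                          end_h     => [\&end,     "tagname"],
                          default_h => [\&default, "text"]);

# Parse a string that already holds the page source...
$p->parse($some_html);
$p->eof;    # tell the parser the document is complete

# ...or read from an open filehandle instead:
# $p->parse_file(\*SOME_FH);

# Called for every start tag.
sub start {
    my ($tagname) = @_;

    $in_a = 1 if $tagname eq 'a';
}

# Called for every end tag.
sub end {
    my ($tagname) = @_;

    $in_a = 0 if $tagname eq 'a';
}

# Called for everything the other handlers don't catch, including text.
sub default {
    my ($text) = @_;

    # do something with text if $in_a
}

HTH. Off the top of my head. Check the HTML::Parser POD for absolute correctness.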
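A slightly fuller sketch of the same idea, collecting each link's URL together with its text. The sample HTML and the names @links and $current are made up for illustration. It uses a text handler with the "dtext" argspec (entity-decoded text) instead of the default handler, and the "attr" argspec to get at href. Treat it as a starting point under those assumptions, not a finished extractor.

```perl
use strict;
use warnings;
use HTML::Parser;

# Made-up sample page; a real script would pass in the fetched HTML.
my $html = <<'HTML';
<p>Intro text</p>
<a href="/news/1.html">Council <b>approves</b> budget</a>
<a href="/news/2.html">Local team wins final</a>
HTML

my @links;      # collected { url, text } pairs
my $current;    # link being read; undef while outside <a>...</a>

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&start, "tagname, attr"],
    end_h       => [\&end,   "tagname"],
    text_h      => [\&text,  "dtext"],
);

$p->parse($html);
$p->eof;    # signal end of document so any buffered text is flushed

# Start a new record when an <a> with an href opens.
sub start {
    my ($tagname, $attr) = @_;
    return unless $tagname eq 'a' && defined $attr->{href};
    $current = { url => $attr->{href}, text => '' };
}

# Text may arrive in several chunks, so append rather than assign.
sub text {
    my ($dtext) = @_;
    $current->{text} .= $dtext if $current;
}

# Store the record when the <a> closes.
sub end {
    my ($tagname) = @_;
    return unless $tagname eq 'a' && $current;
    push @links, $current;
    undef $current;
}

print "$_->{url}\t$_->{text}\n" for @links;
```

The parser only hands back every link on the page; deciding which of them are headlines would still be up to your own rules, for example by checking the URL pattern or the enclosing tag.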
