Using HTTP::LinkExtor to get URL and description info

Popcorn Dave has asked for the wisdom of the Perl Monks concerning the following question:


Re: Using HTTP::LinkExtor to get URL and description info
by crazyinsomniac (Prior) on Aug 08, 2002 at 04:39 UTC

Demo

use strict;
use warnings;
use Data::Dumper;
use HTML::LinkExtor;

my $base = 'http://perlmonks.org/';
my $stringy = q{
 <tr><td><a HREF="/index.pl?node_id=188511">How does this code work (warnings.pm)?</a></td> <td>by  <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
 <tr><td><a HREF="/index.pl?node_id=188509">Tk and X events</a></td> <td>by  <a HREF="/index.pl?node_id=961">Anonymous Monk</a></td></tr>
 <tr><td><a HREF="/index.pl?node_id=188507">warnings::warnif etc. wise usage?</a></td> <td>by  <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
 <tr><td><a HREF="/index.pl?node_id=188505">52-bit numbers as floating point</a></td> <td>by  <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
};

# Without a callback, the links are collected and returned by links().
my $p = HTML::LinkExtor->new(undef, $base);

$p->parse($stringy);
$p->eof;

print Dumper $p->links;

# With a callback, each link is handed over as (tag, attr => url, ...).
$p = HTML::LinkExtor->new( sub { print Dumper($_) for @_; } , $base);

$p->parse($stringy);
$p->eof;

HTML::TokeParser tutorial

Update: surprise, surprise, I've solved this one before: (crazyinsomniac) Re: Getting the Linking Text from a page

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


Re: Re: Using HTTP::LinkExtor to get URL and description info

by Popcorn Dave (Abbot) on Aug 08, 2002 at 05:58 UTC

I'll be looking at that tomorrow for certain, but I do have one question. My program takes headlines off newspaper sites. At the moment I fetch each page with LWP::Simple's get(URL), dump it into an array, read through to a pre-determined point, and then use a regex to get the info I want.

Is HTML::TokeParser going to allow me to do that type of thing or will I have to write new "rules" to determine what is a headline and what is just a link on the page?

Thanks again!

Some people fall from grace. I prefer a running start...


Re: Using HTTP::LinkExtor to get URL and description info

by bjr (Novice) on Aug 08, 2002 at 17:45 UTC

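use strict;
use warnings;
use HTML::Parser;

my $in_a = 0;    # true while the parser is inside an <a> element

# Placeholder input; substitute the HTML you actually want to parse.
my $some_html = '<a href="/x">link text</a>';

my $p = HTML::Parser->new(api_version => 3,
                          start_h => [\&start, "tagname"],
                          end_h => [\&end, "tagname"],
                          # default_h sees every event without its own
                          # handler, which is chiefly text
                          default_h => [\&default, "text"]);

$p->parse($some_html);
$p->eof;
# Or, to read from an open filehandle instead:
# $p->parse_file(\*SOME_FH);

sub start {
    my ($tagname) = @_;

    $in_a = 1 if $tagname eq 'a';
}

sub end {
    my ($tagname) = @_;

    $in_a = 0 if $tagname eq 'a';
}

sub default {
    my ($text) = @_;

    # do something with text if $in_a
}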

OT: FYI, SGML :)
by BorgCopyeditor (Friar) on Aug 08, 2002 at 05:05 UTC

I don't know exactly what you call the descriptive tag between the <a href> and the closing </a> unless it's descriptive tag : ).

Not that this matters, but it's the content of the anchor element. Here's more info about SGML, which is where terms like "tag" and "element" come from.

BCE
--Your punctuation skills are insufficient!


Re: Using HTTP::LinkExtor to get URL and description info
by hacker (Priest) on Aug 08, 2002 at 19:48 UTC

HTML::LinkExtor

use strict;
use HTML::LinkExtor;
my $p = HTML::LinkExtor->new(\&cb, "http://www.perl.org/");
sub cb {
        my($tag, %links) = @_;
        print "$tag @{[%links]}\n";
}
$p->parse_file("index.html");

Re: Re: Using HTTP::LinkExtor to get URL and description info

by Popcorn Dave (Abbot) on Aug 08, 2002 at 20:18 UTC

My app filters out headlines but I want the active link and the text description. As far as I can tell from playing with HTML::LinkExtor, all I can get is the link, not the text and the link. I couldn't work out how you would pull the link text as it has no anchor tag.

HTML::TokeParser seems to be the best way for me to go for my project. However, if I've missed something with HTML::LinkExtor, let me know and I'll take another look at it.
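
For readers following along, here is a minimal sketch of how HTML::TokeParser can return both the URL and the link text. The sample markup and the names $html and @found are assumptions made for illustration; in a real program the markup would be the page source fetched with LWP::Simple's get().

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup, assumed for illustration only.
my $html = <<'HTML';
<tr><td><a href="/index.pl?node_id=188509">Tk and X events</a></td>
<td>by <a href="/index.pl?node_id=961">Anonymous Monk</a></td></tr>
HTML

# Passing a scalar reference makes the parser read from the string.
my $p = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @found;    # list of [ href, link text ] pairs
while (my $token = $p->get_tag('a')) {
    my $href = $token->[1]{href};      # attribute names are lowercased
    next unless defined $href;         # skip anchors that carry no href
    my $text = $p->get_trimmed_text('/a');   # text up to the closing </a>
    push @found, [ $href, $text ];
}

print "$_->[0]\t$_->[1]\n" for @found;
```

Unlike HTML::LinkExtor with a base URL, this sketch leaves relative links as they appear in the page. Deciding which links are headlines would still need rules of your own, for example by only collecting links after a known marker in the page.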

Some people fall from grace. I prefer a running start...


Re: Using HTTP::LinkExtor to get URL and description info
by OracleJedi (Initiate) on Aug 09, 2002 at 18:25 UTC

use HTML::TreeBuilder;
use strict;  # examples aren't exempt!!!

# $html_code_from_elsewhere is assumed to already hold the page source.
my $parser = HTML::TreeBuilder->new;
$parser->parse($html_code_from_elsewhere);
$parser->eof;  # signal end of input so the tree is completed

my @links = $parser->look_down('_tag' => 'a');
foreach my $link (@links) {
   my $href = $link->attr('href');
   my $descr = $link->content->[0];  # Assumes only simple text contents
}
$parser->delete();

Re: Re: Using HTTP::LinkExtor to get URL and description info

by jordanh (Chaplain) on Aug 10, 2002 at 16:05 UTC

I was also afraid that HTML::TreeBuilder would be overkill, but when you don't need the power, you don't have to use it, and it has made a few things really easy compared to what I could do with other tools.

Btw, I think your code could be improved in this way:

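use HTML::TreeBuilder;
use strict;  # examples aren't exempt!!!

# $html_code_from_elsewhere is assumed to already hold the page source.
my $parser = HTML::TreeBuilder->new;
$parser->parse($html_code_from_elsewhere);
$parser->eof;  # signal end of input so the tree is completed

my @links = $parser->look_down('_tag' => 'a');
foreach my $link (@links) {
   my $href = $link->attr('href');
   my $descr = $link->as_text();  # all text inside the anchor, markup stripped
}
$parser->delete();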

This removes the assumption about only simple text contents and only gets text from the anchor element. Your code would have gotten markup elements embedded in the anchor element, like:

<a href="..."><p class="big-and-bold">Winners!</p> for today</a>

Fetching $link->content->[0] on the above would get you an HTML::Element object rather than a string.

I know you pointed out this limitation, but I think the original Seeker might like to have the as_text() method pointed out as extracting text from HTML appears to be the thing of interest.
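
To make the difference concrete, here is a small sketch comparing the two approaches on an anchor with nested markup. The sample markup (using a <b> element rather than the <p> above) and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample anchor with nested markup, assumed for illustration only.
my $html = '<a href="/winners"><b>Winners!</b> for today</a>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is completed

# look_down returns every match in list context; take the first.
my ($link) = $tree->look_down(_tag => 'a');

my $first = $link->content->[0];    # first child node of the anchor
my $text  = $link->as_text;         # text of the anchor and its descendants

print "first child is ", (ref $first ? "an element object" : "a plain string"), "\n";
print "as_text gives: $text\n";

$tree->delete;
```

Here the first child is the nested element, so content->[0] is only useful as a description when the anchor holds nothing but plain text.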


Re: Using HTTP::LinkExtor to get URL and description info
by jdavidboyd (Friar) on Aug 14, 2002 at 13:30 UTC


Don't ask to ask, just ask