PerlMonks  

Using HTTP::LinkExtor to get URL and description info

by Popcorn Dave (Abbot)
on Aug 08, 2002 at 04:13 UTC ( [id://188518]=perlquestion )

Popcorn Dave has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

Sorry if that title is a bit unwieldy, but I don't know exactly what you call the descriptive tag between the <a href> and the closing </a>, unless it's descriptive tag : ).

My question is: Is there some way to use HTTP::LinkExtor to get that particular information? I looked at the docs and played around a bit with the example and I can pull html links out all day long. What I can't do is get the descriptive information for the links.

Can someone nudge me in a direction on this if it's been done, or point me to what I may be missing here?

Thanks in advance!

Some people fall from grace. I prefer a running start...

Replies are listed 'Best First'.
Re: Using HTTP::LinkExtor to get URL and description info
by crazyinsomniac (Prior) on Aug 08, 2002 at 04:39 UTC
    You have to know your tools. HTML::LinkExtor was designed to only extract the links, not the text in between (whatever you call it, cdata or whatever).

    Demo

    use strict;
    use Data::Dumper;
    use HTML::LinkExtor;

    my $base = 'http://perlmonks.org/';
    my $stringy = q{
    <tr><td><a HREF="/index.pl?node_id=188511">How does this code work (warnings.pm)?</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
    <tr><td><a HREF="/index.pl?node_id=188509">Tk and X events</a></td> <td>by <a HREF="/index.pl?node_id=961">Anonymous Monk</a></td></tr>
    <tr><td><a HREF="/index.pl?node_id=188507">warnings::warnif etc. wise usage?</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
    <tr><td><a HREF="/index.pl?node_id=188505">52-bit numbers as floating point</a></td> <td>by <a HREF="/index.pl?node_id=80322">John M. Dlugosz</a></td></tr>
    };

    my $p = new HTML::LinkExtor(undef, $base);
    $p->parse($stringy);
    print Dumper $p->links;

    $p = new HTML::LinkExtor( sub { print Dumper($_) for @_; } , $base);
    $p->parse($stringy);
    And now for the nudge, HTML::TokeParser tutorial

    update: surprise, surprise, I've solved this one before (crazyinsomniac) Re: Getting the Linking Text from a page
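For reference, a minimal sketch of the HTML::TokeParser approach the nudge points at, pairing each href with its anchor text. The helper name `extract_links` and the sample HTML are made up for illustration; `get_tag` and `get_trimmed_text` are the module's documented methods.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return [href, text] pairs for every <a href="..."> in $html.
sub extract_links {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start tag.
    while (my $token = $p->get_tag('a')) {
        my $href = $token->[1]{href};           # attribute names arrive lowercased
        next unless defined $href;              # skip <a name="..."> anchors
        my $text = $p->get_trimmed_text('/a');  # text up to the closing </a>
        push @links, [$href, $text];
    }
    return @links;
}

# Sample input (illustrative only).
my $html = q{<td><a HREF="/index.pl?node_id=188509">Tk and X events</a></td>};
printf "%s => %s\n", @$_ for extract_links($html);
```

To run this on a fetched page, the string returned by LWP::Simple's `get` can be passed in the same way, assuming the fetch succeeded.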

     
    ______crazyinsomniac_____________________________
    Of all the things I've lost, I miss my mind the most.
    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

      Thanks for that!

      I'll be looking at that tomorrow for certain, but I do have one question. My program is taking headlines off of newspaper sites, but at the moment I'm using LWP::Simple with get(URL), dumping it into an array, then reading through to a certain pre-determined point, and then using a regex to get the info I want.

      Is HTML::TokeParser going to allow me to do that type of thing or will I have to write new "rules" to determine what is a headline and what is just a link on the page?

      Thanks again!

      Some people fall from grace. I prefer a running start...

        I would suggest the CPAN module HTML::Parser. It's pretty straightforward:
        use HTML::Parser;

        my $in_a = 0;   # true while the parser is inside an <a> element

        my $p = HTML::Parser->new(
            api_version => 3,
            start_h   => [\&start,   "tagname"],
            end_h     => [\&end,     "tagname"],
            default_h => [\&default, "text"],
        );
        $p->parse($some_html);       # parse a string, or ...
        $p->parse_file(\*SOME_FH);   # ... parse from an open filehandle

        sub start   { my ($tagname) = @_; $in_a = 1 if $tagname eq 'a'; }
        sub end     { my ($tagname) = @_; $in_a = 0 if $tagname eq 'a'; }
        sub default {
            my ($text) = @_;
            # do something with $text if $in_a
        }
        HTH. Off the top of my head. Check the HTML::Parser PoD for absolute correctness.
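A possible way to flesh out the "do something with text" step so that each link's href and text end up together. This is only a sketch: the helper name `anchor_pairs` is invented here, and it uses a `text_h` handler with the `dtext` argspec (entity-decoded text) rather than `default_h`, which is a substitution on my part.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collect [href, text] pairs with event handlers.
sub anchor_pairs {
    my ($html) = @_;
    my @pairs;
    my $current;   # the link being collected, or undef outside <a href>

    my $p = HTML::Parser->new(
        api_version => 3,
        # Start tag: open a new record when we see <a href="...">.
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            $current = [ $attr->{href}, '' ]
                if $tagname eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
        # Text: append to the open record, if there is one.
        text_h => [ sub {
            my ($text) = @_;
            $current->[1] .= $text if $current;
        }, 'dtext' ],
        # End tag: </a> closes the record.
        end_h => [ sub {
            my ($tagname) = @_;
            if ($tagname eq 'a' && $current) {
                push @pairs, $current;
                undef $current;
            }
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;       # flush any buffered text
    return @pairs;
}

print "$_->[0] => $_->[1]\n"
    for anchor_pairs(q{<a href="/index.pl?node_id=961">Anonymous Monk</a>});
```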
OT: FYI, SGML :)
by BorgCopyeditor (Friar) on Aug 08, 2002 at 05:05 UTC

    I don't know exactly what you call the descriptive tag between the <a href> and the closing </a> unless it's descriptive tag : ).

    Not that this matters, but it's the content of the anchor element. Here's more info about SGML, which is where terms like "tag" and "element" come from.

    BCE
    --Your punctuation skills are insufficient!

Re: Using HTTP::LinkExtor to get URL and description info
by hacker (Priest) on Aug 08, 2002 at 19:48 UTC
    I think you want HTML::LinkExtor..
    use strict;
    use HTML::LinkExtor;

    my $p = HTML::LinkExtor->new(\&cb, "http://www.perl.org/");

    sub cb {
        my($tag, %links) = @_;
        print "$tag @{[%links]}\n";
    }

    $p->parse_file("index.html");
      I know that I got the module name wrong, but for what I want to do, HTML::LinkExtor doesn't go far enough.

      My app filters out headlines but I want the active link and the text description. As far as I can tell from playing with HTML::LinkExtor, all I can get is the link, not the text and the link. I couldn't work out how you would pull the link text as it has no anchor tag.

      HTML::TokeParser seems to be the best way for me to go for my project. However if I've missed something with HTML::LinkExtor, let me know and I'll take another look at that.

      Some people fall from grace. I prefer a running start...

Re: Using HTTP::LinkExtor to get URL and description info
by OracleJedi (Initiate) on Aug 09, 2002 at 18:25 UTC
    HTML::TreeBuilder might be overkill for what you need, but it's simple:
    use HTML::TreeBuilder;
    use strict; # examples aren't exempt!!!

    my $parser = new HTML::TreeBuilder;
    $parser->parse($html_code_from_elsewhere);
    $parser->eof();   # signal end of input so the tree is complete

    my @links = $parser->look_down('_tag' => 'a');
    foreach my $link (@links) {
        my $href  = $link->attr('href');
        my $descr = $link->content->[0]; # Assumes only simple text contents
    }

    $parser->delete();
      I've been doing some web automation and I'm using HTML::TreeBuilder everywhere.

      I was also afraid of overkill, but when you don't need the power, you don't have to use it, and it has made a few things really easy compared to what I could do with other tools.

      Btw, I think your code could be improved in this way:

      use HTML::TreeBuilder;
      use strict; # examples aren't exempt!!!

      my $parser = new HTML::TreeBuilder;
      $parser->parse($html_code_from_elsewhere);
      $parser->eof();   # signal end of input so the tree is complete

      my @links = $parser->look_down('_tag' => 'a');
      foreach my $link (@links) {
          my $href  = $link->attr('href');
          my $descr = $link->as_text();
      }

      $parser->delete();

      This removes the assumption about only simple text contents and only gets text from the anchor element. Your code would have gotten markup elements embedded in the anchor element, like:

      <a href="..."><p class="big-and-bold">Winners!</p> for today</a>

      Fetching $link->content->[0] on the above would get you an HTML::Element.

      I know you pointed out this limitation, but I think the original Seeker might like to have the as_text() method pointed out as extracting text from HTML appears to be the thing of interest.
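A small self-contained sketch of the as_text() point, assuming a made-up helper name `links_with_text` and a sample anchor that uses an inline <b> element in place of the <p> above, so the result does not depend on how the parser handles block elements inside a link.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return [href, text] pairs; as_text() flattens
# any markup nested inside the anchor down to plain text.
sub links_with_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @pairs;
    for my $link ($tree->look_down(_tag => 'a')) {
        my $href = $link->attr('href');
        next unless defined $href;     # ignore anchors without an href
        push @pairs, [$href, $link->as_text];
    }
    $tree->delete;                     # free the tree when done
    return @pairs;
}

# Sample input (illustrative only).
my $html = q{<a href="/winners"><b class="big-and-bold">Winners!</b> for today</a>};
print "$_->[0] => $_->[1]\n" for links_with_text($html);
```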

Re: Using HTTP::LinkExtor to get URL and description info
by jdavidboyd (Friar) on Aug 14, 2002 at 13:30 UTC
    Hey, HTML::TokeParser comes with a sample script that works great.
    (In the EXAMPLES section)

    I cut it out and have been able to use it (with a slight amount of rework) to recreate all my bookmarks as a clean, sorted file.
    Dave
      Thanks, I'll check that out. However my problem is that on some of the web pages I'm parsing they have text I want to grab, then the link after that with the anchor text being just "more" so that makes it a bit more difficult.
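One way the "more" case might be handled, as a rough sketch only: walk the tokens with HTML::TokeParser, remember the text seen since the previous link, and fall back to that text when the anchor text is just "more". The helper name `headlines`, the `/^more\W*$/i` test, and the sample HTML are all assumptions; real pages would likely need extra rules.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return [href, description] pairs. The description is
# the anchor text, or the text preceding the link when the anchor says "more".
sub headlines {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my $before = '';   # text seen since the previous link
    my @found;
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'T') {
            # get_token returns raw text, so decode entities ourselves.
            $before .= decode_entities($token->[1]);
        }
        elsif ($type eq 'S' && $token->[1] eq 'a' && defined $token->[2]{href}) {
            my $href = $token->[2]{href};
            my $text = $p->get_trimmed_text('/a');
            if ($text =~ /^more\W*$/i) {
                # Anchor text is uninformative: use the preceding text instead.
                ($text = $before) =~ s/\s+/ /g;
                $text =~ s/^\s+|\s+$//g;
            }
            push @found, [$href, $text];
            $before = '';
        }
    }
    return @found;
}

# Sample input (illustrative only).
my $html = q{<p>Council approves new budget. <a href="/story/1">more</a></p>}
         . q{<p><a href="/story/2">Rain expected all week</a></p>};
print "$_->[0] => $_->[1]\n" for headlines($html);
```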

      Some people fall from grace. I prefer a running start...

Node Type: perlquestion [id://188518]
Approved by myocom
Front-paged by wil