Getting the Linking Text from a page

jonjacobmoon has asked for the wisdom of the Perl Monks concerning the following question:

First, let me apologize for being so scarce lately. I have been unable to visit PM, and I feel a little guilty for returning with what is probably a silly question.

I am trying to use HTML::Parser to extract some very specific links from a page but I also need to know what the text is from each link.

So, if a link's text is "Foo", I want to be able to know that "Foo" corresponds to the URL it links to. This has got to be easy, but it is late and my brain is numb.

I am using HTML::Parser to get the tags. Here is some code:

#!/usr/bin/perl

use strict;

use lib "/home/jon/perl"; # where BrowserEmulator is

use HTML::Parser;
use BrowserEmulator;  # this gets all the text from the page

my @SSNB;

# start of ParseLink
{
  package ParseLink;
  our @ISA = qw(HTML::Parser);

  # called by parse
  sub start
    {
      my ($this, $tag, $attr) = @_;

      if ($tag eq "a")
        {
          $this->{links}{$attr->{href}} = 1;
        }
    }

  sub get_links
    {
      my $this = shift;
      return keys %{$this->{links}};
    }
}

my $test_url = shift;

my $string = &BrowserEmulator::getFullSource($test_url);

my $p = ParseLink->new;
$p->parse($string);

for ($p->get_links)
  {
    print "LINK: $_\n";
  }

I admit it, I am Paco.



(crazyinsomniac) Re: Getting the Linking Text from a page
by crazyinsomniac (Prior) on Mar 13, 2002 at 08:18 UTC

HTML::TokeParser

HTML::TokeParser::Easy

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

my $p = new HTML::TokeParser($ARGV[0]) or die;

while(my $t = $p->get_token()) {
    if($$t[0] eq 'S' and $$t[1] eq 'a') {
       print $$t[2]->{href}, "\n",
             $p->get_trimmed_text('/a'), "\n\n";
    }
}
undef $p;
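
Since the question already holds the page source in a string, here is a minimal sketch of the same idea that reads from a scalar reference instead of a file name and uses get_tag to jump from one anchor to the next. The helper name extract_links and the sample HTML are made up for illustration; treat it as a starting point rather than a finished solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# extract_links is a hypothetical helper, not part of HTML::TokeParser.
# It takes HTML source as a string and returns a list of [href, text] pairs.
sub extract_links {
    my ($html) = @_;

    # A scalar reference makes TokeParser parse the string itself
    # instead of treating the argument as a file name.
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";

    my @links;
    # get_tag('a') skips ahead to the next <a ...> start tag.
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;    # skip <a name="..."> anchors
        push @links, [ $href, $p->get_trimmed_text('/a') ];
    }
    return @links;
}

# Sample input, invented for this example.
my $html = '<p><a href="http://example.com/foo">Foo</a> and '
         . '<a href="/bar"><b>Bar</b> baz</a></p>';

printf "%s => %s\n", @$_ for extract_links($html);
```

Returning pairs rather than a hash keeps duplicate URLs that appear with different link text, which may matter if you are hunting for very specific links.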

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


Re: (crazyinsomniac) Re: Getting the Linking Text from a page

by jonjacobmoon (Pilgrim) on Mar 13, 2002 at 10:35 UTC


Re: Getting the Linking Text from a page
by Corion (Patriarch) on Mar 13, 2002 at 08:31 UTC

This takes two steps. HTML::Parser gives you callbacks for three events: the start of a tag (start_h), the end of a tag (end_h), and any text encountered (text_h). So you need to set up a text handler that sees all text, modify your start handler so that it increments a counter whenever it enters an <A ...> tag, and add an end handler that decrements that counter.

Your text handler then knows it is looking at text from within an anchor whenever that counter is greater than zero.

Some untested code that should replicate what I discussed :

# start of ParseLink
  {
    package ParseLink;
    our @ISA = qw(HTML::Parser);

    # called by parse
    sub start
      {
        my ($this, $tag, $attr) = @_;

        if ($tag eq "a")
        {
          # You might want to check for name="#anchor" links
          # here ...
          $this->{links}{$attr->{href}} = "(no text given)";
          $this->{curr_link} = $attr->{href};
          $this->{curr_text} = "";   # reset so text from earlier links does not leak in
          $this->{nesting_a}++;
        }
      }

    # called by parse for each end tag
    sub end
      {
        my ($this, $tag) = @_;

        if ($tag eq "a")
        {
          $this->{nesting_a}--;
          $this->{links}{$this->{curr_link}} = $this->{curr_text}
            if $this->{curr_text};
        }
      }

    # called by parse for each chunk of text
    sub text {
      my ($this, $text) = @_;
      $this->{curr_text} .= $text if $this->{nesting_a} > 0;
    }

    sub get_links
      {
        my $this = shift;
        return keys %{$this->{links}};
      }
  }
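
For something you can run directly, here is a rough self-contained sketch of the same counter idea, written with api_version 3 handlers instead of a subclass. The helper name link_texts and the sample HTML are assumptions made for the example, and it skips anchors that have no href.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# link_texts is a hypothetical helper. It returns a hash reference
# mapping each href to the text found inside that <a> element.
sub link_texts {
    my ($html) = @_;
    my (%links, $href);
    my $depth = 0;     # greater than 0 while inside an <a href=...> element
    my $text  = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            ($href, $text) = ($attr->{href}, '');   # start a fresh link
            $depth++;
        }, 'tagname, attr' ],
        # Collect decoded text only while the counter says we are in a link.
        text_h => [ sub { $text .= shift if $depth > 0 }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && $depth > 0;
            $depth--;
            $links{$href} = length $text ? $text : '(no text given)';
        }, 'tagname' ],
    );
    $p->parse($html);
    $p->eof;           # flush any text the parser is still buffering
    return \%links;
}

# Sample input, invented for this example.
my $links = link_texts(
    '<p><a href="/a">First <b>link</b></a> <a href="/b"></a></p>');
print "$_ => $links->{$_}\n" for sort keys %$links;
```

Keeping the state in closures avoids storing private keys inside the parser object, at the cost of one set of handlers per call.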

perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ;    # The  
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new   #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' #  web


Re: Getting the Linking Text from a page
by gellyfish (Monsignor) on Mar 13, 2002 at 09:15 UTC

This should give you about 75% of what you need :

#!/usr/bin/perl -w

use strict;

use HTML::Parser;

my $parser = HTML::Parser->new(api_version => 3,
                               start_h => [ \&start,"self,tagname,attr" ]);

$parser->parse(<<EOFOO);
<P><A HREF="www.url.com"><I>URL Name</I></A><FONT SIZE="+2">Blah
Blah</FONT><A HREF="www.url.com/url/">Another Link</A></P>
EOFOO

for (@{$parser->{urls}})
{
  print "$_->[0]  $_->[1]\n";
}

sub start
{
   my ($self,$tag,$attr) = @_;

   if ( $tag eq 'a' && exists $attr->{href} )
   {
     $self->{_current_url} = $attr->{href};
     $self->handler(text => sub {
                                  my ( $self,$text ) = @_;
                                  
                                  $self->{_current_text} .= $text; 
                                 },
                                 "self, dtext");

     $self->handler( end => \&end,"self, tagname");
     }
}

sub end
{
    my ( $self, $tag ) = @_;
    if ( $tag eq 'a' )
    {
       push @{$self->{urls}},[$self->{_current_url},
       $self->{_current_text}];
       delete $self->{_current_url};
       delete $self->{_current_text};
   
       $self->handler(text => undef);
       $self->handler(end  => undef);
    }
}

/J\
