jonjacobmoon has asked for the wisdom of the Perl Monks concerning the following question:

First, let me apologize for being so scarce lately. I have been unable to visit PM, and feel a little guilty for returning with what is probably a silly question.

I am trying to use HTML::Parser to extract some very specific links from a page, but I also need to know the text of each link.

So, if a link is Foo, I want to be able to know that "Foo" corresponds to the URL it links to. This has got to be easy, but it is late and my brain is numb.

I am using HTML::Parser to get the tags. Here is some code:

#!/usr/bin/perl
use strict;
use lib "/home/jon/perl"; # where BrowserEmulator is
use HTML::Parser;
use BrowserEmulator; # this gets all the text from the page

my @SSNB;

# start of ParseLink
{
    package ParseLink;
    our @ISA = qw(HTML::Parser);

    # called by parse
    sub start {
        my ($this, $tag, $attr) = @_;
        if ($tag eq "a") {
            $this->{links}{$attr->{href}} = 1;
        }
    }

    sub get_links {
        my $this = shift;
        return keys %{$this->{links}};
    }
}

my $test_url = shift;
my $string = &BrowserEmulator::getFullSource($test_url);
my $p = ParseLink->new;
$p->parse($string);
for ($p->get_links) {
    print "LINK: $_\n";
}


I admit it, I am Paco.

Replies are listed 'Best First'.
(crazyinsomniac) Re: Getting the Linking Text from a page
by crazyinsomniac (Prior) on Mar 13, 2002 at 08:18 UTC
      Thanks Crazy. That was it precisely. Totally forgot about HTML::TokeParser. I feel stupid, but I am happy to be able to move on to the next problem.

      :)
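
For anyone reading along later, here is a minimal sketch of how HTML::TokeParser can pair each href with its link text. It is not necessarily what crazyinsomniac posted; the sub name links_with_text and the sample HTML are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return a list of [href, link text] pairs
# found in an HTML string.
sub links_with_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @links;
    # get_tag("a") skips ahead to the next <a ...> start tag.
    while (my $token = $p->get_tag("a")) {
        my $href = $token->[1]{href};
        next unless defined $href;              # skip <a name="..."> anchors
        # Collect the text up to the closing </a>, ignoring nested tags.
        my $text = $p->get_trimmed_text("/a");
        push @links, [$href, $text];
    }
    return @links;
}

# Sample input, made up for illustration.
my $html = '<p><a href="http://example.com/foo"><i>Foo</i></a> and '
         . '<a href="/bar">Bar page</a></p>';

for my $link (links_with_text($html)) {
    print "LINK: $link->[0]  TEXT: $link->[1]\n";
}
```

Because get_trimmed_text reads up to the next </a>, markup nested inside the link (such as the <i> above) is skipped and only its text is kept.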


      I admit it, I am Paco.
Re: Getting the Linking Text from a page
by Corion (Patriarch) on Mar 13, 2002 at 08:31 UTC

    The approach has two steps. You get callbacks for three events: the start of a tag (start_h), the end of a tag (end_h), and another callback for any text encountered (text_h). So you need to set up a text handler that will see all text, and modify your start handler so that it increments a counter whenever it enters an <A ...> tag, and your end handler so that it decrements that counter.

    Your text handler then knows whenever it encounters text from within an anchor.

    Some untested code that should replicate what I discussed:

    # start of ParseLink
    {
        package ParseLink;
        our @ISA = qw(HTML::Parser);

        # called by parse
        sub start {
            my ($this, $tag, $attr) = @_;
            if ($tag eq "a") {
                # You might want to check for name="#anchor" links
                # here ...
                $this->{links}{$attr->{href}} = "(no text given)";
                $this->{curr_link} = $attr->{href};
                # reset, so text from an earlier link does not leak into this one
                $this->{curr_text} = "";
                $this->{nesting_a}++;
            }
        }

        sub end {
            my ($this, $tag, $attr) = @_;
            if ($tag eq "a") {
                $this->{nesting_a}--;
                $this->{links}{$this->{curr_link}} = $this->{curr_text}
                    if $this->{curr_text};
            }
        }

        # only collect text while we are inside an anchor
        sub text {
            my ($this, $text) = @_;
            $this->{curr_text} .= $text if $this->{nesting_a} > 0;
        }

        sub get_links {
            my $this = shift;
            return keys %{$this->{links}};
        }
    }
    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Getting the Linking Text from a page
by gellyfish (Monsignor) on Mar 13, 2002 at 09:15 UTC

    This should give you about 75% of what you need:

    #!/usr/bin/perl -w
    use strict;
    use HTML::Parser;

    my $parser = HTML::Parser->new(api_version => 3,
                                   start_h => [ \&start, "self,tagname,attr" ]);

    $parser->parse(<<EOFOO);
    <P><A HREF="www.url.com"><I>URL Name</I></A><FONT SIZE="+2">Blah Blah</FONT><A HREF="www.url.com/url/">Another Link</A></P>
    EOFOO

    for (@{$parser->{urls}}) {
        print "$_->[0] $_->[1]\n";
    }

    sub start {
        my ($self,$tag,$attr) = @_;
        if ( $tag eq 'a' && exists $attr->{href} ) {
            $self->{_current_url} = $attr->{href};
            $self->handler(text => sub {
                               my ( $self,$text ) = @_;
                               $self->{_current_text} .= $text;
                           }, "self, dtext");
            $self->handler( end => \&end,"self, tagname");
        }
    }

    sub end {
        my ( $self, $tag ) = @_;
        if ( $tag eq 'a' ) {
            push @{$self->{urls}},[$self->{_current_url}, $self->{_current_text}];
            delete $self->{_current_url};
            delete $self->{_current_text};
            $self->handler(text => undef);
            $self->handler(end => undef);
        }
    }

    Hope that helps.

    /J\