tsu has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am trying to parse out 'a href' tags and their anchor text. If possible, I'd also like to get a few words on each side of the tag, but that is less important. I can use HTML::TokeParser::Simple to get a list of 'a href' tags, but I can't figure out how to get the anchor text. Does anyone have any suggestions? Thanks, ts

Re: parse out anchor text
by reneeb (Chaplain) on Jan 11, 2005 at 20:05 UTC
    I would use HTML::Parser:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my $string = qq~.:<script>text</script> .:<script>text34</script> <div class="bbcode">sammler schrieb:</div><a href="http//url.tld">test</a>~;

    my $p = HTML::Parser->new();
    # Pass the tag name, the attribute hash and the parser itself to the start handler
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($string);

    sub start_handler {
        # Only <a> start tags are of interest
        return if (shift ne 'a');
        print shift->{href};
        my $self = shift;
        my $text;
        # Remember the most recent decoded text chunk ...
        $self->handler(text => sub { $text = shift; }, "dtext");
        # ... and print it once the closing </a> is reached
        $self->handler(end => sub { print $text, "\n\n" if (shift eq 'a') }, "tagname");
    }
    It's untested.
      Thanks! It works perfectly.
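Since the question starts from HTML::TokeParser::Simple, here is a minimal sketch of the same idea with that module. It relies on get_tag and get_trimmed_text, which are inherited from HTML::TokeParser; the sub name anchor_pairs and the sample HTML are made up for illustration, and like the code above this is a sketch rather than a tested solution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: returns a list of [href, anchor text] pairs.
sub anchor_pairs {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new(\$html);
    my @pairs;
    # get_tag('a') skips ahead to the next <a ...> start tag
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->get_attr('href');
        next unless defined $href;    # e.g. <a name="..."> has no href
        # Collect the text up to the closing </a>, with whitespace trimmed
        my $text = $p->get_trimmed_text('/a');
        push @pairs, [ $href, $text ];
    }
    return @pairs;
}

# Sample input, assumed for illustration only
my $sample = q{<p>See <a href="http://example.com/">the example site</a> for details.</p>};
print "$_->[0] => $_->[1]\n" for anchor_pairs($sample);
```

Getting a few words on each side of the link would need extra bookkeeping, for example walking the stream with get_token and keeping the preceding text token; that part is not shown here.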
Re: parse out anchor text
by brian_d_foy (Abbot) on Jan 11, 2005 at 20:13 UTC

    The HTML::LinkExtractor module can get you the text in the anchor tags. There is an example in the docs which shows which options to enable.
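    A rough sketch of what that can look like, based on my reading of the HTML::LinkExtractor docs: the third argument to new is assumed to be the "strip" flag that removes markup from the _TEXT entry, and links is assumed to return an array reference of hashes with tag, href and _TEXT keys. The sub name extract_links and the sample HTML are hypothetical; check the module's POD before relying on this.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical helper: returns the link hashes for <a> tags only.
sub extract_links {
    my ($html) = @_;
    # Arguments (as I understand the docs): callback, base URL, strip flag
    my $lx = HTML::LinkExtractor->new(undef, undef, 1);
    $lx->parse(\$html);
    # The module reports other link-type tags too (img, link, ...); keep only anchors
    return grep { $_->{tag} eq 'a' } @{ $lx->links };
}

# Sample input, assumed for illustration only
my $sample = q{<p>See <a href="http://example.com/">the example site</a> for details.</p>};
for my $link (extract_links($sample)) {
    print "$link->{href} => $link->{_TEXT}\n";
}
```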

    --
    brian d foy <bdfoy@cpan.org>