parse out anchor text

tsu has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am trying to parse out 'a href' tags and their anchor text. Also, if possible, I'd like to get a few words on each side of the tag, but that is less important. I am able to use HTML::TokeParser::Simple to get a list of 'a href' tags, but I cannot figure out how to get the anchor text. Does anyone have any suggestions? Thanks, ts

Re: parse out anchor text by reneeb (Chaplain) on Jan 11, 2005 at 20:05 UTC
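I would use HTML::Parser:

    #! /usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my $string = qq~.:<script>text</script> .:<script>text34</script> <div class="bbcode">sammler schrieb:</div><a href="http//url.tld">test</a>~;

    my $p = HTML::Parser->new();
    # Call start_handler for every start tag, passing the tag name,
    # the attribute hash and the parser object itself.
    $p->handler(start => \&start_handler, "tagname,attr,self");
    $p->parse($string);
    $p->eof;    # signal end of input so any buffered text is flushed

    sub start_handler {
        return if (shift ne 'a');    # only <a> tags are of interest
        print shift->{href};         # the href attribute of the link
        my $self = shift;
        my $text;
        # Remember the most recent (entity-decoded) text chunk ...
        $self->handler(text => sub { $text = shift; }, "dtext");
        # ... and print it once the closing </a> is reached.
        $self->handler(end => sub { print $text, "\n\n" if (shift eq 'a') }, "tagname");
    }

It's untested.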
Re^2: parse out anchor text by tsu (Novice) on Jan 11, 2005 at 20:15 UTC
Thanks! It works perfectly.
Re: parse out anchor text by brian_d_foy (Abbot) on Jan 11, 2005 at 20:13 UTC
The HTML::LinkExtractor module can get you the text in the anchor tags. There is an example in the docs which shows which options to enable. -- brian d foy <bdfoy@cpan.org>
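
Since the original question started from HTML::TokeParser::Simple, a minimal sketch of how the anchor text might be reached with that module is shown here. It relies on get_trimmed_text, which HTML::TokeParser::Simple inherits from HTML::TokeParser. The sample HTML, the extract_links helper and the example.com URL are assumptions made up for illustration; they do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input, not taken from the thread.
my $html = '<p>See the <a href="http://example.com/">example site</a> for details.</p>';

# Hypothetical helper: returns a list of { href => ..., text => ... } hashes,
# one for each <a> tag that carries an href attribute.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my @links;
    while (my $token = $parser->get_token) {
        next unless $token->is_start_tag('a');
        my $href = $token->get_attr('href');
        next unless defined $href;    # skip named anchors without an href
        # Collect the text up to the closing </a>, with whitespace collapsed.
        my $text = $parser->get_trimmed_text('/a');
        push @links, { href => $href, text => $text };
    }
    return @links;
}

for my $link (extract_links($html)) {
    print "$link->{href}\t$link->{text}\n";
}
```

Reading the text with get_trimmed_text right after the start tag keeps the link and its anchor text together in one pass, so no state has to be shared between separate handlers.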