I'm using code basically ripped from Lincoln Stein's book to get a list of links on web pages. Here's the meat:
sub start {
    my ($parser,$tag,$attr) = @_;
    $parser->{last_tag} = $tag;
    return unless $tag eq 'a';
    # Remember the href and switch on the text and end handlers for this link
    $parser->{attr} = $attr->{href};
    $parser->handler(text => \&extract, 'self,attr,dtext');
    $parser->handler(end => \&end, 'self,tagname');
}
sub end {
    my ($parser,$tag) = @_;
    undef $parser->{last_tag};
    return unless $tag eq 'a';
    # The link is closed, so switch the text and end handlers off again
    $parser->handler(text => undef);
    $parser->handler(end => undef);
}
sub extract {
    my ($parser,$attr,$text) = @_;
    if ($parser->{last_tag} eq 'a') {
        if ($parser->{attr} && $text && $text !~ /^\s*$/) {
            # Strip newlines, then store the link text followed by its href
            $text =~ s/\n//g;
            $parser->{attr} =~ s/\n//g;
            push @array, $text;
            push @array, $parser->{attr};
        }
    }
}
It seemed to work beautifully until it choked on the following bit of HTML, which is all on one line:
<font size=+1><b><A HREF="/dailyglobe2/142/metro/Plan_adopts_Romney_s_ideas_on_higher_education_restructuring+.shtml">Plan adopts Romney's ideas on higher education restructuring</a></b></font><br>
For some reason, the parser object is reading the single <a> tag in the HTML above as two <a> tags. It says the first tag contains the text "Plan adopts Romney's" and the second tag contains "ideas on higher education".
I'm stumped. Like I said, the parser seems to work on every other HTML hyperlink it finds. And it's not the plus sign in the link, because the parser works on other links that contain a plus sign.
Does anyone see why the parser is seeing two links here? Thanks.
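One possible explanation, offered as an assumption and not a confirmed diagnosis: by default HTML::Parser may hand a single run of text to the text handler in several pieces, for example when the input arrives in chunks. A text handler that pushes one entry per call would then record one link twice. The sketch below is not the book's code. It collects the text between <a> and </a> and stores the pair only when the end tag arrives. The name extract_links and the shortened URL in the usage lines are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: parse the given HTML chunks and return a flat
# list of (link text, href, link text, href, ...).
sub extract_links {
    my @chunks = @_;
    my @links;
    my ($href, $text);    # state for the <a> element currently open

    my $parser = HTML::Parser->new(
        api_version => 3,
        # On <a href=...>, remember the href and start an empty buffer.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            ($href, $text) = ($attr->{href}, '');
        }, 'tagname,attr' ],
        # Text may arrive in several pieces, so append instead of pushing.
        text_h => [ sub {
            my ($piece) = @_;
            $text .= $piece if defined $href;
        }, 'dtext' ],
        # On </a>, store the collected text and href as one pair.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && defined $href;
            s/\n//g for $text, $href;
            push @links, $text, $href if $text =~ /\S/;
            undef $href;
        }, 'tagname' ],
    );

    $parser->parse($_) for @chunks;
    $parser->eof;
    return @links;
}

# Usage: feed the link in two chunks, split in the middle of its text.
my @found = extract_links(
    q{<font size=+1><b><A HREF="/example/story+.shtml">Plan adopts Romney's },
    q{ideas on higher education restructuring</a></b></font><br>},
);
print "$_\n" for @found;
```

HTML::Parser also has an unbroken_text attribute ($parser->unbroken_text(1)) that asks for each run of text to be delivered as a single block. Collecting the text until the end tag additionally copes with links whose text is interrupted by nested tags such as <b>.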
$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop";
$nysus = $PM . $MCF;