Searching for text not inside a hyperlink

aemain has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks,

I've spent a fair amount of time writing fairly basic regular expressions, but this one really has me stumped. I'm writing a web encyclopedia feature in which the first occurrence of a given word is turned into a hyperlink to the encyclopedia entry. The problem is that the text being processed may already contain hyperlinks, and that some encyclopedia terms include other terms. For example, a simple regex substitution fails when asked to add a link for "Globalization" to this text:

    Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
    Globalization is rejected by...

In this case, we would want the script to skip both the URL and the already-linked text, and instead place the new link in the last sentence.

Thanks so much


Replies are listed 'Best First'.
Re: Searching for text not inside a hyperlink by valdez (Monsignor) on Jun 06, 2004 at 19:54 UTC
See HTML::TokeParser::Simple:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $open  = 0;    # greater than zero while we are inside an <a> element
    my $match = 'Globalization';
    my $uri   = 'http://example.com/glossary?globalization';

    my $p = HTML::TokeParser::Simple->new(*DATA);

    while (my $t = $p->get_token) {
        if ($t->is_start_tag('a')) {
            $open++;
            print $t->as_is;
        }
        elsif ($t->is_end_tag('a')) {
            $open--;
            print $t->as_is;
        }
        elsif ($t->is_text) {
            my $text = $t->as_is;
            # Only substitute in text that is not already part of a link.
            # Without /g, s/// links just the first match in this text chunk.
            if (not $open) {
                my $href = qq{<a href="$uri">$match</a>};
                $text =~ s/$match/$href/;
            }
            print $text;
        }
        else {
            # Comments, declarations and all other tags pass through untouched.
            print $t->as_is;
        }
    }
    __DATA__
    Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
    Globalization is rejected by...

which prints:

    Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
    <a href="http://example.com/glossary?globalization">Globalization</a> is rejected by...

This should be enough to get you started :)

Ciao, Valerio
Re: Searching for text not inside a hyperlink by Zaxo (Archbishop) on Jun 06, 2004 at 18:35 UTC
See HTML::Parser.

After Compline,
Zaxo
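
To make the HTML::Parser pointer a little more concrete, here is a minimal sketch of how it might be applied to the question. It assumes the version 3 handler API and tracks whether the parser is currently inside an `<a>` element. The sub name `link_first_unlinked`, the sample text and the URL are illustrative assumptions, not something from the original post, and the `\b` anchors assume the term begins and ends with a word character.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: link only the first occurrence of $term that is
# not already inside an <a> element, and copy everything else through.
sub link_first_unlinked {
    my ($html, $term, $uri) = @_;
    my ($out, $depth, $done) = ('', 0, 0);

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $depth++ if $tag eq 'a';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $depth-- if $tag eq 'a' && $depth > 0;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text, $is_cdata) = @_;
            # Skip script/style content, linked text, and anything after
            # the first successful substitution.
            if (!$done && !$depth && !$is_cdata) {
                $done = 1 if $text =~ s{\b(\Q$term\E)\b}{<a href="$uri">$1</a>};
            }
            $out .= $text;
        }, 'text, is_cdata' ],
        # Comments, declarations and the like are copied verbatim.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $p->unbroken_text(1);    # deliver each text run as a single chunk
    $p->parse($html);
    $p->eof;
    return $out;
}

my $sample = qq{Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">}
           . qq{Anti-globalization</a> activists are coming!\n}
           . qq{Globalization is rejected by...\n};
print link_first_unlinked($sample, 'Globalization',
                          '/encyclopedia/Globalization/index.html');
```

The `$done` flag is what restricts the link to the first unlinked occurrence in the whole document rather than the first one in each text chunk.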
Re: Searching for text not inside a hyperlink by bart (Canon) on Jun 07, 2004 at 07:33 UTC
Your problem is not searching inside HTML links, but replacing inside text you already replaced in a previous pass. So don't loop over the terms one at a time; replace everything in one go. Build a single regex from all the search terms first, and do the substitution with a hash lookup. Something like this:

    use Regex::PreSuf;

    $_ = <<'--';
    Oh no, Anti-globalization activists are coming!
    Globalization is rejected by...
    --

    # Keys are lower case so that the lookup can be case-insensitive.
    my %links = (
        'globalization'      => '/encyclopedia/Globalization/index.html',
        'anti-globalization' => '/encyclopedia/Anti-Globalization/index.html',
    );

    my $re = presuf(keys %links);
    s/($re)/<a href="$links{lc $1}">$1<\/a>/gio;
    print;

Result:

    Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
    <a href="/encyclopedia/Globalization/index.html">Globalization</a> is rejected by...

Regex::PreSuf is a module that builds a regex out of a list of words. It escapes metacharacters, so the resulting regex will always do literal lookups only.

Note how I made the replacement case-insensitive: the hash keys are lower case, the lookup uses `lc $1`, and the substitution uses the `/i` switch.
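
The one-pass idea can also be sketched in core Perl, combined with the original requirements of leaving existing links alone and linking only the first occurrence of each term. This is an illustration under stated assumptions, not a robust HTML parser: it assumes anchors are not nested and that tags contain no literal `>` inside attribute values. The sub name `add_links` and the `%links` table are hypothetical, and a plain sorted alternation stands in for Regex::PreSuf.

```perl
use strict;
use warnings;

# Hypothetical term => URL table; keys are lower case for the lookup.
my %links = (
    'globalization'      => '/encyclopedia/Globalization/index.html',
    'anti-globalization' => '/encyclopedia/Anti-Globalization/index.html',
);

# Longest terms first, each one escaped so that it matches literally.
my $terms = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %links;

sub add_links {
    my ($html) = @_;
    my %seen;    # terms already linked in this document
    $html =~ s{
        ( <a\b[^>]*> .*? </a> | <[^>]+> )    # group 1: existing link or other tag
      | (?<![\w-]) ($terms) (?![\w-])        # group 2: a whole term in plain text
    }{
        defined($1)      ? $1
      : $seen{lc $2}++   ? $2
      :                    qq{<a href="$links{lc $2}">$2</a>}
    }gisex;
    return $html;
}

print add_links(<<'--');
Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
Globalization is rejected by...
--
```

Because existing anchors and tags are consumed by the first alternative, the term alternative never gets a chance to match inside them, and everything still happens in a single substitution.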