aemain has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks,

I've spent a fair amount of time doing somewhat basic regular expressions, but this one really has me stumped. I'm writing a web encyclopedia feature in which the first occurrence of a given word is turned into a hyperlink to the encyclopedia entry. The problem is that the text being processed may already contain hyperlinks, and that some encyclopedia terms include other terms. For example, simplistic regular expressions fail when asked to add links to Globalization in this text:

Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
Globalization is rejected by...

In this case, we would obviously want the script to ignore the url and the linked text, and instead place the new link in the last sentence.

Thanks so much

Re: Searching for text not inside a hyperlink
by valdez (Monsignor) on Jun 06, 2004 at 19:54 UTC

    See HTML::TokeParser::Simple:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $open  = 0;
    my $match = 'Globalization';
    my $uri   = 'http://example.com/glossary?globalization';

    my $p = HTML::TokeParser::Simple->new(*DATA);

    while (my $t = $p->get_token) {
        if ($t->is_start_tag('a')) {
            $open++;
            print $t->as_is;
        }
        elsif ($t->is_end_tag('a')) {
            $open--;
            print $t->as_is;
        }
        elsif ($t->is_text) {
            my $text = $t->as_is;
            if ($text =~ /$match/) {
                if (not $open) {
                    my $href = qq{<a href="$uri">$match</a>};
                    $text =~ s/$match/$href/;
                    print $text;
                }
                else {
                    print $text;
                }
            }
            else {
                print $text;
            }
        }
        else {
            print $t->as_is;
        }
    }
    __DATA__
    Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
    Globalization is rejected by...
    which prints:
    Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
    <a href="http://example.com/glossary?globalization">Globalization</a> is rejected by...
    This should be enough to get you started :)

    Ciao, Valerio
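
Building on the reply above, here is a minimal sketch of the "first occurrence only" requirement from the question. To stay dependency-free it swaps the real parser for a crude split on tags, which is an assumption of this sketch and will not cope with comments, scripts, or a `>` inside an attribute value. The helper name `link_first_outside_anchor` and the extra "Globalization again." sentence are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: link only the first occurrence of $term that is
# not already inside an <a>...</a> element.
sub link_first_outside_anchor {
    my ($html, $term, $uri) = @_;
    my $depth = 0;    # greater than 0 while inside an <a> element
    my $done  = 0;    # set once the term has been linked
    my $out   = '';

    # Crude tokenizer (an assumption of this sketch): the capturing group
    # makes split return text and tags as alternating list elements.
    for my $chunk (split /(<[^>]*>)/, $html) {
        if ($chunk =~ m{^<a\b}i) {
            $depth++;
        }
        elsif ($chunk =~ m{^</a\s*>}i) {
            $depth-- if $depth;
        }
        elsif ($chunk !~ /^</ and not $depth and not $done) {
            # \Q...\E quotes regex metacharacters in the term;
            # without /g only the first match in this chunk is replaced.
            $done = 1
                if $chunk =~ s{\b(\Q$term\E)\b}{<a href="$uri">$1</a>};
        }
        $out .= $chunk;
    }
    return $out;
}

my $html = 'Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">'
         . 'Anti-globalization</a> activists are coming! '
         . 'Globalization is rejected by... Globalization again.';

print link_first_outside_anchor(
    $html, 'Globalization', '/encyclopedia/Globalization/index.html'
), "\n";
```

The same `$done` flag could be added to the HTML::TokeParser::Simple loop above, which would be the sturdier choice for real pages.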

Re: Searching for text not inside a hyperlink
by Zaxo (Archbishop) on Jun 06, 2004 at 18:35 UTC
Re: Searching for text not inside a hyperlink
by bart (Canon) on Jun 07, 2004 at 07:33 UTC
    Your problem is not searching inside HTML links, but replacing inside what you already replaced in a previous loop. So: don't do multiple loops, instead, replace everything in one go. Build a regex with all search terms first, and do the substitution with a hash. Something like this:
    $_ = <<'--';
    Oh no, Anti-globalization activists are coming!
    Globalization is rejected by...
    --
    use Regex::PreSuf;
    %links = (
        'globalization'      => '/encyclopedia/Globalization/index.html',
        'anti-globalization' => '/encyclopedia/Anti-Globalization/index.html'
    );
    my $re = presuf(keys %links);
    s/($re)/<a href="$links{lc $1}">$1<\/a>/gio;
    print;
    Result:
    Oh no, <a href="/encyclopedia/Anti-Globalization/index.html">Anti-globalization</a> activists are coming!
    <a href="/encyclopedia/Globalization/index.html">Globalization</a> is rejected by...
    Regex::PreSuf is a module to build a regex out of a list of words. It'll escape metacharacters, so the resulting regex will always just do literal lookups.

    Do note how I made the replacement case-insensitive, making the keys of the hash lower case, and doing the lookup with lc $1 — in addition to use of the /i switch.
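
For readers without Regex::PreSuf installed, the single-pass idea can be sketched with core Perl only. This is an illustration under stated assumptions, not a drop-in replacement: it swaps in a plain `quotemeta` alternation for `presuf`, adds word boundaries, and uses a `%seen` hash so each term is linked once, as the original question asked. The helper name `link_terms_once` is hypothetical, and the sketch does not skip text that is already inside a link.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %links = (
    'globalization'      => '/encyclopedia/Globalization/index.html',
    'anti-globalization' => '/encyclopedia/Anti-Globalization/index.html',
);

# Longest terms first, so that a term which is a prefix of a longer term
# cannot shadow it; quotemeta escapes any regex metacharacters.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %links;
my $re = qr/\b($alternation)\b/i;

# Hypothetical helper: link each term at most once, in a single pass.
sub link_terms_once {
    my ($text) = @_;
    my %seen;    # lower-cased terms that have already been linked
    $text =~ s{$re}{
        $seen{lc $1}++ ? $1 : qq{<a href="$links{lc $1}">$1</a>}
    }ge;
    return $text;
}

print link_terms_once(
    'Oh no, Anti-globalization activists are coming! '
  . 'Globalization is rejected by... Globalization again.'
), "\n";
```

Because the substitution runs once over the original text, the link markup it inserts is never rescanned, which is the point made above about avoiding multiple loops.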