regex to match content not inside an HTML anchor or other tags

GregHurrell has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on some Wiki-like auto-linking code that scans text for known words or strings and when found replaces them with HTML hyperlinks. For example, if "MySQL" is in the list of known strings then the code turns that word into a hyperlink when it is found in a sentence like "using MySQL or another database".

I am trying to come up with a regex that will perform this substitution but only when the string is:

not found in between a pair of anchor tags (<a href...> ... </a>).
not inside a pair of angle brackets (for example if the string "foo" appears in the "alt" attribute of an "IMG" tag I obviously don't want to turn it into a hyperlink!).

Without these special provisions, if someone ever manually wraps the word MySQL (or a sentence containing it) inside anchor tags, I end up with nested anchor tags, which is invalid HTML.

I've seen various regexps for matching anchors or other tags, but I can't figure out how to match something that's not inside an anchor or a tag... I've tried all sorts of nasty look-behind/look-ahead stuff but nothing that works yet. Sometimes it gets so ugly that I start wondering if I have to write some kind of recursive HTML tokenizer (ugh)... Any ideas?
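
One regex-only way to approach the two conditions above is to stop trying to assert "not inside a tag" with look-arounds and instead let the pattern consume the things to be skipped first. The sketch below is in Python, since the thread itself contains no code; the pattern syntax should also be valid in PCRE. The function name `autolink`, the sample text, and the `/wiki/MySQL` URL are all made up for illustration. It assumes anchors are not nested and that attribute values contain no literal `>`.

```python
import re

def autolink(html, word, url):
    """Link `word` where it appears outside tags and outside <a>...</a>."""
    pattern = re.compile(
        r"(?is:<a\b[^>]*>.*?</a\s*>)"        # a whole anchor element (skipped)
        r"|<[^>]*>"                          # any other tag, attributes included (skipped)
        r"|\b(" + re.escape(word) + r")\b"   # the word in plain text (group 1)
    )

    def replace(match):
        # Group 1 is only set when the third alternative matched,
        # i.e. when the word was found in plain text.
        if match.group(1) is None:
            return match.group(0)            # anchor or tag: emit unchanged
        return '<a href="{}">{}</a>'.format(url, match.group(1))

    return pattern.sub(replace, html)

sample = ('using MySQL or <a href="/db">MySQL docs</a> '
          '<img alt="MySQL logo" src="x.png">')
linked = autolink(sample, "MySQL", "/wiki/MySQL")
print(linked)
```

Because the alternatives are tried left to right at each position, an anchor or tag is swallowed whole before the word alternative gets a chance to match inside it. The PHP equivalent would presumably use `preg_replace_callback`, but that is not shown in the thread.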

Re: regex to match content not inside an HTML anchor or other tags by Ido (Hermit) on Jun 27, 2005 at 10:45 UTC
Don't try to reinvent the wheel; use one of the CPAN modules for HTML parsing. You could easily use HTML::TokeParser::Simple. Someone may already have done what you're trying to do, so check out HTML::LinkAdd too.
Re: regex to match content not inside an HTML anchor or other tags by ww (Archbishop) on Jun 27, 2005 at 12:41 UTC
...perhaps a mere detail, but whence cometh the target address if you're seeking words which are NOT addresses? (The rest of your post suggests you already know this, but your example, "using MySQL", won't work on its own as a link, and if you're using a __DATA__ set or similar to provide appropriate links for specific words or phrases from an unlimited set of possibilities, you're going to have issues other than those posed here.) So show us or tell us a bit more about your efforts and algorithm.
Re: regex to match content not inside an HTML anchor or other tags by GregHurrell (Initiate) on Jun 27, 2005 at 14:44 UTC
Ok, problem solved. In the end I did use a (very simplistic) tokenizer. I didn't post source code with my original question because I'm actually working in PHP (but with Perl-compatible regular expressions). I wanted a regex-only solution, but in the end tokenizer+regex seemed to be the shortest and most robust solution. If you follow this link you'll see a more detailed explanation, and there's a link to the source. http://greghurrell.net/wp/2005/06/27/autolink-plug-in/
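
For readers who want a feel for the tokenizer+regex idea, here is a minimal sketch in Python. It is not the PHP source linked above, only one plausible shape for such a solution. The names `autolink_tokens` and `link_words` and the `links` table are hypothetical. It assumes tags never contain a literal `>` inside attribute values.

```python
import re

TAG = re.compile(r"(<[^>]*>)")  # capturing group keeps the tags in the split output

def link_words(text, links):
    """Replace every known string in a plain-text token, in a single pass."""
    if not links:
        return text
    # Longest strings first, so a longer phrase wins over a shorter prefix.
    words = sorted(links, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(
        lambda m: '<a href="{}">{}</a>'.format(links[m.group(1)], m.group(1)),
        text,
    )

def autolink_tokens(html, links):
    """Tokenize into tags and text; link only text outside <a>...</a>."""
    out = []
    anchor_depth = 0
    # With one capturing group, re.split puts text at even indices
    # and tags at odd indices.
    for index, token in enumerate(TAG.split(html)):
        if index % 2 == 1:                       # a tag: never modified
            if re.match(r"<a\b", token, re.IGNORECASE):
                anchor_depth += 1
            elif re.match(r"</a\b", token, re.IGNORECASE):
                anchor_depth = max(anchor_depth - 1, 0)
            out.append(token)
        elif anchor_depth:                       # text inside an anchor
            out.append(token)
        else:                                    # text that is safe to link
            out.append(link_words(token, links))
    return "".join(out)

links = {"MySQL": "/wiki/MySQL", "Perl": "/wiki/Perl"}  # hypothetical table
page = 'Perl and <a href="/x">MySQL</a>, <img alt="Perl"> MySQL'
result = autolink_tokens(page, links)
print(result)
```

Splitting on tags first means the word-matching regex only ever sees plain text, so it needs no look-arounds, and doing all known strings in one pass keeps freshly inserted links from being rescanned.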