Tricky regexp...

ti-source has asked for the wisdom of the Perl Monks concerning the following question:

This will probably seem simple to you folks out there. Unfortunately, it's not for me. Here it goes...

I've got this site search engine. Works real good. Fast enough, etc. When a user selects a document or page from the list of search results, another script is supposed to highlight or mark the search query so that it pops out at them while they're reading the page.

I have a chunk of code that replaces $word with <htmlcode>$word</htmlcode>. That works just fine.

To ensure that the HTML wouldn't be screwed up, my search script also strips all HTML from the file it's searching (not writing to disk, just while it's searching). This way, the HTML code won't produce invalid results (like searching for 'font' or 'table' etc.), so the replacement afterwards shouldn't conflict with the HTML (like <<htmlcode>font</htmlcode> color=blue>some text</<htmlcode>font</htmlcode>>).

This may not sound like such an issue until one searches for something like webmaster, which is not only in pages as normal text, but also in HTML. When I search for webmaster, I get the following screwed up HTML: <a href="mailto:<htmlcode>webmaster</htmlcode>@domain.com"><htmlcode>webmaster</htmlcode>@domain.com</a>.

Anyway, to make a long story short, is there a way to negate a regexp so that a string replacement happens only when something isn't true? Like 'don't replace this text if it's inside < and > (a tag or one of its attributes)'? Or is there an easier way to do this?

Any help would be appreciated.

Jason
tisource_webmaster@yahoo.com

Edit by tye

Replies are listed 'Best First'.
Re: Tricky regexp... by Juerd (Abbot) on Dec 31, 2001 at 04:42 UTC
There are many ways of doing so. Regexes aren't the answer, because Perl doesn't allow variable-width look-behind assertions. I think the easiest way to do this is to use HTML::Parser and handle only the `text` tokens. Have your text handler do the substitutions. Another way would be stripping the HTML tags and storing them in an array or something. But matching HTML is harder than it seems, so I'd go for HTML::Parser.
`2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$`
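For illustration, the second idea above (setting the tags aside and touching only the text between them) can be sketched in core Perl with no modules. This is a minimal sketch, not Juerd's code: the helper name `highlight_outside_tags` and the sample string are assumptions, and `<[^>]*>` is a naive notion of a tag that breaks on comments, script blocks, or a `>` inside an attribute value. That fragility is the reason HTML::Parser is the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap every occurrence of $word that lies outside
# <...> tags in <htmlcode>...</htmlcode>.
sub highlight_outside_tags {
    my ($html, $word) = @_;

    # The capturing group makes split return the tags too, so the list
    # alternates between text chunks and tag chunks.
    my @chunks = split /(<[^>]*>)/, $html;

    for my $chunk (@chunks) {
        next if $chunk =~ /^<[^>]*>\z/;    # a whole tag: leave it untouched
        $chunk =~ s{(\Q$word\E)}{<htmlcode>$1</htmlcode>}gi;
    }
    return join '', @chunks;
}

my $html = '<a href="mailto:webmaster@domain.com">webmaster@domain.com</a>';
my $out  = highlight_outside_tags($html, 'webmaster');
print "$out\n";
```

With the sample link from the question, only the visible link text gets wrapped and the `href` attribute is left alone.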
Re: Tricky regexp... by thunders (Priest) on Dec 31, 2001 at 20:08 UTC
You could replace occurrences of "webmaster" that are not preceded by "mailto:" with a negative look-behind assertion like this: `$html =~ s#(?<!mailto:)(webmaster)#<htmlcode>$1</htmlcode>#gs;` For anything less specific than that, you should use a module to parse the HTML.
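As a quick check of that substitution, here is a small sketch run against the link from the question, plus a second string showing where it stops helping. Both sample strings and the variable names are assumptions made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The link from the question: the address appears in the href and as text.
my $html = '<a href="mailto:webmaster@domain.com">webmaster@domain.com</a>';
(my $marked = $html) =~ s#(?<!mailto:)(webmaster)#<htmlcode>$1</htmlcode>#g;
print "$marked\n";

# A hypothetical attribute that is not a mailto: link is still rewritten,
# because the look-behind only knows about the literal "mailto:" prefix.
my $other = '<img alt="webmaster photo">';
(my $broken = $other) =~ s#(?<!mailto:)(webmaster)#<htmlcode>$1</htmlcode>#g;
print "$broken\n";
```

The first substitution leaves the `href` intact and marks only the link text. The second still lands inside the tag, which is why anything less specific calls for a real parser.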
Re: Tricky regexp... by thunders (Priest) on Dec 31, 2001 at 21:36 UTC
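Here is a very generic example of HTML::Parser applied to your problem. Replace $html with your HTML and s///g with your pattern replacement, and you should be on your way.

    #!/usr/bin/perl -w
    use strict;
    use HTML::Parser;

    my $html = "<HTML> from a file or wherever</HTML>";
    my @parsed;

    # Collect every parser event as an [event, text] pair in @parsed
    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => { default => [\@parsed, "event,text"] },
    );
    $p->parse($html);
    $p->eof;    # signal end of input so buffered text is flushed

    # Substitute only in plain-text chunks; tags are left alone
    for (@parsed) {
        $_->[1] =~ s///g if $_->[0] eq 'text';    # placeholder: your pattern goes here
    }

    # Put the document back together
    print join '', map { $_->[1] } @parsed;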