joealba has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm trying to make my site search highlight keywords in the search results, similar to the groups.google.com search. However, I only want to highlight these keywords if they are not in an HTML tag (like an href or a CSS style).

Here's my current (quite elementary) code:

$newline =~ s|\b$keyword\b|<font style="BACKGROUND: #ffffff">$keyword</font>|img;

I'm looking for the optimal solution, but my brain doesn't want to work today. Any ideas?

Replies are listed 'Best First'.
Re: RexExp help: Highlight keywords in CGI search results, unless inside an HTML tag
by gav^ (Curate) on Jan 15, 2002 at 03:07 UTC
    I always think that parsing HTML with regexps is a bad idea, and I go for something like HTML::Parser or HTML::TreeBuilder every time. This may or may not be overkill for your situation.
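    use strict;
    use warnings;
    use HTML::Parser;

    # Escape the keyword so it is matched literally.
    my $keyword = quotemeta 'match';

    # Wrap each whole-word occurrence of the keyword in a highlighting span.
    sub highlight {
        my $text = shift;
        $text =~ s|\b($keyword)\b|<span style="background: #FFFFFF">$1</span>|g;
        return $text;
    }

    my $html = q{
    <p id="match"><b>this is html match this</b><u value="don't match this">text html blah</u></p>
    };

    # Only text events go through highlight(); tags and everything else
    # are printed unchanged by the default handler.
    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => {
            text    => [ sub { print highlight(shift) }, 'text' ],
            default => [ sub { print shift }, 'text' ],
        },
    );
    $p->parse($html);
    $p->eof;    # flush any text the parser is still buffering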

    gav^

Re: RexExp help: Highlight keywords in CGI search results, unless inside an HTML tag
by Chady (Priest) on Jan 15, 2002 at 01:47 UTC

    Completely suboptimal and completely untested, but I think you can get away with something like this, based on your code:

    $newline =~ s|(?<!\<)$keyword(?!\>)|<font style="BACKGROUND: #ffffff">$keyword</font>|img;

    The (?<!\<) says that there must not be a < immediately before the $keyword, and the (?!\>) says that there must not be a > immediately after it... so it's not in a tag (I guess).
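
    To see what these two lookarounds actually check, here is a minimal sketch. The helper name highlight_adjacent and the sample HTML are made up for illustration, and the <b> wrapper stands in for the <font> tag; the keyword is also captured and escaped with \Q...\E, which the one-liner above does not do. The lookarounds only inspect the single character on either side of the match, so a keyword inside an attribute value still gets wrapped.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only: applies the adjacent-character
# lookarounds, capturing the match so its original capitalisation is kept.
sub highlight_adjacent {
    my ($text, $keyword) = @_;
    # (?<!<) : the character just before the match is not '<'
    # (?!>)  : the character just after the match is not '>'
    $text =~ s|(?<!<)(\Q$keyword\E)(?!>)|<b>$1</b>|ig;
    return $text;
}

# The keyword appears once inside the href attribute and once in the text.
my $html = '<a href="perl.html">Perl docs</a>';
print highlight_adjacent($html, 'perl'), "\n";
# prints: <a href="<b>perl</b>.html"><b>Perl</b> docs</a>
```

    The match inside the href is preceded by a quote and followed by a dot, so neither lookaround rejects it and the tag ends up broken.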


    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/
      There is nothing wrong with having a < somewhere before the keyword or a > somewhere after it, so long as that < is closed by a > before the keyword is reached (and vice versa for a > after it), so I don't think yours will quite work...
      I came up with the following... which seems to work at least somewhat...
      s|(\Q$text\E)(?![^<]*>)|<I>$1</I>|gi;
      which matches the text so long as it is not followed by a > before a < is seen.
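
      A small sketch of that substitution in use, assuming a made-up helper name (highlight_outside_tags) and made-up sample HTML:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: wraps each occurrence of $keyword in
# <I>...</I> unless the next angle bracket after it is a '>', which is taken
# as a sign that the match sits inside a tag.
sub highlight_outside_tags {
    my ($text, $keyword) = @_;
    # (?![^<]*>) : fail if a '>' can be reached without crossing a '<'
    $text =~ s|(\Q$keyword\E)(?![^<]*>)|<I>$1</I>|gi;
    return $text;
}

# The keyword appears in two attribute values and once in the link text.
my $html = '<a href="perl.html" title="Perl">Perl docs</a>';
print highlight_outside_tags($html, 'perl'), "\n";
# prints: <a href="perl.html" title="Perl"><I>Perl</I> docs</a>
```

      One limit worth knowing: the test is purely about which bracket comes next, so a literal > in the page text (as in "perl 5 > perl 4") makes an earlier keyword look as if it were inside a tag, and it is left unhighlighted.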

      Update \me laughs as joealba and I both post the code I gave him at the same time...

                      - Ant
                      - Some of my best work - (1 2 3)

      Thanks, Chady!

      suaveant gave me this little hunk of code that seems to be working great:
      $text =~ s|($keyword)(?![^<]*>)|<B>$1</B>|img;
      I've never used that explicit lookahead before. I love/hate it when I'm shown something so useful (when I should have known it before). I KNOW I read pp. 228-230 of Mastering Regular Expressions one day last year... :)