drewboy has asked for the wisdom of the Perl Monks concerning the following question:

Hello perl gurus.. first let me explain my predicament. i am the webmaster of a search engine running on perl. when you do a query, the search.cgi script loads a template called search_results.html. in that template are tags that look like <%tag%>. one particular tag is troubling me, which is <%search_results%>. say that you search for "perl", and say that we get 5 results. by default, all instances of 'perl' are bolded with <b> tags. the format would be:

Perl Monks
Description: Website for seekers of perl wisdom.
http://www.perlmonks.com
Category: computers and internet >> programming >> perl

the bolding process is done within search.cgi, and looks like:

    # This reg expression will do the trick, and doesn't bold things inside <> tags such as
    # URL's.
    $link_results =~ s,(<[^>]+>)|(\Q$term\E),defined($1) ? $1 : "<b>$2</b>",gie;

one problem is when you search for the words 'description' or 'category'. the link results thing also bolds Description: and Category:, which i don't want. to paint the picture clearly:

Yahoo!
Description: Directory of websites each with a description and classified by category.
http://www.yahoo.com
Category: computers and internet >> search engines

each link result also depends on a template called searchlink.html.

it so happened that search.cgi takes it so literally that it bolds everything at its own discretion.

I wonder if there is anything i can do to have only the vital information bolded. actually i would only want the title and description matches bolded and nothing else (not even in the category or url). i believe the code i provided bolds something based on a pattern, but apparently it doesn't work, and i don't understand how it works except that it looks for a certain pattern and bolds it. i wonder if i could do something similar. an explanation would be greatly appreciated too. thanks, looking forward to your help :-)

drewboy

Replies are listed 'Best First'.
Re: Replacement based on pattern
by scain (Curate) on Sep 25, 2001 at 00:54 UTC
    OK, first here is what your substitution is doing:
    s,(<[^>]+>)|(\Q$term\E),defined($1) ? $1 : "<b>$2</b>",gie
    The commas are the delimiters, so the pattern being searched for is what sits between the first and second commas. It looks for either a tag (a < followed by one or more characters that are not >, followed by a >), which is captured in $1, or the contents of $term, which is captured in $2. Then it does the substitution: if it found a tag, it puts it back exactly as it found it; if it found $term, it replaces it with the matched text wrapped in bold tags. The gie at the end makes it global (the replacement happens everywhere in the string, not just at the first match), case insensitive, and evaluates the replacement part as code so that the conditional can run. Presumably it matches tags first so that no occurrence of $term inside a tag gets replaced.
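    To watch the substitution at work, here is a minimal sketch run on a made-up result string. The sample HTML and the search term are assumptions for illustration, not taken from search.cgi; only the s,,,gie line itself comes from the question.

```perl
use strict;
use warnings;

# Hypothetical sample of one formatted link result (not from search.cgi).
my $link_results = '<a href="http://www.perlmonks.com/perl">Perl Monks</a> Description: seekers of perl wisdom';
my $term = 'perl';

# Alternative 1 captures a whole tag in $1; alternative 2 captures the term in $2.
# With /e the replacement is Perl code: tags are put back untouched,
# matched terms are wrapped in <b>...</b>.
$link_results =~ s,(<[^>]+>)|(\Q$term\E),defined($1) ? $1 : "<b>$2</b>",gie;

print "$link_results\n";
```

    The "perl" inside the href is left alone because the whole tag is consumed by the first alternative, while "Perl" in the title and "perl" in the description are both wrapped. A search for "description" would wrap the Description: label in the same way, which is the behaviour the question complains about.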

    On to your question: to get it to work the way you want, you will have to run this substitution on just the part that you want changed. Presumably, right now, $link_results contains everything that search.cgi is working on. You will want to break that up so that you can run the substitution only on the parts you want. There are several ways to do that, and probably some that make more sense in the context of what else search.cgi is doing. If you want to copy and paste more of the code around where $link_results is generated, we could probably suggest something.
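    One way this could look, as a rough sketch only: bold the title and description while they are still separate values, and only then assemble the result. The %hit hash, its field names, and the bold_terms helper are all hypothetical; how search.cgi really stores a link is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical record layout; the real field names in search.cgi may differ.
my %hit = (
    title       => 'Yahoo!',
    description => 'Directory of websites each with a description and classified by category.',
    url         => 'http://www.yahoo.com',
    category    => 'computers and internet >> search engines',
);
my @search_terms = ('description', 'category');

# Hypothetical helper: wrap each search term in <b> tags,
# skipping anything that sits inside <...>.
sub bold_terms {
    my ($text, @terms) = @_;
    for my $term (@terms) {
        $text =~ s,(<[^>]+>)|(\Q$term\E),defined($1) ? $1 : "<b>$2</b>",gie;
    }
    return $text;
}

# Bold only the fields that should be highlighted, then assemble the result.
my $title       = bold_terms($hit{title},       @search_terms);
my $description = bold_terms($hit{description}, @search_terms);

my $link_results = join "\n",
    $title,
    "Description: $description",
    $hit{url},
    "Category: $hit{category}";

print "$link_results\n";
```

    Because the labels, the URL and the category line are added after the substitution has run, they can never be bolded, whatever the search terms are.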

    Good luck,
    Scott

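    PS: I meant to include when writing this that the method of searching for tags, i.e. (<[^>]+>), is not particularly robust: a literal > inside an attribute value or a comment ends the match early. Writing it as (<[^>]+?>) to ask for a minimal match does no harm, but it matches exactly the same text, because [^>] can never run past the first >. You may have enough control over what is in $link_results to be sure it works; nevertheless, use with caution.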
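    A quick check of the two tag patterns, using made-up strings. Since [^>] cannot match a >, both the greedy and the non-greedy form stop at the first > after the <, and both are fooled by a > inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical input strings for illustration.
my $simple = '<a href="x.html">link</a>';
my ($greedy) = $simple =~ /(<[^>]+>)/;
my ($lazy)   = $simple =~ /(<[^>]+?>)/;
print "greedy: $greedy\nlazy:   $lazy\n";

# A literal > inside an attribute value ends the "tag" early in both forms.
my $tricky = '<img alt="a > b" src="x.png">';
my ($cut) = $tricky =~ /(<[^>]+>)/;
print "cut:    $cut\n";
```

    The first two lines print the same opening tag, and the third shows the tag cut off at the > inside the alt text.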

Re: Replacement based on pattern
by tachyon (Chancellor) on Sep 25, 2001 at 15:17 UTC

    The (..)|(..) bits in the first part capture into either $1 or $2. As we simply replace $1 with $1 this has the effect of matching tags so you don't substitute text within them. Your regex can effectively be distilled down to: $link_results =~ s,(\Q$term\E),<b>$1</b>,gi;

    So all it does is put bold tags around whatever is in $term. It is case insensitive courtesy of the /i modifier. The \Q activates quotemeta to escape regex specials in $term. The \E is not required in this case but deactivates quotemeta. See perlman:perlfunc. Here is a little widget to do the sort of thing you want. Just push all the terms you want to bold into @terms.

        # test string
        my $link_results = '<p>Hello: World Hello hello drewboy Drewboy: Drewboy!</p>';
        # define an array of the terms we want to bold
        my @terms = ( 'Hello:', 'Drewboy!' );
        # make all the terms regex safe by quotemeta-ing them
        $_ = quotemeta $_ for @terms;
        # join all terms with a pipe | so we find any of them - alternation
        my $bold = join '|', @terms;
        # make all the subs - case sensitive and global
        $link_results =~ s#(<[^>]+?>)|($bold)#$1 ? $1 : "<b>$2</b>"#eg;
        # proof is in da pudding
        print $link_results;

    To avoid bolding where you don't want it, we switch off case insensitivity and insist on the punctuation, which is apparently present. You could also add the \b or \B boundary assertions to help ensure that you only match the desired term. I'll leave that as an exercise for you. Using HTML::Parser is a more robust way to get at the text outside of tags for processing.
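    For the \b exercise, a small sketch on a made-up sentence. The text and the single term are assumptions for illustration. Note that \b only helps when the term starts and ends with a word character; a term such as 'Hello:' followed by a space would never match with a trailing \b, because there is no word boundary between the colon and the space.

```perl
use strict;
use warnings;

# Hypothetical text and terms for illustration.
my $text  = 'Go ahead, use your head: the header says head.';
my @terms = map { quotemeta($_) } ('head');
my $bold  = join '|', @terms;

# \b on both sides only matches "head" as a whole word,
# so "ahead" and "header" are left alone.
(my $whole = $text) =~ s/\b($bold)\b/<b>$1<\/b>/g;

# Without \b every occurrence of the letters h-e-a-d is wrapped.
(my $any = $text) =~ s/($bold)/<b>$1<\/b>/g;

print "$whole\n$any\n";
```

    The first line of output bolds only the two standalone occurrences of "head"; the second also bolds the "head" inside "ahead" and "header".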

    Update

    Corrected a technical inexactitude ;-) thanks to scain

    Here is an atonement - this is how you do it right using HTML::Parser. We define a hash of tags where the text they contain is OK for substitution. We make our substitution array as before. We then use the power of Parser to selectively make some substitutions - only in the text between the selected tags and absolutely positively not in the tags themselves.

        package Filter;
        use strict;
        use base 'HTML::Parser';

        my ($filter, $sub_OK);
        my @ok_tags = qw ( h1 h2 h3 h4 p );
        my %ok_tags;
        $ok_tags{$_}++ for @ok_tags;
        my @terms = ( 'head', 'Parser' );
        $_ = quotemeta $_ for @terms;
        my $bold = join '|', @terms;

        sub start {
            my ($self, $tag, $attr, $attrseq, $origtext) = @_;
            $sub_OK = exists $ok_tags{$tag} ? 1 : 0;
            $filter .= $origtext;
        }

        sub text {
            my ($self, $text) = @_;
            $text =~ s#\b($bold)\b#<b>$1</b>#g if $sub_OK;
            $filter .= $text;
        }

        sub comment {
            # uncomment to not strip comments
            # my ($self, $comment) = @_;
            # $filter .= "<!-- $comment -->";
        }

        sub end {
            my ($self, $tag, $origtext) = @_;
            $filter .= $origtext;
        }

        my $parser = new Filter;
        my $html = join '', <DATA>;
        $parser->parse($html);
        $parser->eof;
        print $html;
        print "\n\n------------------------\n\n";
        print $filter;

        __DATA__
        <html>
        <head>
        <title>Title</title>
        </head>
        <body>
        <h1>Hello Parser</h1>
        <p>You need HTML::Parser to get ahead</p>
        <p>So use your head
        <h2>Parser rocks my head!</h2>
        <a href="html.head.parser.com">html.head.parser.com</a>
        <hr>
        <pre>
        use HTML::Parser;
        head
        </pre>
        <!-- HTML PARSER ROCKS MY HEAD! -->
        </body>
        </html>

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Replacement based on pattern
by drewboy (Sexton) on Sep 25, 2001 at 23:20 UTC
    thanks, scain! tachyon, that is quite a lot of code to digest! but thanks also. i don't understand how it works, because i don't see $link_results anywhere, or maybe i could substitute $link_results for one of the variables you wrote? should it be $filter? here is some more code that might interest you:

    a chunk from search.cgi:

        # If we want to bold the search terms...
        if ($search_bold) {
            foreach $term (@search_terms) {
                # This reg expression will do the trick, and doesn't bold things inside <> tags such as
                # URL's.
                $link_results =~ s,(<[^>]+>)|(\Q$term\E),defined($1) ? $1 : "<b>$2</b>",gie;
            }
        }

    $search_bold is defined in a file called links.cfg. if it is set to 1, bolding is activated; if it is set to 0, bolding is off. Further down in search.cgi, it says:

            # Print out the HTML results.
            &site_html_search_results;
        }

    That sub could be found in another file called site_html_templates:

        sub site_html_search_results {
            # ---------------------------------------------
            # This routine displays the search results.
            #
            my $term = &urlencode ($in{'query'});
            &html_print_headers;
            print &load_template ('search_results.html', {
                lookup           => $lookup_results,
                antsym           => $lookup_antsym,
                term             => $term,
                ignored_word     => $ignored_word,
                link_results     => $link_results,
                category_results => $category_results,
                prev             => $prev,
                next             => $next,
                cat_hits         => $cat_hits,
                cat              => $cat_clean,
                link_hits        => $link_hits,
                title_linked     => $title_linked,
                displaying       => $displaying,
                results_page     => $results_page,
                csearch_results  => $csearch_results,
                %in,
                %globals
            });
        }

    So, from what you suggested, I'm guessing that I may have to revise three different files, but i've no idea how to go about doing that! Sorry that i wasn't able to post the code immediately. I hope i didn't waste your effort. thanks, i really really appreciate your help!