Update: The upgrades by ar0n (see below) has b0rked my code at the moment. :) No use trying it, since it will report "no results".
So, of course I whipped up a test.
Also, someone said that if you present some code, the change is more likely to happen - not that I am so sure this code can be put to any use, but at least it demonstrates what i mean. Since I don't have access to the PM code, this CGI (which is really just an example) does a search here on perlmonks, and parses the result, then weighs the resulting titles after how many matched words it contained. Also, it weighs down replies, which are probably less interesting (you can see the whole list from the root node, and you don't know which one will be the one you want). Maybe they should even be sorted out altogether.
Queries I tried with lots of improvement include: "difference between two dates", "exec ssi" and "sort an array". Feel free to test more "common" queries, on PM and with my script. Since PM returns results in different order each time, results with the same "weight" will also be reordered upon retries.
If anyone feel like implementing this, it is the last part of the code that is of interest, the weighing code itself. I'd also probably sort on reputation if I had that easily accessible - but that would be more important for replies. The algorithm is probably not that good, but it is a start. :) I suppose there are other stuff to go on if you have access to the result of a direct SQL query too.
Anyhow, here is the code. Remember, it is something hacked together, be nice *grin*. Nah, bring it on if you have improvements:
Yes, I know. One big ugly sub is bad. I just didn't feel like fixing it at the moment. The same goes for creating links like that in the middle, it is just for the display, but should have no impact on the sorting. :)#!/usr/bin/perl -w # Small CGI program that does a normal perlmonks.org search # and weighs the result depending on number of matched words # and if it is a root node or a reply. # TODO: remove common words like 'in' and 'from'? # BUG: If search results in just one node, this will report 'none foun +d'. use strict; use CGI; use LWP::Simple; use HTML::TokeParser; # Buffering off $| = 1; my $q = CGI->new; # Print out the html print $q->header; print $q->start_html(-title => 'Perlmonks weighted search'); print $q->h1('Perlmonks weighted search'); print '<p>Search <a href="http://www.perlmonks.org">perlmonks.org</a>: +</p>'; print $q->start_form(-method => 'get', -action => $q->self_url); print $q->textfield(-name => 'query'); print $q->br, $q->br; print $q->submit(-name => 'search', -value => 'Search'); print $q->end_form; # Do we have a search? if($q->param('search') eq 'Search') { print $q->hr; my $search = $q->param('query'); # Do a search on perlmonks and weigh it: my @list = &get_weighted( $search ); # Did we get any results? if(@list) { print "<p>Weighted results for <i>$search</i>:</p>"; # Print the list print '<ul>'; foreach(@list) { print "<li>$$_[1]</li>"; } print '</ul>'; } # No results :( else { print "<p>No results were found for <i>$search</i>.</p>"; } } print $q->end_html; # The do-it-all sub sub get_weighted { # The search query: my $search = shift; # Split it up my @query = split ' ', $search; # ...and regroup (prob not necessary) my $search_words = join '%20', @query; # Fetch from the real thing: my $html = get "http://www.perlmonks.org/index.pl?node=$search_wor +ds" or die "$!"; my $parser = HTML::TokeParser->new(\$html); # Place to store our node titles my @results = (); # We only want links within a list: while($parser->get_tag('li')) { # The first tag after the <li> my $token = $parser->get_token(); # Avoid keyword nodelet next if($token->[1] !~ /^a$/i); # Get url for listing my $href = $token->[2]{href}; # Get the node title $token = $parser->get_token; # Sanity check next if($token->[0] ne 'T'); # Put together for display my $link = "<a href=\"http://www.perlmonks.org$href\">$token-> +[1]</a>"; # Add to list push @results, $link; } ################ # Here is the 'important' code: ################ # Temporary placeholder my @unsorted = (); for(0..$#results) { # Copy to two-dimensional array for later sorting: $unsorted[$_][1] = $results[$_]; # Weight list depending on how many matching words it contains +: foreach my $word (@query) { # Just ++ made it more unpredictable... $unsorted[$_][0] += ($unsorted[$_][0] || 1) if($results[$_ +] =~ /\b$word\b/i) } # Weight down the answers, not perfect # since it will not correctly take care of # Re(5) and friends. If that matters. my @re_count = $results[$_] =~ /(Re:)/gi; $unsorted[$_][0] -= @re_count; $unsorted[$_][1] .= "($unsorted[$_][0])"; # Maybe we should weight up 'Answer:' too? } # Now weigh the unsorted list: my @weighted = sort{ $$b[0] <=> $$a[0] } @unsorted; return @weighted; }
Other suggestions might include doing such a search and present the five(?) best search results when someone previews a post to SOPW. Something like: "Something similar has already been posted, maybe some of these will answer this question?" with an option to view the whole list as well? Or would the impact be too high?
While I am at it, the HTML on the search page should really have <ul> ... </ul> around the search result, and should close the <li> tags. If I may be picky. :)
Well, anyhow, maybe this could be put to some use? Otherwise, it is also available (such as it is) at http://dogandpony.perlmonk.org/cgi-bin/search_pm.pl.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Weighted results on a perlmonks search
by elusion (Curate) on May 08, 2002 at 20:58 UTC | |
|
(ar0n) Re: Weighted results on a perlmonks search
by ar0n (Priest) on May 09, 2002 at 01:22 UTC | |
by Dog and Pony (Priest) on May 09, 2002 at 11:39 UTC | |
|
I've always wanted to do this.
by boo_radley (Parson) on May 08, 2002 at 23:35 UTC | |
by Dog and Pony (Priest) on May 09, 2002 at 20:11 UTC |