Weighted results on a perlmonks search

Lately, with Super search being not what it was, I've more and more resorted to searching via the "basic search" - the box at the top of the page. I've found that searching that way often brings me the nodes I'd like to see, since *often* the good nodes have the right keywords in the title. However, it presents the result in an unsorted fashion, which can make the node you want appear very far down the list. So I started thinking, why not weigh the results for relevance? One of the hopes would also be less "asked and answered" for those that actually try to search first, and gives up on the long list.

Update: The upgrades by ar0n (see below) has b0rked my code at the moment. :) No use trying it, since it will report "no results".

So, of course I whipped up a test.

Also, someone said that if you present some code, the change is more likely to happen - not that I am so sure this code can be put to any use, but at least it demonstrates what i mean. Since I don't have access to the PM code, this CGI (which is really just an example) does a search here on perlmonks, and parses the result, then weighs the resulting titles after how many matched words it contained. Also, it weighs down replies, which are probably less interesting (you can see the whole list from the root node, and you don't know which one will be the one you want). Maybe they should even be sorted out altogether.

Queries I tried with lots of improvement include: "difference between two dates", "exec ssi" and "sort an array". Feel free to test more "common" queries, on PM and with my script. Since PM returns results in different order each time, results with the same "weight" will also be reordered upon retries.

If anyone feel like implementing this, it is the last part of the code that is of interest, the weighing code itself. I'd also probably sort on reputation if I had that easily accessible - but that would be more important for replies. The algorithm is probably not that good, but it is a start. :) I suppose there are other stuff to go on if you have access to the result of a direct SQL query too.

Anyhow, here is the code. Remember, it is something hacked together, be nice *grin*. Nah, bring it on if you have improvements:

#!/usr/bin/perl -w

# Small CGI program that does a normal perlmonks.org search
# and weighs the result depending on number of matched words
# and if it is a root node or a reply.


# TODO: remove common words like 'in' and 'from'?

# BUG: If search results in just one node, this will report 'none foun
+d'.

use strict;
use CGI;
use LWP::Simple;
use HTML::TokeParser;

# Buffering off
$| = 1;

my $q = CGI->new;

# Print out the html

print $q->header;

print $q->start_html(-title => 'Perlmonks weighted search');

print $q->h1('Perlmonks weighted search');

print '<p>Search <a href="http://www.perlmonks.org">perlmonks.org</a>:
+</p>';

print $q->start_form(-method => 'get', -action => $q->self_url);

print $q->textfield(-name => 'query');

print $q->br, $q->br;

print $q->submit(-name => 'search', -value => 'Search');

print $q->end_form;


# Do we have a search?
if($q->param('search') eq 'Search')
{
    print $q->hr;

    my $search = $q->param('query');

    # Do a search on perlmonks and weigh it:
    my @list = &get_weighted( $search );
    
    # Did we get any results?
    if(@list)
    {
        print "<p>Weighted results for <i>$search</i>:</p>";

        # Print the list
        print '<ul>';
        foreach(@list)
        {
            print "<li>$$_[1]</li>";
        }
        print '</ul>';
    }

    # No results :(
    else
    {
        print "<p>No results were found for <i>$search</i>.</p>";    
    }
}

print $q->end_html;


# The do-it-all sub

sub get_weighted
{
    # The search query:
    my $search = shift;
    
    # Split it up
    my @query = split ' ', $search;

    # ...and regroup (prob not necessary)
    my $search_words = join '%20', @query;


    # Fetch from the real thing:
    my $html = get "http://www.perlmonks.org/index.pl?node=$search_wor
+ds" or die "$!";


    my $parser = HTML::TokeParser->new(\$html);

    # Place to store our node titles
    my @results = ();

    # We only want links within a list:
    while($parser->get_tag('li'))
    {
        # The first tag after the <li>
        my $token = $parser->get_token();

        # Avoid keyword nodelet
        next if($token->[1] !~ /^a$/i);

        # Get url for listing
        my $href = $token->[2]{href};

        # Get the node title
        $token = $parser->get_token;

        # Sanity check
        next if($token->[0] ne 'T');

        # Put together for display
        my $link = "<a href=\"http://www.perlmonks.org$href\">$token->
+[1]</a>";

        # Add to list
        push @results, $link;    

    }


    ################

    # Here is the 'important' code:

    ################

    # Temporary placeholder
    my @unsorted = ();

    for(0..$#results)
    {
        # Copy to two-dimensional array for later sorting:
        $unsorted[$_][1] = $results[$_];

        # Weight list depending on how many matching words it contains
+:
        foreach my $word (@query)
        {
            # Just ++ made it more unpredictable...
            $unsorted[$_][0] += ($unsorted[$_][0] || 1) if($results[$_
+] =~ /\b$word\b/i)
        }

        # Weight down the answers, not perfect
        # since it will not correctly take care of
        # Re(5) and friends. If that matters.
        my @re_count = $results[$_] =~ /(Re:)/gi;
        $unsorted[$_][0] -= @re_count;

        $unsorted[$_][1] .= "($unsorted[$_][0])";

        # Maybe we should weight up 'Answer:' too?
    }
    
    # Now weigh the unsorted list:
    my @weighted = sort{ $$b[0] <=> $$a[0] } @unsorted;
    
    return @weighted;
}
[download]

Yes, I know. One big ugly sub is bad. I just didn't feel like fixing it at the moment. The same goes for creating links like that in the middle, it is just for the display, but should have no impact on the sorting. :)

Other suggestions might include doing such a search and present the five(?) best search results when someone previews a post to SOPW. Something like: "Something similar has already been posted, maybe some of these will answer this question?" with an option to view the whole list as well? Or would the impact be too high?

While I am at it, the HTML on the search page should really have <ul> ... </ul> around the search result, and should close the <li> tags. If I may be picky. :)

Well, anyhow, maybe this could be put to some use? Otherwise, it is also available (such as it is) at http://dogandpony.perlmonk.org/cgi-bin/search_pm.pl.

You have moved into a dark place.
It is pitch black. You are likely to be eaten by a grue.

Comment on Weighted results on a perlmonks search Select or Download Code

Replies are listed 'Best First'.
Re: Weighted results on a perlmonks search by elusion (Curate) on May 08, 2002 at 20:58 UTC
More than weighted results, I'd love to see results sorted by thread, with child nodes indented. Like this: Parent Node Re: Parent Node <ul Re:Re: Parent Node Re: Parent Node Re: Parent Node Re: Parent Node Re: Parent Node Other Node Re: Other Node I don't have any code for this, obviously, but there should be a solution using parent node ID's. If someone was ambitious, maybe it could give us the author in another column to the right. elusion : http://matt.diephouse.com	[reply]
(ar0n) Re: Weighted results on a perlmonks search by ar0n (Priest) on May 09, 2002 at 01:22 UTC
I fixed Search results so that it now displays the results as a list sorted by hits per node. This is only a temporary fix, as I intend to sophistimacate it a bit.	[reply]
Re: (ar0n) Re: Weighted results on a perlmonks search by Dog and Pony (Priest) on May 09, 2002 at 11:39 UTC
Looks better already (although the fixes broke my code ;) ). I still suggest some kind of weighting like my suggestion, together with what updates you intend. I guess it is just us Google users that has learnt to formulate our queries so every word counts - that is partly why I got what I judged to be better results, of course. You have moved into a dark place. It is pitch black. You are likely to be eaten by a grue.	[reply]
I've always wanted to do this. by boo_radley (Parson) on May 08, 2002 at 23:35 UTC
`<merlyn>`Perlmonks Magic 8 ball. `</merlyn>`	[reply] [d/l] [select]
Re: I've always wanted to do this. by Dog and Pony (Priest) on May 09, 2002 at 20:11 UTC
Laughs Who'd have thought it? :) Change to Super search: There are certainly some serious issues with this new search technique (including generating "Server Error"s -- no need to report those). You might be happier using merlyn's writeups or Google's advanced search against merlyn's WebTechniques. ;-) Can we add that? grin You have moved into a dark place. It is pitch black. You are likely to be eaten by a grue.	[reply]