Search engine code critique

ruhk has asked for the wisdom of the Perl Monks concerning the following question:

Hellow fellow monks!

Let me start out with admitting that the code you are about to see is most likely horrible, messy, and overly complicated. Ok, with that out of the way, I wrote a Major Search script for the University and it works, however I am interested in cleaning up the code and learning what not to do in the future. I think a good way are to have some bored monks take a look at it and tell me what I did thats wrong.

Ok, on the what the search does. It searches a database that has a few fields, such as the major name, the description of the major, and a few links where you can find faq sheets online. I have the script first try to find a match in the title, and if it does I weigh that heavily. Next, I find a whole word match in the description, and I weight that a bit less. Finally, I find any partial matches in the description and weigh that even less. I add it all up and sort them by the weighted hit count. You can find it at this page (link removed by Ovid until the security hole is fixed).

Finally, the important part. The code. Please do not be gentle with it, I am prepared to have it ripped apart. Thanks in advance monks!

#!E:/perl/bin/perl.exe -w
use strict;

use DBI;
use CGI;
use HTML::Template;
use Data::Dumper;
use Text::ParseWords;
use CGI::Carp qw(fatalsToBrowser);

my @results;
my $searchStuff;
my $start;
my $major;
my %nav;
my $even = 0;
my @mini;
my $description;
my $booleanflag = 1;
my $SCRIPTLOC = 'http://...';
my $dbh = _connectToDB();

getParameters();
searchDatabase();
displayResults( $searchStuff );

sub getParameters
{
    my $query = new CGI;
    print $query->header(-type=>'text/html');
    $searchStuff = $query->param('search');
    $start = $query->param('start') || 0;
}

sub _connectToDB
{
   use Pass::PassInterface;
   my $interface = newDB PassInterface();
    my $SQL_SERVER = 'hydrogen\\devcgi';
    my $TCPIP = '192.168.31.154:1343';
    my $USERNAME = 'Web_Team';
    my $PASSWORD = $interface->getPass($USERNAME);
    my $DATABASE = 'Steve';
    my $DBH;
    #DBI->trace(2, 'DBIDebug.txt');
    die ('Couldn\'t open database.' . $DBI::errstr. $@ )
        unless (eval ('$DBH = DBI->connect( \'dbi:ODBC:driver={SQL Ser
+ver};server='.$SQL_SERVER.';tcpip='.$TCPIP.';database='.$DATABASE.';u
+id='.$USERNAME.';pwd='.$PASSWORD.';\',\''.$USERNAME.'\',\''.$PASSWORD
+.'\',{\'LongTruncOk\' => 0, LongReadLen => 512 * 1024 } );'));
    return $DBH;
}
sub searchDatabase
{
    my @words = shellwords($searchStuff);
    my $searchstring = 'SELECT * from MajorSearch WHERE ';
    my $onlyonce = 1;
    foreach my $tempword ( @words )
    {
        if ( $onlyonce == 1 )
        {
            $searchstring .= "MajorDescription LIKE '%$tempword%' OR M
+ajorName LIKE '%$tempword%'";
            $onlyonce = 0;
        }
        else
        {
            $searchstring .= " OR MajorDescription LIKE '%$tempword%' 
+OR MajorName LIKE '%$tempword%'";
        }
    }
    my $sth = $dbh->prepare( $searchstring);
    $sth->execute() or die "Couldn't execute: '$DBI::errstr'";
    my $results = $sth->fetchall_arrayref({});
    my $hits;
    my $titlehits;
    my $weightedhits;
    @words = shellwords($searchStuff);
    foreach my $result ( @$results )
    {
        foreach my $currentword ( @words )
        {
            $hits += ($$result{'MajorDescription'} =~ s/($currentword)
+/$1/gi);
            $weightedhits += ($$result{'MajorDescription'} =~ s/\b($cu
+rrentword\b)/$1/gi) * 25;
            $hits += $weightedhits;
            $titlehits += ($$result{'MajorName'} =~ s/($currentword)/$
+1/gi) * 50;
            $hits += $titlehits;     
            $$result{ 'hits'} +=  $hits;
            $hits = 0;
            $titlehits = 0;
            $weightedhits = 0
        }
    }
    for ( my $i = scalar(@$results) - 1; $i >= 0; $i-- )
    {
        for ( my $j = 1; $j <= $i; $j++)
        {
            if ( $$results[$j - 1]{'hits'} < $$results[$j]{'hits'} )
            {
                my $temp = $$results[$j - 1];
                $$results[$j - 1] = $$results[$j];
                $$results[$j] = $temp;
            }
        }
    }
    $nav{'xCount'} = @$results;

    if ( $nav{'xCount'} <= 10 )
    {
        $nav{'viewStart'} = $nav{'xCount'} < 1 ? 0 : 1;
        $nav{'viewEnd'} = $nav{'xCount'};        
    }
    else
    {
        $nav{'viewStart'} = $start > $nav{'xCount'} ? $nav{'xCount'} :
+ $start <= 0 ? 1 : $start ? $start : 1;
        $nav{'viewEnd'} =  $nav{'viewStart'} + 9 > $nav{'xCount'} ?  $
+nav{'xCount'} : $nav{'viewStart'} + 9;        
        $nav{'previousStart'} = $nav{'viewStart'} <= 1 ? undef : $nav{
+'viewStart'} > 10 ? $nav{'viewStart'} - 10 : 1;
        $nav{'nextStart'} = $nav{'viewEnd'} >= $nav{'xCount'} ? undef 
+: $nav{'viewEnd'} + 1;
        $nav{'previous'} = $nav{'viewStart'} - $nav{'previousStart'} i
+f ( $nav{'previousStart'} );
        $nav{'next'} = $nav{'nextStart'} + 9 > $nav{'xCount'} ? $nav{'
+xCount'} - $nav{'nextStart'} + 1 : 10 if ( $nav{'nextStart'} );
    }
    
    if ( ($nav{'viewEnd'} - $nav{'viewStart'} + 1) % 2 == 1 )
    {
        $even = 1;
    }
    else
    {
        $even = 0;
    }
    if ( $nav{'xCount'} > 0 )
    {
        if ( $nav{'xCount'} > 10 )
        {
            @mini = splice( @$results , $nav{'viewStart'} - 1 , $nav{'
+viewEnd'} - $nav{'viewStart'} + 1 );
        }
        else
        {
            @mini = @$results;
            
        }
    }
    else
    {    
        @mini = @$results;
        $booleanflag = 0;
    }
    
    for ( my $i = 0; $i < scalar(@mini) ; $i++ )
    {
        delete $mini[$i]{'MajorDescription'};
        delete $mini[$i]{'MajorKey'};
        delete $mini[$i]{'hits'};
        }
}
sub displayResults
{
    my ( $foundMessage ) = @_;
    my $template = HTML::Template->new(filename => 'out.tmpl', loop_co
+ntext_vars => 1);
    my $lessthanten = $nav{'xCount'} >= 10 ? 1 : 0;
    $template->param( 
        viewStart => $nav{'viewStart'} , 
        viewEnd => $nav{'viewEnd'} , 
        xCount => $nav{'xCount'} ,
        previousStart => $nav{'previousStart'} , 
        previous => $nav{'previous'} ,
        next => $nav{'next'} , 
        nextStart => $nav{'nextStart'} , 
        SCRIPTLOC => $SCRIPTLOC , 
        searchStuff => $searchStuff , 
        MATCHES => \@mini , 
        results => $booleanflag,
        even => $even,
        lessthanten => $lessthanten
    );
    print $template->output;
}
[download]

Edited 2003-09-18 by Ovid. Removed link to Web site until security hole can be fixed. I hate censoring posts, but I think this one is important. See my post below.

Edit by tye, add READMORE, remove other copy of URL, IP address

Comment on Search engine code critique Download Code

Replies are listed 'Best First'.
Re: Search engine code critique by dragonchild (Archbishop) on Sep 18, 2003 at 19:21 UTC
A few thoughts: Your dbh setup looks overly complicated. Could you explain why you couldn't do something like: `my $dbh = DBI->connect( $connect_string, $user, $pass, ); die "Couldn't connect to DBI\n" unless $dbh; $dbh->{RaiseError} = 1;` [download] Whenever I see a bunch of variables defined at the top, I cringe. Having one or two globals ... I can see. These would be things along the lines of $dbh and the like. 12 is a little much. Why does everyone need to know about $even? It's used in exactly two places. In fact, you calculate it in one place, then use the result in another. Why not just calculate it in the place you use it? At least, calculate it in the function you use it ... You don't check the return result of $dbh->prepare(), even though you're passing it a dynamically-generated SQL statement. You also don't call $sth->finish(). This isn't a problem, until you use prepare_cached(). It's just a good habit to get into. You can do the following type of replacement in a lot of places: `for ( my $i = 0; $i < scalar(@mini) ; $i++ ) { delete $mini[$i]{'MajorDescription'}; delete $mini[$i]{'MajorKey'}; delete $mini[$i]{'hits'}; } ---- foreach my $i (0 .. $#mini) { delete $mini[$i]{$_} for qw(MajorDescription MajorKey hits); }` [download] ------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.	[reply] [d/l] [select]
Re: Re: Search engine code critique by chromatic (Archbishop) on Sep 18, 2003 at 20:01 UTC
You also don't call $sth->finish(). This isn't a problem, until you use prepare_cached(). Why would it be a problem then? He uses `fetchall_*`.	[reply] [d/l]
Re3: Search engine code critique by dragonchild (Archbishop) on Sep 18, 2003 at 20:03 UTC
It's a good habit to get into, imho. Most times, you can omit the finish(), but sometimes it will bite you in the ass. Adding the finish() call never hurts, so I always do it. ------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.	[reply]
Re: Re3: Search engine code critique by tachyon (Chancellor) on Sep 19, 2003 at 02:04 UTC
Re: Search (details of exploit removed until security hole is plugged) by Ovid (Cardinal) on Sep 18, 2003 at 20:02 UTC
Update: Whoops! I just posted step by step instructions on how to hack into your box. Then I realized you posted the URL. I have taken the post down until you fix the security hole. In the event that you don't see this, I've sent email to the address listed on the error page I generated. You can also email me for details at poec at yahoo dot com. I will repost the details after this hole is plugged. Cheers, Ovid New address of my CGI Course.	[reply]
Re: Search engine code critique by Enlil (Parson) on Sep 18, 2003 at 19:43 UTC
`for ( my $i = scalar(@$results) - 1; $i >= 0; $i-- ){ for ( my $j = 1; $j <= $i; $j++){ if ( $$results[$j - 1]{'hits'} < $$results[$j]{'hits'} ){ my $temp = $$results[$j - 1]; $$results[$j - 1] = $$results[$j]; $$results[$j] = $temp; } } }` [download] This is the first thing that stands out. I am guessing that you want the ones with the most 'hits' first correct? Why not just use a simple sort. For example, I think this has the same functionality as the above code without me trying to figure out exactly what it was I was trying to accomplish a couple of months down the road (not that commenting it wouldn't help in this respect.) `@$results = sort{ $b->{'hits'} <=> $a->{'hits'} } @$results;` [download] Even if you don't want to take it this far you can exchange these couple lines `my $temp = $$results[$j - 1]; $$results[$j - 1] = $$results[$j]; $$results[$j] = $temp;` [download] with `($$results[$j - 1], $$results[$j]) = ($$results[$j],$$results[$j - 1])` [download] which is similar to one of the first things i remember seeing in the camel (the swapping without a temporary variable thing. Though there might be more things this just stuck out at me as more a C type thing than a Perl type thing -enlil	[reply] [d/l] [select]
Re: Search engine code critique by RMGir (Prior) on Sep 18, 2003 at 19:46 UTC
Here's my suggestion; whether it's an improvement or not I'll leave up to the judgement of the XP meter :) Where you have: `my $onlyonce = 1; foreach my $tempword ( @words ) { if ( $onlyonce == 1 ) { $searchstring .= "MajorDescription LIKE '%$tempword%' OR M +ajorName LIKE '%$tempword%'"; $onlyonce = 0; } else { $searchstring .= " OR MajorDescription LIKE '%$tempword%' +OR MajorName LIKE '%$tempword% +'"; } }` [download] Having "almost" the same SQL code in 2 branches of an if is very hard to maintain or debug correctly, in my opinion. Why not replace all of that with: `$searchstring=join " OR ", map "MajorDescription LIKE '\%$_\%' OR MajorName LIK +E '\%$_\%'", @words;` [download] join and map sometimes scare folks, but I think that's more readable and less fragile than your structure. If you don't like join/map, may I suggest this alternative? `my $or=""; foreach my $tempword ( @words ) { $searchstring .= " $or MajorDescription LIKE '%$tempword%' OR +MajorName LIKE '%$tempword%'"; # only set this once $or="OR" unless $or; }` [download] -- Mike	[reply] [d/l] [select]
Re: Search engine code critique by tachyon (Chancellor) on Sep 19, 2003 at 01:52 UTC
The lack of $dbh->disconnect is a potential disaster waiting to happen. Don't know what the SQL server defaults are but by default MySQL will accept 100 connections that will persist for 8 hours. Although it surprised me undefing $dbh (ie when the script finishes) does not cause a DESTROY{ ..disconnect .. ) to be called allowing the connections to persist and accumulate.....until it falls over with a too many connections error. This particular error often goes unnoticed until you get real load on a widget. One way to code it to make sure this does not happen is `my $sth; my $dbh->connect(...... END{ $sth->finish if $sth; $dbh->disconnect if $dbh } # now do whatever` [download] cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l]
Re: Re: Search engine code critique by bean (Monk) on Sep 19, 2003 at 03:26 UTC
If ruhk is running mod_perl under Apache, this isn't a problem - the DBI connect method will default to Apache::DBI::connect which keeps persistent connections (in this case, disconnect is overloaded and will not disconnect). If this is a CGI, connect_cached should be used - a persistent connection would be appropriate for this application. However, disconnect-ing doesn't hurt and is good form...	[reply]
Re: Re: Re: Search engine code critique by tachyon (Chancellor) on Sep 19, 2003 at 12:33 UTC
Actually if you use this syntax you will pile up persistent connections under mod_perl. Trust me. We have done it. In production. Disconnecting is much more than good form..... `my $dbh = DBI->connect...` [download] The connection is as you say persistent. This is immaterial. This will pile connections up running under mod_perl/vanilla/or off the command line. The my means that the handle to the connection disappears every time you re-exec your script (effectively a sub). The connection however remains..... If you don't believe me put it in a loop and you will quickly bring your DB to its knees. cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l]
Re: Re: Re: Re: Search engine code critique by bean (Monk) on Sep 19, 2003 at 19:47 UTC
Re: Search engine code critique by perrin (Chancellor) on Sep 18, 2003 at 19:37 UTC
This would probably be a lot faster if you could replace those LIKE queries with a full-text query.	[reply]