in reply to Re: Faster Search Engine
in thread Faster Search Engine

thanks tachyon, i will buy that book. but can you also recommend something that's not too boring and is kind to beginners like me? i'm also planning to take some classes on perl and cgi; i wonder if there are any offered here in new york city. do you know of any?

i am not using this search engine for http://www.textcentral.com; it is a totally different one. if you wanna see what i've done so far, it's at http://alleba.dreamhost.com. my site is called 'alleba', a search engine and web directory. it is true that it searches through a flat text file, called links.db. but what do you mean by 'use a real database or generate an index that you can search'? how could i do this?

i'm also posting my links.db and category.db and several others to give you the whole schematic of the script. everything can be found at: http://alleba.dreamhost.com/scripts.
nph-build.cgi -- rebuilds all the category .html files as well as the homepage (index.html) and directory page (dir.html). i'm not sure if this is relevant to my question.

db-utils.pl -- i think this file contains some subs that search.cgi uses, such as for sorting links and categories.

search.cgi -- this file does the actual search in links.db and category.db.

site_html_templates.pl -- contains the templates that several generated pages are based on, like the search results page. nph-build also builds category pages based on several elements in this file.

links.db -- the links database, contains the ID, title, keywords etc.

category.db -- contains all the category and subcategory names.

links.def -- contains the field assignments for each piece of information, e.g. title = 1, description = 5.

links.cfg -- here all the important settings are set, such as the absolute paths and urls that each .cgi and .pl file relies on.
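to make the file descriptions above concrete, here is a small sketch of how a script could use links.def-style field assignments to pull fields out of a links.db record. the delimiter, the field positions, and the record itself are all made up for illustration and may not match the real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical field map in the spirit of links.def:
# field name => column position in each links.db record.
my %field = ( ID => 0, Title => 1, Description => 5 );

# One made-up, pipe-delimited record; the real links.db format
# may differ -- this only shows the lookup by field number.
my $record = "12|PlanetGimmick|gimmick|toys|fun|A site about gimmicks|http://example.com";

my @cols = split /\|/, $record;
print "Title: $cols[$field{Title}]\n";             # Title: PlanetGimmick
print "Description: $cols[$field{Description}]\n"; # Description: A site about gimmicks
```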

hopefully these files will enlighten everyone. thanks, looking forward to your replies.

drewboy

Replies are listed 'Best First'.
Re: Re: Re: Faster Search Engine
by tachyon (Chancellor) on Jul 22, 2001 at 14:24 UTC

    Here is a really basic search application for you. In this script you are prompted for a search string, but this could easily be CGI input. Note that the quotemeta will escape most chars with a \, which 1) makes the string safe to use in the grep regex and 2) helps thwart hackers. *Do not interpolate a user-supplied string into a regex without the quotemeta.* It then greps out all the lines that contain that string and stores them in an array. The /i makes the search case insensitive. It looks for an exact match only and will not understand boolean logic.
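A minimal illustration of the quotemeta point on its own, with a made-up search string and two made-up database lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A malicious-looking "user" search string full of regex metachars.
my $find = 'foo.*(';

# Without quotemeta this would be an invalid (or dangerous) pattern;
# with it, every metachar is backslash-escaped and matched literally.
my $safe = quotemeta $find;
print "$safe\n";    # foo\.\*\(

my @db   = ( "has foo.*( literally\n", "plain foo bar\n" );
my @hits = grep { /$safe/i } @db;
print scalar(@hits), " match(es)\n";    # 1 match(es)
```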

    Using your 70KB 'links.db' text file as the data and searching for 'PlanetGimmick', which is the last entry in the file, it takes 0 seconds to run. If you ramp up and do the search 10,000 times, so that we run long enough to get a valid time, it takes 161 seconds, or 16.1 milliseconds per search. This is on an old PII 233MHz, 64MB RAM, Win95, Perl 5.6 system (my laptop). I expect this is fast enough for most practical purposes. Once you have the matching lines in an array you can do whatever processing you want to them. The advantage is that you only process those lines that have matched your search criteria.

    #!/usr/bin/perl -wT
    use strict;

    # clean up the environment for CGI use
    delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
    $ENV{'PATH'} = '/bin:'; # you may need more path info

    my $db_file = 'c:/links.db';

    print "Find what? ";
    chomp(my $find = <>);

    # this escapes regex metachars and makes it safe
    # to interpolate $find into the regex in our grep.
    $find = quotemeta $find;

    # this untaints $find - we have made it safe above
    # using the quotemeta, this satisfies -T taint mode
    $find =~ m/^(.*)$/;
    $find = $1;

    my $start = time();

    open (FILE, "<$db_file") or die "Oops can't read $db_file Perl says $!\n";
    my @db_file = <FILE>; # get the whole database into an array in RAM
    close FILE;

    # do the search
    my @lines = grep { /$find/i } @db_file;

    my $time = time() - $start;
    print "Search took $time seconds\n";

    if (@lines) {
        print "Found\n@lines\n";
    }
    else {
        print "No match\n";
    }

    I expect this should solve your problem as it is plenty fast enough. It should scale in a linear fashion, i.e. twice as big a file == twice as long a search. The scaling will break down when your file becomes larger than can be stored in main memory in an array and the operating system resorts to using swap space on disk as virtual RAM. If you get this big, send me some options in the IPO, OK!
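For timing runs like the 10,000-search test above, the standard Benchmark module can do the counting for you. This is only a sketch with made-up data, not the actual links.db run:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethis);

# Made-up stand-in for links.db: 2000 one-line records held in RAM.
my @db   = map { "record $_|some|fields|here\n" } 1 .. 2000;
my $find = quotemeta 'record 2000';

# Run the grep search 1000 times and report the elapsed time;
# divide by the count to get the per-search cost, as in the post.
timethis( 1000, sub { my @hits = grep { /$find/i } @db } );
```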

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      thanks, but i'm a little overwhelmed with everything that you wrote(!). i don't run my site on my own server, if that's what you're presuming (just in case).

      how do i implement your code on my site (running on unix -- at dreamhost)? does that mean i have to replace my search.cgi file? or is your script for the purpose of caching the results on my system so that my current search.cgi will perform faster?

      sorry for sounding stupid, i am not that good with perl/cgi. please tell me exactly what to do with the script that you offered. thanks for taking the time to help me out!!

      drewboy

        Yes, this is a program that you can currently run on your home computer (you'll need perl -- see New Monks to get it). Ultimately any search program will need to run on the server. It is not configured as a CGI at the moment but could easily be. What sort of results do you want the search to return? The domain name with links to that domain, or something else? If you cannot get scripts and modules installed on the server, let me know whether they have the modules 'CGI.pm' and 'HTML::Template' installed. Ask the systems administrator.
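A CGI version of the grep search might look roughly like this. This is only a sketch: the 'query' parameter name and the db path are assumptions, not part of the script above, and the real path would come from links.cfg.

```perl
#!/usr/bin/perl -wT
use strict;
use CGI;

# Clean the environment, as in the command-line script above.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
$ENV{PATH} = '/bin';

my $q = CGI->new;

# 'query' is an assumed parameter name, and the path below is a
# placeholder -- use whatever links.cfg says the real path is.
my $find = quotemeta( $q->param('query') || '' );
($find) = $find =~ /^(.*)$/s;    # untaint; safe after quotemeta

open my $fh, '<', '/path/to/links.db'
    or die "can't read links.db: $!";
my @hits = grep { /$find/i } <$fh>;
close $fh;

print $q->header('text/plain');
print @hits ? @hits : "No match\n";
```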

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print