in reply to Faster Search Engine

Hi, to be frank, it seems like you need a real search engine. This is covered in some detail in the book CGI Programming with Perl from O'Reilly, aka the rat book. I recommend it to you.

From the code you have posted, it seems you are searching a flat text file from beginning to end, which, as you note, does not scale well: it gets slower and slower the bigger the file gets. The basis of a fast search is to either use a real database or generate an index that you can search (usually via a hash key). You do the processing in advance when you generate this index, and then your CGI searches the index to find what you want. The aim is to limit the processing that needs to be done in real time (i.e. by the CGI) so things happen fast from the user's point of view.
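To make the index idea concrete, here is a minimal sketch (the pipe-delimited field layout and the sample records are invented for illustration; adapt the split to your real data file). An offline pass builds a hash mapping each lower-cased keyword to the record IDs that contain it, and a query then becomes a single hash lookup instead of a full file scan:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical offline indexer sketch. The index maps each
# lower-cased word from the title and keyword fields to the list
# of record IDs that contain it. Reading from DATA keeps the
# sketch self-contained; a real indexer would read links.db.
my %index;
while (my $line = <DATA>) {
    chomp $line;
    my ($id, $title, $keywords) = split /\|/, $line;
    for my $word (split /\W+/, lc "$title $keywords") {
        next unless length $word;
        push @{ $index{$word} }, $id;
    }
}

# At query time the CGI does one hash lookup instead of a scan.
my $query = lc 'graphics';    # would come from CGI input for real
my @ids = $index{$query} ? @{ $index{$query} } : ();
print @ids ? "Matched IDs: @ids\n" : "No match\n";

__DATA__
1|PlanetGimmick|gimmick fun toys
2|Alleba|search directory philippines
3|GimpSite|graphics gimp images
```

In a real setup you would build %index in a separate script, run whenever the data file changes, and save it to disk (e.g. with a module like Storable from CPAN) so the CGI only has to load and probe it.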

A good free search engine you can incorporate into your site is available from http://www.whatuseek.com/ I use this for cheap and cheerful searches on shoestring sites. You can customise the results page into a format that matches your site. The downsides are ads and a page limit. You can see an example of this in action here. It is not as good as your own Perl engine could be, but it is fast and easy to set up. View the source to see how the search box links to the search engine.

Good luck. If you post more code, or preferably links to what you currently have, we may be able to suggest how to speed it up for you. It is not really clear what you want to search for.

Update

Forget the database. I have written a little search app for you that will grep out all the lines in your data file that match a given search criterion in about 16 milliseconds. See below. This should be fast enough for you :-)

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Re: Faster Search Engine
by drewboy (Sexton) on Jul 22, 2001 at 13:00 UTC
    Thanks tachyon, I will buy that book. But can you also recommend something that's not too boring and is kind to beginners like me? I am also planning to take some classes on Perl and CGI; I wonder if there are any offered here in New York City. Do you know of any?

    I am not using this search engine for http://www.textcentral.com; it is a totally different one. If you want to see what I've done so far, it's at http://alleba.dreamhost.com. My site is called 'Alleba', a search engine and web directory. It is true that it searches through a flat text file, called links.db. But what do you mean by 'use a real database or generate an index that you can search'? How could I do this?

    I'm also posting my links.db and category.db and several others to give you the whole scheme of the script. Everything can be found at: http://alleba.dreamhost.com/scripts.

    nph-build.cgi -- rebuilds all the category .html files as well as the homepage (index.html) and the directory page (dir.html). I'm not sure if this is relevant to my question.

    db-utils.pl -- I think this file contains some subs that search.cgi uses, such as those for sorting links and categories.

    search.cgi -- this file does the actual search in links.db and category.db.

    site_html_templates.pl -- contains the templates on which several generated pages, such as the search results, are based. nph-build also builds category pages based on several elements in this file.

    links.db -- the links database; contains the ID, title, keywords, etc.

    category.db -- contains all the category and subcategory names.

    links.def -- contains information on the field assignments for each piece of information, e.g. title 1, description 5.

    links.cfg -- here all the important settings are set, such as the absolute paths and URLs that each .cgi and .pl file relies on.

    Hopefully these files will enlighten everyone. Thanks, looking forward to your replies.

    drewboy

      Here is a really basic search application for you. In this script you are prompted for a search string, but this could easily be CGI input. Note that quotemeta will escape most characters with a \, which 1) makes the string safe to use in the grep regex and 2) helps thwart hackers. *Do not interpolate a user-supplied string into a regex without quotemeta.* The script then greps out all the lines that contain the string and stores them in an array. The /i makes the search case-insensitive. It looks for an exact match only and will not understand boolean logic.

      Using your 70 KB links.db text file as the data and searching for 'PlanetGimmick', which is the last entry in the file, it takes 0 seconds to run. If you ramp up and do the search 10,000 times, so that we run long enough to get a valid time, it takes 161 seconds, or 16.1 milliseconds per search. This is on an old PII 233 MHz, 64 MB RAM, Win95, Perl 5.6 system (my laptop). I expect this is fast enough for most practical purposes. Once you have the matching lines in an array you can do whatever processing you want to them. The advantage is that you only process those lines that have matched your search criterion.

      #!/usr/bin/perl -wT
      use strict;

      # clean up the environment for CGI use
      delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
      $ENV{'PATH'} = '/bin:'; # you may need more path info

      my $db_file = 'c:/links.db';

      print "Find what? ";
      chomp(my $find = <>);

      # this escapes regex metachars and makes it safe
      # to interpolate $find into the regex in our grep.
      $find = quotemeta $find;

      # this untaints $find - we have made it safe above
      # using the quotemeta, this satisfies -T taint mode
      $find =~ m/^(.*)$/;
      $find = $1;

      my $start = time();

      open (FILE, "<$db_file") or die "Oops can't read $db_file Perl says $!\n";
      my @db_file = <FILE>; # get the whole database into an array in RAM
      close FILE;

      # do the search
      my @lines = grep { /$find/i } @db_file;

      my $time = time() - $start;
      print "Search took $time seconds\n";

      if (@lines) {
          print "Found\n@lines\n";
      } else {
          print "No match\n";
      }

      I expect this should solve your problem, as it is plenty fast enough. It should scale in a linear fashion, i.e. twice as big a file means twice as long a search. The scaling will break down when your file becomes larger than can be stored in main memory in an array and the operating system resorts to using swap space on disk as virtual RAM. If you get that big, send me some options in the IPO, OK!
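      As an aside, if the file ever does outgrow RAM, the slurp can be avoided with a line-by-line variant of the same grep: only the matching lines are kept in memory, so memory use tracks the number of hits rather than the file size. A minimal sketch (it reads from the script's own DATA section just to stay self-contained; in the real script this would be the open on $db_file as above, with $find built from quotemeta'd user input):

```perl
#!/usr/bin/perl -w
use strict;

# Line-by-line variant of the same search: each line is tested as
# it is read, and only matches are kept, so this degrades gracefully
# on files too large to slurp into an array.
my $find = quotemeta 'PlanetGimmick';   # already-escaped search string

my @lines;
while (my $line = <DATA>) {
    push @lines, $line if $line =~ /$find/i;
}

print @lines ? "Found\n@lines" : "No match\n";

__DATA__
1|Alleba|search directory
2|PlanetGimmick|gimmick fun toys
```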

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        Thanks, but I'm a little overwhelmed by everything that you wrote(!). I don't run my site on my own server, if that's what you're presuming (just in case).

        How do I implement your code on my site (running on Unix, at Dreamhost)? Does that mean I have to replace my search.cgi file? Or is your script meant to sort of cache the results on my system so that my current search.cgi will perform faster?

        Sorry for sounding stupid; I am not that good with Perl/CGI. Please tell me exactly what to do with the script that you offered. Thanks for taking the time to help me out!!

        drewboy