Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Search engine

by larsen (Parson)
on Aug 01, 2000 at 16:36 UTC ( [id://25440]=perlquestion: print w/replies, xml ) Need Help??

larsen has asked for the wisdom of the Perl Monks concerning the following question:

I recently wrote a small search engine for a site. It consists in two script: a script that reads up HTML pages to create an index (some db-files) and a CGI script to consult this index. When the user enter a word, CGI script looks up the hash table in order to pick the corresponding ID. Then, it searches for that ID in another hash to find files containing that word. Something like this:
$id = $Words_db{$word}; foreach $i (keys %Index_db) { if ($i == $id) { @fileId = split( /:/, $Index_db{$i}); foreach $fId (@fileId) { # ... } } }
It works just fine, thanks to hash tables. Now I'd like to allow users to write only pieces of words to perform the search (e.g.: "man" will match "man" and "maniac"). In this case, I'd have to modify the code. Something like:
my $piece; foreach (keys %Words_db) { if ( ... ) { # if $piece is a substring of $_ ... } else { $piece does not occur in $_ ... } }
I didn't try that, because it seems to be too inefficient. I'd be glad to see your suggestion. Thank you. Larsen

Replies are listed 'Best First'.
Re: Search engine
by davorg (Chancellor) on Aug 01, 2000 at 16:39 UTC

    I think that you're about to find the limits of using a dbm file for stuff like this.

    You should seriously consider using a relational database.

    --
    <http://www.dave.org.uk>

    European Perl Conference - Sept 22/24 2000, ICA, London
    <http://www.yapc.org/Europe/>
RE: Search engine
by nuance (Hermit) on Aug 01, 2000 at 17:02 UTC
    foreach $i (keys %Index_db) { if ($i == $id) { @fileId = split( /:/, $Index_db{$i}); . . . } }

    Why are you iterating over the keys of this hash? surely you should just do:

    $id = $Words_db{$word}; @fileId = split( /:/, $Index_db{$id});

    And that would have the same effect in a much shorter time.

    Why do you have two hashes if the first only holds an index into the second. Reading over this it looks like you were previously using an array for Index_db, and that the hash held the array index where you could find the information. I think you should look at the code again as you can probably do without the Words_db hash, and clean up the logic to exploit the hash you are using.

    Nuance

      Oh yes, surely. I think this is a very old version of my code (by the way, I think one has to be very stupid to write something like this).
      I hope there's a way to decrease monks' experience :) Thank you.
      Larsen
        You don't have to be stupid to write that, if you've been dealing with a problem for a long time sometimes you forget why you made certain implementation decisions. Often a "fresh" pair of eyes see things that you missed because you are "too familiar" with them.

        Nuance

Re: Search engine
by maverick (Curate) on Aug 01, 2000 at 19:13 UTC
    Check out the docs for Berkeley DB, (perldoc DB_File). Aside from being one of the faster DBM formats, it has support for duplicate keys, and partial key matches. I've only tinkered with this briefly, but it may do what you need.

    /\/\averick

Re: Search engine
by lhoward (Vicar) on Aug 01, 2000 at 16:52 UTC
    Rather than fuzz in the search, it may be easier to fuzz in your database. i.e. if a page contains "maniac" make entries for both "man" and "maniac" (and whatever variations you want) in the DB so you can never have to do aproximate searching in your search code (since its already built into your DB). Your DB will be larger but the performance will be better.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://25440]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2024-04-25 05:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found