Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Help on building a search engine (design / algorithms / parsing / tokenising / database design)

by bobtfish (Scribe)
on Jun 07, 2004 at 10:59 UTC ( #361937=perlquestion: print w/replies, xml ) Need Help??

bobtfish has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

I've recently been working on a project to mirror various file sources from web sites. I have got a number of mirror scripts that work out the files that need to be mirrored, and do the mirroring. Once mirrored, each file has a number of tests performed on it (to try and classify it), and then an entry is entered into an SQL database for that file. The files can be of many different types, however the most common types are:
  • Text (English language) files
  • Language Source files (c, perl, php are the most common, although shell, c#, java etc do exist also)
  • Compressed files (gzip, bzip2)
  • Archive files (tar, gzip, cpio)
The number of files is likly to grow pretty large over the lifetime of this project (think a couple of million)..

What I want to be able to do is a Google style search across all the files meta data and their contents (or probably just comments extracted from the file for program files).

I know that building an inverse index of the data and searching that is probably the way to go, and I have an initial implementation that works quite nicely.
Therefore my first question: Can anyone recommend a comment parser that I can throw an arbitary(ish) language file and get back all the comments?

Update: Mostly solved by matching comments as suggested by andyf. However if someone wants to point me at a packaged (CPAN?) solution that'd still be welcome.

At this point there is meta-information about this file:
  • file_id - A unique id for this file
  • file_source - An integer that links to another table giving details of there the file came from etc.
  • md5hash - The md5 hash of the file
  • description - Some of our file sources provide their own meta-data about the files (in current cases, this is just a textual / HTML description)
  • comment - User comments about this file (initially blank)
I have built a search engine for just comments / descriptions, however I'm not sure my design would stand up to full contents or a massive amount of data (however it's a shedload faster than the database's regex search currently).

So, I have three tables:
  • files - contains file_id and comment/description
  • search_files_tokens - contains normalised tokens for search and token_id
  • search_files_to_tokens - contains a token_id, a file_id and a count of the number of times this token appears in this file.
What I do is as follows:
  • For each file, take comments and descriptions and contatenate them into $comment.
  • Split on whitespace into individual tokens.
  • Normalise each token (remove punctuation, tr/a-z/A-Z/ etc)
  • For each new token, create a token in the database (returning a unique token_id).
  • For each token found in the comment, insert a row into search_files_to_tokens.
I can now lookup a unique token, and find the files that contain it most often.

The code that does this looks like:
my $known_tokens; while ($comment = get_next_file_comments()) { my $tokens; my @possible_tokens = split(/\s+/s, $comment); foreach (@possible_tokens) { my $hs = HTML::Strip->new(); #HTML::Strip segfaults without ne +w instance each time $_=$hs->parse($_); /[\w\d\.\/\\]+/) or next; tr/a-z/A-Z/; s/[\s\',\!\"\\xa3\#\$\%\^\&\*\\=\+\:\'\`\(\)\{\}\[\]\.\/\\\-\_]//g +; next if ($_ eq ""); if (!defined($tokens->{$_})) { $tokens->{$_}=1; } else { $tokens->{$_}++; } } foreach (keys(%{$tokens})) { my $token_id; if (defined($known_tokens->{$_})) { $token_id = $known_tokens->{$_}; } else { my $res = $gettokenid_sth->execute($_); ($gettokenid_sth->rows() > 0) and do { ($token_id) = $gettokenid_sth->fetchrow_array(); }; if (!defined($token_id) or !$token_id) { $addtoken_sth->execute($_); $gettoken_sth->execute($_); ($token_id) = $gettokenid_sth->fetchrow_array(); } $known_tokens->{$_}=$token_id; } $addtokeninfile_sth->execute($file_id, $token_id, $tokens->{$_ +}); } }

All reasonably simple, and reasonably sane (other than the fact that my tokeniser is a bit rubbish currently)..

Now, when I come to search, I allow the user to input a phrase such as:

(linux or sun) and log

I parse the search query into bits I know about; (,),and,or,not and bits I don't; linux, sun, log

For anything I don't know about, I tokenise it (as above) then search for that token (the case of a non-existant token adds lots of mess, but is reasonably easy to handle, so won't be considered here).

Once I have a token_id for each token, I then build an SQL query along the lines of:

SELECT file_id FROM search_files_to_tokens WHERE token_id = $token_id

for each file, and chain them together with UNION, INTERSECT and EXCEPT

For example, if tokens for linux = 1, sun = 2 and log = 3, query will be:

((SELECT file_id FROM search_files_to_tokens WHERE token_id = 1) UNION ((SELECT file_id FROM search_files_to_tokens WHERE token_id = 2)) INTERSECT (SELECT file_id FROM search_files_to_tokens WHERE token_id = 3))

This gives me back a pile of rows containing a file_id for files that match the search criteria. I can then do another simple (and fast) query which gives me back the counts for each token/file instance.

Taking that data, I can easilly iterate through it and provide each file with an overall score. My scoring system is simple:

For each token in this file, add (number of times this token appears in this file / number of times token appears in the result set as a whole) and this is your score.

I then sort the result set by score and display.

This generally works very well, however for queries with common tokens (that produce a large result set), it can be quite slow (but still faster than regex search). Also, search speed can change drastically with query order. If I start indexing comments in files / text file contents, it's going to get unusable..

Questions:
  • Is my database schema sane, (or, sane but dumb)?
  • Is my search algorithm sane / can it be optimised easilly (or again, just fine, but dumb)?
  • Am I going about this totally the wrong way?
  • Is this the optimum method, and I should just get faster hardware?
I know that a simple optimisation would be to throw out the top five to ten tokens (going to do this, just not implemented yet) and maybe throwing away any token that only occurs on one file (this would produce a massive optimisation, however it scares me as people will want to search for specific things that they know exist by name)


I look forward to comments / questions from the Monastry as my knowlegde of complex (statefull) parseing is pretty limited, and I've never tried a search algorithm before..

Caveat: There is a command line (perl) search interface, however I must also be able to create a php search interface. Therefore using CPAN modules to do the parsing etc is good, using CPAN to do tokenisation or search just isn't going to happen unless the module is so simple (or easy to read) that re-implementing it in php (in a reeasonably small amount of time) is viable.

Update: Added <readmore> tags as I realised this post is huge and added answer to the first question so people don't waste their time searching for me.
  • Comment on Help on building a search engine (design / algorithms / parsing / tokenising / database design)
  • Download Code

Replies are listed 'Best First'.
Re: Help on building a search engine (design / algorithms / parsing / tokenising / database design)
by inman (Curate) on Jun 07, 2004 at 11:28 UTC
    The problem that you are trying to solve is traditional word based search which is probably more suited to a search engine than a SQL database. A typical search engine will allow you to index content such that an efficient word index is created in addition to a relational style table to store the meta-data for each document.

    The search engine approach involves indexing the files as a preparation process followed by searching. The indexing part will require that the files are read (spidered) and filtered to extract content and metadata. Once the index is prepared, the search engine's query interface can be used to get reults.

    Links to a couple of sites:
    http://www.searchtools.com/index.html
    http://searchenginewatch.com/

      Yes. That's what my code does.

      It builds an inverted index of all the content I want to search and then queries that.

      It's the searching this index that is the part I need to optimise.

      Thanks for the links, however I can't find anything helpfull and low level enough that doesn't only cover the problems I've already solved.. (TBH, I can't find anything with actual code / alogrithms that isn't a back of cigarette packet style demonstration.. My code can already do complex and/or/not searches with arbitary nesting using () in the search and any number of search terms.)
        Try Perlfect Search. A full search engine implemented in Perl so you get the source code and everything!

        If you have the space, one very effective way of speeding up the searching of your inverted index is to index it!

        Once you have created your inverted index, you then create a second index from the first. This indexes pairs of words. The keys are pairs of words from your primary index. The values are the pages that contain the pairings. This vastly reduces the number of pages associated with each key. The cost is the huge number of keys.

        A partial solution is to only pair unusual (low hit count words) with common (high hit count words) once you have excluded all the really common words ('a', 'the', 'it' etc.).

        If the search doesn't include any uncommon words, the secondary index doesn't help, but you find that out very quickly, and there is no alternative than going through all the hits.

        If the search consists of only uncommon words, then the results from the primary index will be minimal anyway.

        But when the search incldes one or more common and one or more uncommon, the process of intersecting the huge list from the common and the small list of uncommon at runtime is expensive. Pre-processing these can substantially reduce the runtime costs.

        It's fairly easy to setup but requires a substantial amount of (pre-)processing power to maintain.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: Help on building a search engine (design / algorithms / parsing / tokenising / database design)
by andyf (Pilgrim) on Jun 07, 2004 at 11:45 UTC
    This node matching comments by Eugene has some pointers to comment extraction - be warned it's not as easy as it seems to get all the special weird cases, especially where there are multi-line comments or worse nesting coments. I would make the comment extractor conservative and cautious so you don't end up with a great tract of source code in your db by mistake.
    Andy
      Just what I was looking for, thanks :)

      Now I've just got the how to make it lightning fast question to resolve.
Re: Help on building a search engine (design / algorithms / parsing / tokenising / database design)
by perrin (Chancellor) on Jun 07, 2004 at 14:37 UTC
      Thanks, they're all more usefull than anything I'd found. Shame you have to not just register but pay money for the DDJ one.
Re: Help on building a search engine (design / algorithms / parsing / tokenising / database design)
by iburrell (Chaplain) on Jun 07, 2004 at 15:57 UTC
    Have you looked at Plucene, http://search.cpan.org/dist/Plucene/, a Perl port of the Lucene search engine.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://361937]
Approved by Happy-the-monk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (2)
As of 2022-05-28 16:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (99 votes). Check out past polls.

    Notices?