Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a small search engine in a Perl environment where I can't use databases; it has to be flat files.

The search engine structure I've come up with involves an index with a word,documentnumber,documentnumber structure on every line, like so:

alpha,0,1,2
bravo,2,3,4
charlie,3,4,5
(etc)

where the word "alpha" is in documents 0,1 and 2, the word "bravo" is in documents 2,3 and 4 and so on.

So the structure as it stands requires me to open a 500kb file with hundreds of words/lines in it, and process every line to see if it starts with the search term(s).

So it occurred to me yesterday that it might actually be easier to have a structure where I'd have hundreds of files, with the keyword as the filename and the list of documents as the content. So rather than (pseudocode):

open(large file)
for each line of the file {
    put it into an array
    see if item 0 of the array matches the search term
    grab the document list if it does
}
I could just do something like:
if (a file exists called "/www/search/$searchterm") {
    open it and grab the document list
}
so am I crazy or what?

I guess the factors are:

It's obviously a very messy solution, in that I'd have a folder stuffed with a large number of very small files, but in terms of doing less file-reading, less I/O, are there any gains?
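
For concreteness, here's roughly what the two lookups look like in real Perl (untested; file names are made up):

use strict;
use warnings;

# Approach 1: scan the whole index file line by line.
sub scan_index {
    my ($term) = @_;
    open my $fh, '<', '/www/search/index.csv' or die "index: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($word, @docs) = split /,/, $line;
        return @docs if $word eq $term;
    }
    return;
}

# Approach 2: one tiny file per keyword, named after the keyword.
sub lookup_keyword_file {
    my ($term) = @_;
    my $path = "/www/search/$term";
    return unless -e $path;
    open my $fh, '<', $path or die "$path: $!";
    chomp(my $list = <$fh>);
    return split /,/, $list;
}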

(Ovid) Re: Poor Person's Database
by Ovid (Cardinal) on Jun 20, 2001 at 09:31 UTC

    For a cleaner, yet slower, solution, check out DBD::CSV. This module allows you to use SQL with CSV files. Very handy if you don't have a database installed! It won't be terribly fast or robust, but it's certainly no worse than what you are currently faced with, and if you have the option to convert to a real database later on, the conversion is a snap!

    One of the interesting features is that it allows you to create and drop tables. Ordinarily this is not something DBI supports portably, as it is too database-specific. However, since creating and dropping tables would be tedious to implement by hand with CSV files, this module handles it for you. Read through the documentation, as there are some pitfalls, but I suspect they are much less serious than doing this by hand.
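
    Roughly like this (untested; the paths and table name are made up, and check the DBD::CSV docs for the connect attributes your version takes):

    use strict;
    use warnings;
    use DBI;

    # A directory of CSV files acts as the "database".
    my $dbh = DBI->connect('dbi:CSV:', undef, undef,
        { f_dir => '/www/search', RaiseError => 1 });

    $dbh->do('CREATE TABLE idx (word CHAR(64), doc INTEGER)');
    $dbh->do('INSERT INTO idx VALUES (?, ?)', undef, 'alpha', 0);
    $dbh->do('INSERT INTO idx VALUES (?, ?)', undef, 'alpha', 1);

    my $sth = $dbh->prepare('SELECT doc FROM idx WHERE word = ?');
    $sth->execute('alpha');
    while (my ($doc) = $sth->fetchrow_array) {
        print "alpha appears in document $doc\n";
    }
    $dbh->disconnect;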

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

      I second the suggestion to check out DBD::CSV.

      The idea of being "SQL-like" and having an upgrade path to a "real" SQL database is certainly a plus.

      As to performance, I've found that for reasonably small tables, selecting is no slower than for a real db since there's no overhead of talking to the database. Insert/Update/Delete are murder for large files since I think it reads/writes the entire file for each operation.

      But since the data is in a plain text file, for large update operations (like building the search index) you could probably write the file directly with plain Perl, which is very fast, and then use SQL to read it.

      And finally, when you don't want to do SQL, you can always use vi to "update the table".
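
      For the write-with-Perl, read-with-SQL idea, something along these lines (untested; file and column names are made up):

      use strict;
      use warnings;
      use DBI;

      # Bulk-build the index with plain Perl: one fast sequential write ...
      open my $out, '>', '/www/search/idx' or die "idx: $!";
      print {$out} "word,doc\n";                  # header row with column names
      print {$out} "alpha,0\n", "alpha,1\n", "bravo,2\n";
      close $out;

      # ... then query it through DBD::CSV.
      my $dbh = DBI->connect('dbi:CSV:', undef, undef,
          { f_dir => '/www/search', RaiseError => 1 });
      my $docs = $dbh->selectcol_arrayref(
          'SELECT doc FROM idx WHERE word = ?', undef, 'alpha');
      print "alpha: @$docs\n";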

Re: Poor Person's Database
by nysus (Parson) on Jun 20, 2001 at 09:16 UTC
    Have you considered using the built-in (with Unix) DBM flat-file database method? This allows you to store hashes in a file and would greatly increase the access speed to your data. If you have "Learning Perl" or "Perl Cookbook", read up on this topic. I think it's exactly what you are looking for.
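
    A minimal example with SDBM_File, which ships with Perl (DB_File or GDBM_File can be tied the same way and handle bigger records; the file name is made up):

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    # Tie a hash to an on-disk DBM file: keyword => comma-separated doc list.
    tie my %index, 'SDBM_File', '/www/search/index', O_RDWR|O_CREAT, 0644
        or die "Cannot tie index: $!";

    $index{alpha} = '0,1,2';               # build or update the index
    my @docs = exists $index{bravo}        # direct lookup by key, no scanning
             ? split /,/, $index{bravo}
             : ();
    untie %index;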

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot";
    $nysus = $PM . $MCF;

Re: Poor Person's Database
by perigeeV (Hermit) on Jun 20, 2001 at 15:21 UTC
    See "Designing a Search Engine" on perl.com, from April; the article deals with exactly what you're asking.

    Using flat files is usually the painful consequence of a direct order from above. I agree with the above posts on DBM files. One concern I might point out about using multiple files instead of a few large ones is that your filesystem will waste a good deal of space, since each small file still occupies at least one allocation block.

      That was definitely a relevant article. I had never before noticed Search::Dict. That said, as soon as I looked at Search::Dict I found a bug and realized that the API should allow you to send in a comparison function rather than just assume that its two flags cover all cases people will see. But still it is a nice thing to know about.
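
      For reference, look() is about all there is to the API; roughly (untested, and it assumes the index file is kept sorted on the keyword):

      use strict;
      use warnings;
      use Search::Dict;

      open my $fh, '<', '/www/search/index.csv' or die "index: $!";
      look($fh, 'bravo', 0, 0);     # position at the first line ge 'bravo'
      my $line = <$fh>;
      if (defined $line && $line =~ /^bravo,(.+)/) {
          my @docs = split /,/, $1;
          print "bravo: @docs\n";
      }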

      Bonus marks to the next person who spots the bug. (Yeah, real text files won't tend to hit it, but still I think core modules should get this kind of thing right...)

        Well, there's a bug where, if the search is trying to return the last line in the file and that line has no trailing newline, it'll chop the last character off. Did you mean something else?
Re: Poor Person's Database
by blakem (Monsignor) on Jun 20, 2001 at 10:20 UTC
    I second the advice of using DBM instead of flat files. Flat files tend to get quite messy, don't scale well, and are extremely difficult to maintain.

    However, if you insist on using a roll-your-own flat file system, and you're sure you only need to key on the first word, I do have a suggestion. Instead of using the entire word as a single path (i.e. /www/search/KEYWORD), you might want to take it a step further and use subdirectories based on the first few letters of each keyword. Your structure would then look something like:

    KEYWORD       FILE
    dartboard  => /www/search/d/a/rtboard.dat
    doghouse   => /www/search/d/o/ghouse.dat
    dog        => /www/search/d/o/g.dat
    do         => /www/search/d/o.dat   
                 (note how the suffix avoids colliding with the 'o' directory)
    
    I'm guessing that with a big dataset you'll run into limits on the number of entries in a single directory (I hit that limit once, at around 32,000, on an old version of Linux). This way will speed up access (I think) and help you avoid the OS directory limits.
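
    Something like this would do the mapping (untested; the root directory and the .dat extension are as above):

    use strict;
    use warnings;
    use File::Basename qw(dirname);
    use File::Path qw(mkpath);

    my $root = '/www/search';

    # dartboard => /www/search/d/a/rtboard.dat
    # do        => /www/search/d/o.dat
    sub keyword_path {
        my ($word) = @_;
        return "$root/$word.dat" if length $word == 1;
        return "$root/" . substr($word, 0, 1) . '/'
             . substr($word, 1) . '.dat' if length $word == 2;
        return "$root/" . substr($word, 0, 1) . '/'
             . substr($word, 1, 1) . '/'
             . substr($word, 2) . '.dat';
    }

    my $path = keyword_path('doghouse');   # /www/search/d/o/ghouse.dat
    mkpath(dirname($path));                # create d/o/ on first use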

    Again, I would use DBM if at all possible.

    -Blake

      I second the advice of using DBM instead of flat files. Flat files tend to get quite messy, don't scale well, and are extremely difficult to maintain.

      Been there, done that, got the scars to prove it. Yes, you don't really want to go down this road if you can help it.

      However, if you are brave and insist on this approach, the sub-directory approach is the way to go. I would only recommend naming the subdirectories differently, to wit:

      being => /www/search/b/e/ing.dat
      suing => /www/search/s/u/ing.dat

      versus

      being => /www/search/b/be/being.dat
      suing => /www/search/s/su/suing.dat

      From bitter experience, I can tell you that one day someone will accidentally copy files from one directory to another, and your application will start to behave in an erratic manner that will be hard to track down.


      --
      g r i n d e r
Re: Poor Person's Database
by TheoPetersen (Priest) on Jun 20, 2001 at 16:37 UTC
    By naming each file for its search term you are basically moving the search and lookup logic into the file system. While file systems are pretty good at looking up file names, you're not necessarily saving anything over a non-linear search of one big file.

    To search a file such as you described non-linearly you need to either build an index structure for it (which tells your program, say, what byte to seek to for search terms beginning with the letter 'k') or perform a binary search on the file itself while taking the irregular record structure into account. The latter approach is kind of fun to program; you seek to the middle of the file, cruise forward to the next newline, and then read the next line to see what keyword you are on. Adjust the target forwards or backwards and seek again.

    Note that I said it was fun to program, not blindingly fast :) But it should be faster than a linear search for most data.
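
    A rough sketch of that binary search, assuming the index stays sorted by keyword (untested; the 512-byte window and file name are arbitrary):

    use strict;
    use warnings;

    sub find_docs {
        my ($file, $term) = @_;
        open my $fh, '<', $file or die "$file: $!";

        # Narrow the byte range [$min, $max) by seeking to the midpoint
        # and skipping forward to the next complete line.
        my ($min, $max) = (0, -s $fh);
        while ($max - $min > 512) {
            my $mid = int(($min + $max) / 2);
            seek $fh, $mid, 0;
            <$fh>;                              # discard the partial line
            my $line = <$fh>;
            last unless defined $line;          # fell off the end of the file
            my ($word) = split /,/, $line, 2;
            if ($word lt $term) { $min = $mid } else { $max = $mid }
        }

        # Finish with a short linear scan of the remaining window.
        seek $fh, $min, 0;
        <$fh> if $min > 0;                      # realign to a line boundary
        while (my $line = <$fh>) {
            chomp $line;
            my ($word, @docs) = split /,/, $line;
            return @docs if $word eq $term;
            last if $word gt $term;             # passed where it would be
        }
        return;                                 # not found
    }

    my @docs = find_docs('/www/search/index.csv', 'charlie');
    print "charlie: @docs\n" if @docs;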

Re: Poor Person's Database
by Zaxo (Archbishop) on Jun 20, 2001 at 16:13 UTC

    For an example of an efficient filesystem database, see the paths to distribution tarballs (authors/id/T/TI/TIMB/..., for instance) or most anything else on CPAN.

    After Compline,
    Zaxo

Re: Poor Person's Database
by bikeNomad (Priest) on Jun 20, 2001 at 18:58 UTC
    Not an ad or anything, but Perlfect search might be useful by itself or as an example. It uses DBM files and binary packing of file numbers to save space.

    I have no connection with them other than having clubbed their code with an optimization stick.
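
    The packing idea looks roughly like this (not their actual code; I'm just illustrating with SDBM_File and 32-bit document numbers):

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    tie my %index, 'SDBM_File', '/www/search/packed', O_RDWR|O_CREAT, 0644
        or die "Cannot tie index: $!";

    # Four bytes per document number instead of a comma-separated string.
    $index{alpha} = pack 'N*', 0, 1, 2;
    my @docs = unpack 'N*', $index{alpha};
    print "alpha: @docs\n";
    untie %index;

    Note that SDBM_File has a small per-record size limit, so DB_File copes better with long document lists.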