in reply to Poor Person's Database

I second the advice of using DBM instead of flat files. Flat files tend to get quite messy, don't scale well, and are extremely difficult to maintain.

However, if you insist on a roll-your-own flat-file system, and you're sure that you only need to key on the first word, I do have a suggestion. Instead of using the entire keyword as a directory name (i.e. /www/search/KEYWORD), you might take it a step further and use subdirectories based on the first few letters of each keyword. Your structure would then look something like:

KEYWORD       FILE
dartboard  => /www/search/d/a/rtboard.dat
doghouse   => /www/search/d/o/ghouse.dat
dog        => /www/search/d/o/g.dat
do         => /www/search/d/o.dat   
             (note how the suffix avoids colliding with the 'o' directory)

I'm guessing that with a big dataset you'll run into the limit on the number of entries in a single directory. (I once hit that limit, at 32,000, on an old version of Linux.) Splitting things up this way should speed up access (I think) and help you avoid the OS directory limits.
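
For what it's worth, the mapping in the table above could be sketched roughly like this in Python. The function name keyword_path and the depth parameter are made up for illustration, and the handling of one-letter keywords is an assumption, since the table only goes down to two letters. Keywords containing '/' or other awkward characters are not dealt with here.

```python
import posixpath

ROOT = "/www/search"  # root directory taken from the examples above


def keyword_path(keyword, root=ROOT, depth=2):
    """Map a keyword to its data file, using its leading letters as directories.

    Hypothetical helper: at most `depth` leading letters become directory
    levels, but at least one letter is always kept for the file name, so
    'do' becomes d/o.dat rather than d/o/.dat.
    """
    if not keyword:
        raise ValueError("keyword must not be empty")
    levels = min(depth, len(keyword) - 1)  # keep one letter for the file name
    dirs = list(keyword[:levels])          # one directory per leading letter
    return posixpath.join(root, *dirs, keyword[levels:] + ".dat")


if __name__ == "__main__":
    for word in ("dartboard", "doghouse", "dog", "do"):
        print(f"{word:<10} => {keyword_path(word)}")
```

The ".dat" suffix matters: without it, the file for 'do' would need the same name as the 'o' directory that holds 'dog'.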

Again, I would use DBM if at all possible.
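
To give an idea of how little code the DBM route takes, here is a minimal sketch using Python's standard dbm module. The helper names store_keywords and lookup, the file name, and the sample data are all invented for illustration; which DBM backend you actually get depends on what is installed on your system.

```python
import dbm
import os
import tempfile


def store_keywords(db_path, entries):
    """Write keyword -> data pairs into a DBM file, creating it if needed."""
    with dbm.open(db_path, "c") as db:
        for keyword, data in entries.items():
            # DBM files store bytes, so encode both key and value.
            db[keyword.encode("utf-8")] = data.encode("utf-8")


def lookup(db_path, keyword):
    """Return the data stored for a keyword, or None if it is absent."""
    with dbm.open(db_path, "r") as db:
        value = db.get(keyword.encode("utf-8"))
    return None if value is None else value.decode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample data, written to a throwaway directory.
    path = os.path.join(tempfile.mkdtemp(), "search")
    store_keywords(path, {"dog": "kennel", "dartboard": "pub"})
    print(lookup(path, "dog"))
    print(lookup(path, "cat"))
```

One file (or a small handful, depending on the backend) replaces the whole directory tree, and the per-directory entry limit stops being your problem.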

-Blake

Re: Re: Poor Person's Database
by grinder (Bishop) on Jun 20, 2001 at 11:38 UTC
    I second the advice of using DBM instead of flat files. Flat files tend to get quite messy, don't scale well, and are extremely difficult to maintain.

    Been there, done that, got the scars to prove it. Yes, you don't really want to go down this road if you can help it.

    However, if you are brave and insist on this approach, subdirectories are the way to go. I would only recommend naming the subdirectories and files differently. To wit, instead of:

    being => /www/search/b/e/ing.dat
    suing => /www/search/s/u/ing.dat

    use:

    being => /www/search/b/be/being.dat
    suing => /www/search/s/su/suing.dat

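    From bitter experience, I can tell you that one day someone will accidentally copy files from one directory to another, and your application will start to behave erratically in a way that is hard to track down. When every file carries its full keyword, a misplaced file at least cannot silently overwrite, or pass for, a different keyword's file.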
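
As a rough illustration of the full-name layout (being => /www/search/b/be/being.dat), here is a Python sketch. The names prefixed_path and is_misplaced are hypothetical, and the check assumes every data file ends in ".dat" as in the examples above.

```python
import posixpath

ROOT = "/www/search"  # root directory taken from the examples above


def prefixed_path(keyword, root=ROOT):
    """Build the path where each level repeats the keyword's prefix."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    return posixpath.join(root, keyword[:1], keyword[:2], keyword + ".dat")


def is_misplaced(path, root=ROOT):
    """Report whether a data file sits somewhere other than its proper home.

    This works only because the file name carries the whole keyword, so the
    expected location can be recomputed from the name alone.
    """
    keyword, _ext = posixpath.splitext(posixpath.basename(path))
    return path != prefixed_path(keyword, root)


if __name__ == "__main__":
    print(prefixed_path("being"))
    print(is_misplaced("/www/search/s/su/being.dat"))
```

With the shorter names, a stray ing.dat gives no hint of which keyword it belongs to, so a check like is_misplaced would have nothing to go on.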


    --
    g r i n d e r