in reply to Design flat files database

  1. what should work faster for access user dir with id 748332: /74/83/32/748332/ or /7/4/8/3/3/2/748332

    The reasoning behind placing files in directory structures formed by partitioning the name-space is to avoid huge numbers of files in a single directory, which (on some file systems) has to be searched linearly.

    E.g., if your 6-digit ID defines your ID space, then placing 1 million files in a single directory means that, on average, you have to inspect 500,000 entries to find the file you are looking for.

    But if you split that into /xx/yy/zz.dat, each level holds at most 100 entries, so on average you will inspect 50 entries in the first level, 50 in the second and 50 in the final level. 150 inspections vs. 500,000 is a good trade.

    Using (a modified version of) your second schema /p/q/r/x/y/z.dat, it will (on average) be 5 in each of the 6 levels giving 30 inspections.
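
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the linear-search model described here (on average, half of the entries at each level are inspected); the function name `average_inspections` is made up for illustration.

```python
def average_inspections(entries_per_level):
    """Sum the average linear-scan cost (half the entries) over all levels."""
    return sum(n // 2 for n in entries_per_level)

# One flat directory holding 1,000,000 files.
flat = average_inspections([1_000_000])
# /xx/yy/zz.dat: three levels of up to 100 entries each.
two_digit = average_inspections([100, 100, 100])
# /p/q/r/x/y/z.dat: six levels of up to 10 entries each.
one_digit = average_inspections([10] * 6)

print(flat, two_digit, one_digit)  # 500000 150 30
```

The model ignores the cost of opening each extra directory level, which is one reason the deeper scheme needs testing on the real file-system.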

    The latter sounds like a good idea, but in practice the benefits can be outweighed by the complexities. This depends upon the actual file-system in use, and you will need to test to see what works best on your particular file-system.

  2. Files in linux directory are indexed

    Again, this depends upon the file-system in use. AFAIK, ext2/ext3 are not indexed (or hashed), but other *nix file-systems may be.

  3. For example, someone posted a message. What is better: to save all the replies to this message in a single file, or to save each reply in a separate file in a folder created for this message, and gather all the replies from those files when someone views the message?

    Reading between the lines, I'm guessing you're thinking of implementing a message-board type system (not unlike PM).

    If so, the "better" will depend upon many factors:

    • Will replies have their own IDs within the 6-digit ID space?
    • Will replies only be displayed subservient to their parent? Or will they be viewable individually?
    • Are replies to replies possible?

A comment: Your proposed schemas /74/83/32/748332/ & /7/4/8/3/3/2/748332/ both incorporate two levels of redundancy. There is no benefit in this.
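
A minimal sketch of a mapping without the redundant final component, assuming a fixed 6-digit ID space and the .dat suffix used in the examples above. The function name `sharded_path` and its parameters are hypothetical.

```python
def sharded_path(record_id, digits=6, width=2, suffix=".dat"):
    """Map an ID to a path whose components are the ID's own digits."""
    text = str(record_id).zfill(digits)
    if len(text) > digits:
        raise ValueError("ID does not fit in the ID space")
    if digits % width:
        raise ValueError("digits must be a multiple of width")
    # Each chunk of `width` digits becomes one path component; the last
    # chunk names the file, so the full ID is never repeated.
    parts = [text[i:i + width] for i in range(0, digits, width)]
    return "/" + "/".join(parts) + suffix

print(sharded_path(748332))           # /74/83/32.dat
print(sharded_path(748332, width=1))  # /7/4/8/3/3/2.dat
```

The full ID can always be rebuilt by joining the path components, which is why repeating it at the end adds nothing.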

Two questions:

If there is a tutorial or book about flat-file databases, that would be great!

The only paper I ever saw on the subject was an IBM RedBook, but that was 15 or 20 years ago, so my memory of it is vague. You could try searching that site, but I don't have any good keywords to offer you right now. Maybe some will come back to me.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Design flat files database
by AlfaProject (Beadle) on Jul 15, 2011 at 12:07 UTC
    1. I am trying to find the sweet spot between directory access time and the number of levels.
    The folder hierarchy will be created automatically on the fly.
    I mean, if there are only 2374 users, the last user will have the folder 2/3/7/4/2374.
    The same goes for the 2-digit hierarchy: if there are more than a million IDs, another level will be added automatically.
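
A rough sketch of the growing hierarchy described above, where the depth follows the number of digits in the ID. The function name `growing_path` is hypothetical, and the handling of IDs whose length is not a multiple of the chunk width (the last chunk is simply shorter) is an assumption, since the post does not specify it.

```python
def growing_path(user_id, width=1):
    """Build a path with one level per `width` digits, ending in the full ID."""
    digits = str(user_id)
    # Split the ID into chunks of `width` digits, one directory level each.
    chunks = [digits[i:i + width] for i in range(0, len(digits), width)]
    return "/".join(chunks + [digits])

print(growing_path(2374))             # 2/3/7/4/2374
print(growing_path(748332, width=2))  # 74/83/32/748332
```

Because the depth follows each ID's own length, a short ID can land on the path of a longer one (here ID 2 maps to 2/2, which is also a level on the way to 2/2/22), so zero-padding to a fixed width may be worth testing as an alternative.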

    At this point it is not targeted at a specific filesystem, but if the system grows I will move it to a server with a suitable filesystem.

    2. OK, for now I'm using ext3, but I will move to something better if needed.

    3. Yes, I'm building a Facebook/Google+-like online community project.
    I have already finished one project. It works fast, but when I benchmarked it there was a limit on requests per second even though the CPU wasn't at 100% and there were still a few GB of free memory...
    I think it is because of HD I/O limitations. Because of that, I want to understand how to make it super fast.

    q: Will replies have their own IDs within the 6-digit ID space?
    a: Yes, the replies will have a unique ID.

    q: Will replies only be displayed subservient to their parent? Or will they be viewable individually?
    a: There will be a post->replies structure.
    Each post will have a file for its replies.
    They won't be viewable individually, but there may be an option to edit or delete a reply. So if the replies to a specific post are stored in the same file, that file will need to be rewritten whenever someone wants to edit only 1 reply.
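
To illustrate the rewrite cost mentioned above, here is a hedged sketch of editing one reply inside a shared replies file. It assumes one record per line in the form id|time|user|reply_content; the function name `edit_reply` is hypothetical, and writing to a temporary file followed by `os.replace` is just one possible approach.

```python
import os
import tempfile

def edit_reply(path, reply_id, new_content):
    """Rewrite the whole replies file, replacing one record's content."""
    directory = os.path.dirname(os.path.abspath(path))
    changed = False
    # Write the new version next to the original, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out, \
                open(path, encoding="utf-8") as src:
            for line in src:
                rid, time, user, _old = line.rstrip("\n").split("|", 3)
                if rid == reply_id:
                    line = "|".join((rid, time, user, new_content)) + "\n"
                    changed = True
                out.write(line)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return changed
```

Every edit copies every record, so the cost grows with the number of replies in the file, not with the size of the change.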

    q: Are replies to replies possible?
    a: No.

    1-q: Already answered.
    2-q: I have good experience with flat-file databases. I started to read a book about MySQL optimization and at some point I got frustrated.
    It's not as simple as it seems. There are many tricks that you need to know and that only come with experience, and too many options that I don't need.
    I'm just afraid that at some point I will get stuck with it, or that the performance will be poor.

    Thanks for the informative post
      replies will have a unique ID... Each post will have a file for its replies.

      There is a conflict there. If replies have unique IDs, then their bodies will exist in the hierarchy, so what will you store in the "file of replies"?

      My suggestion would be to store symlinks to the replies in the directory holding the body of the parent. That way, you do not have to traverse the hierarchy from the root to find them (as you would if you store a file containing the reply IDs).
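
A small sketch of the symlink idea, assuming a POSIX file-system. The layout (a parent directory 74/83/32 and a reply body at 74/83/33.dat) and the function names `link_reply` and `replies_of` are made up for illustration.

```python
import os
import tempfile

def link_reply(parent_dir, reply_path):
    """Create a symlink inside parent_dir that points at the reply's body."""
    link = os.path.join(parent_dir, os.path.basename(reply_path))
    os.symlink(os.path.abspath(reply_path), link)
    return link

def replies_of(parent_dir):
    """Return the resolved targets of all symlinks found in parent_dir."""
    return sorted(
        os.path.realpath(entry.path)
        for entry in os.scandir(parent_dir)
        if entry.is_symlink()
    )

# Demo in a throwaway directory tree.
root = tempfile.mkdtemp()
parent = os.path.join(root, "74", "83", "32")
reply = os.path.join(root, "74", "83", "33.dat")
os.makedirs(parent)
with open(reply, "w") as fh:
    fh.write("reply body\n")

link = link_reply(parent, reply)
print(replies_of(parent))
```

Listing one directory yields every reply, and opening a link reads the reply body directly.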


        I didn't really understand. What I mean is:
        user dir: us/er/user2/
        posts file: us/er/user2/posts.txt
        replies folder: us/er/user2/replies/
        replies for post 12: us/er/user2/replies/12.txt

        (the txt extension is only an example)
        Inside the posts file, posts are stored as:
        id1|time|post_content\n
        id2|time|post_content\n
        id3|time|post_content\n
        If I want to read the last few posts, I will use a read-backwards module or something similar.
        The replies files have the same structure as the posts file, only with the name of the user who replied added:
        id1|time|user|reply_content\n
        id2|time|user|reply_content\n
        id3|time|user|reply_content\n
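
A hedged sketch of reading the last few records from such a file without loading all of it, in the spirit of a read-backwards module. The function names `parse_reply` and `last_lines` are hypothetical, and the code assumes one newline-terminated record per line with no newlines inside the content.

```python
import os

def parse_reply(line):
    """Split 'id|time|user|reply_content'; the content may itself contain '|'."""
    reply_id, time, user, content = line.rstrip("\n").split("|", 3)
    return {"id": reply_id, "time": time, "user": user, "content": content}

def last_lines(path, count, block_size=4096):
    """Return the last `count` lines of a file, reading blocks from the end."""
    if count <= 0:
        return []
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        data = b""
        # Prepend blocks until the buffer holds more than `count` newlines,
        # which guarantees the last `count` lines are complete.
        while position > 0 and data.count(b"\n") <= count:
            step = min(block_size, position)
            position -= step
            fh.seek(position)
            data = fh.read(step) + data
    return [line.decode("utf-8") for line in data.splitlines()[-count:]]
```

For example, `[parse_reply(line) for line in last_lines("us/er/user2/replies/12.txt", 3)]` would return the three newest replies as dictionaries, oldest first.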

        I just wanted to know which is better: to store it like this, or to make a file for each reply/post?