PerlMonks

(OT) should i limit number of files in a directory

by leocharre (Priest)
on Sep 11, 2008 at 15:50 UTC

leocharre has asked for the wisdom of the Perl Monks concerning the following question:

I have a system that checks whether a file exists; if not, the file is created.
Easy enough.
Now, the file count will be at least 100,000 and potentially 3 million within 12 months. Every *filename* is an MD5 hex digest, so it is 32 characters long, each character being one of 16 possible values.

Space is not an issue here. These are small text files. I'm on GNU/Linux using ext3 partitions.

I'm considering whether I should use a hack to keep the per-directory file counts to a minimum.
For example, PAUSE does this with http://backpan.perl.org/authors/id/L/LE/LEOCHARRE/, notice L/LE/LEOCHARRE (yes, that is a directory not a file, let's be flexible here).

So if the file in question (that I will read or create) is named 'opuscows', it would really reside in either o/op/opuscows (or, more interestingly, op/us/cows; then the first dir would have 256 entries, and every level below it would also have another 256 entries (16^2 for each two-character xx/ component)).

This would help keep my directory entries lower than, say, 3 million.

This hack will slow down lookups and writes a little.

But maybe this is not needed. I will not be searching for files, or doing a dir listing operation. The file is there or not.

Is there a limit to how many files I should have in such a directory? I read that "There is a limit of 31998 sub-directories per one directory,..." - but this makes no mention of plain files.

Please excuse my broken up discussion.

update

After the discussion in this thread, I am using MySQL to serve the data instead of the regular filesystem.

I had some text entries that were larger than 1 MB, which caused a problem for me at first: the default maximum packet size for MySQL is 1 MB. You must raise max_allowed_packet in your MySQL config file, likely /etc/my.cnf; add a line like 'max_allowed_packet=5M' (for example) and restart the server.
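
A minimal sketch of that setup with DBI, for anyone following along - the database, table and column names are made up for illustration, and DBD::mysql is assumed:

    # Hypothetical table: CREATE TABLE docs (md5 CHAR(32) PRIMARY KEY, body MEDIUMTEXT);
    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    my $dbh = DBI->connect('dbi:mysql:database=docstore', 'user', 'pass',
                           { RaiseError => 1, AutoCommit => 1 });

    sub store_doc {
        my ($text) = @_;
        my $sum = md5_hex($text);
        # INSERT IGNORE skips the row if this digest is already stored
        $dbh->do('INSERT IGNORE INTO docs (md5, body) VALUES (?, ?)',
                 undef, $sum, $text);
        return $sum;
    }

    sub fetch_doc {
        my ($sum) = @_;
        my ($text) = $dbh->selectrow_array(
            'SELECT body FROM docs WHERE md5 = ?', undef, $sum);
        return $text;
    }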

Replies are listed 'Best First'.
Re: (OT) should i limit number of files in a directory
by merlyn (Sage) on Sep 11, 2008 at 15:57 UTC
    I wonder why you have 3 million files.

    Do you really need each of those chunks of data to have a name accessible to all other applications, plus the last-access, last-modified, and inode-changed timestamps, and the permissions and ownership maintained by the operating system?

    Or perhaps, what you should do instead is create a database that stores only the metadata you need for each item, along with the item itself.

    PostgreSQL's "binary" columns should handle any data that you might stick into a file, and will scale and replicate nicely. And with a unique index on the data column, you won't even need to "MD5" the data to ensure only one version... just try to insert it, and if it fails, it's already there. Nice atomic test.
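
    A rough sketch of that "just try to insert it" idea - the schema and connection details below are purely illustrative, and DBD::Pg is assumed:

        # Hypothetical schema: CREATE TABLE blobs (id SERIAL PRIMARY KEY, data BYTEA UNIQUE);
        use strict;
        use warnings;
        use DBI;
        use DBD::Pg qw(:pg_types);

        my $dbh = DBI->connect('dbi:Pg:dbname=docstore', 'user', 'pass',
                               { RaiseError => 0, PrintError => 0, AutoCommit => 1 });

        sub insert_unless_present {
            my ($data) = @_;
            my $sth = $dbh->prepare('INSERT INTO blobs (data) VALUES (?)');
            $sth->bind_param(1, $data, { pg_type => PG_BYTEA });
            # The unique index is the existence test: a failed insert
            # means the data is already stored.
            return $sth->execute ? 'inserted' : 'already there';
        }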

      Ah! Very interesting point.. the extra inode metadata! I didn't even consider that waste.. This is something that could indeed amount to something with that many data chunks!

      Some of it could be useful - mtime, atime, ctime could have uses.. indeed, for trash collection and update purposes - possibly.

      I have 3 million files because I'm tracking documents in an office environment- for a buncha users who do bureaucratic work fo tha man. So.. there are a lot of freaking pdfs, docs, "excel spreddshits", hard copy doc scans... etc. A lot.

      I'm indexing everything about everything, running ocr and deet on every speck of junk here... Makes everybody's life easier and way more interesting.

Re: (OT) should i limit number of files in a directory
by kyle (Abbot) on Sep 11, 2008 at 16:12 UTC

    If it were me, I'd put them in directories based on the hashes. I'd put "d41d8cd98f00b204e9800998ecf8427e" in "d/4/1/d/8/d41d8cd98f00b204e9800998ecf8427e" (for example). At five levels deep, each leaf directory would have an average of three files in it (for three million files), so maybe you want just four levels with an average of 45 files each. The deeper you go, the more room to grow.
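
    A quick sketch of that layout in Perl (the sub name is arbitrary, the depth is a parameter, and File::Spec is core):

        use strict;
        use warnings;
        use File::Spec;

        sub hashed_path {
            my ($digest, $depth) = @_;
            $depth ||= 4;
            my @levels = split //, substr($digest, 0, $depth);
            return File::Spec->catfile(@levels, $digest);
        }

        print hashed_path('d41d8cd98f00b204e9800998ecf8427e', 5), "\n";
        # prints d/4/1/d/8/d41d8cd98f00b204e9800998ecf8427e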

    I find it hard to believe you'll never do a directory listing. Eventually someone will do one by accident. We had a Linux machine where I work brought to its knees by an 'ls' in a directory with too many files. We thought it had died completely, but it eventually came back.

    It's possible that ext3 doesn't have this problem (I don't know), but on some filesystems even a check for existence involves a brute force search through the contents of the directory.

    Having looked just now, I see there's an option for 'mke2fs' called "dir_index" which "uses hashed b-trees to speed up lookups in large directories." Also, a "tune2fs -l /dev/sda1" tells me that my filesystem has this feature even though I don't recall asking for it. Maybe it's the default. It might be worth your while to look.

      Yes, doing ls on ext3 slows stuff down- still. I've not seen a slowdown that looks like a crash- but.. I have seen a pause.

      I keep my sshd running just in case of stuff like that.

      Your multilevel system makes more sense. I had the idea that if I had, say, two levels, I would pre-create 256 dirs at level 1, then 256 more under each one of those, so that all the possible dirs would already exist. Helping the system along, so I wouldn't have to check whether the target absolute location is there or not.
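
      Something like that can be pre-created up front with core File::Path - a sketch, with a purely illustrative root path:

          use strict;
          use warnings;
          use File::Path qw(make_path);

          my $root = '/var/docstore';                 # illustrative root
          my @hex  = (0 .. 9, 'a' .. 'f');
          my @pair = map { my $a = $_; map { "$a$_" } @hex } @hex;   # '00' .. 'ff'

          for my $top (@pair) {
              make_path(map { "$root/$top/$_" } @pair);   # 256 * 256 leaf dirs
          }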

      (I'm so glad I asked about this- really impressed by the responses and ideas.)

      Hm... may be time to start scripting some .t s .. :-)

Re: (OT) should i limit number of files in a directory
by RMGir (Prior) on Sep 11, 2008 at 16:13 UTC
    merlyn's making a good point. This really sounds like a job for a database.

    But if you MUST use the filesystem, then yes, you'll definitely need to do something multilevel. Any operation on a directory tends to suffer badly when the file count gets high, and "high" in this context means tens of thousands, not millions.

    I'd strongly suggest NOT doing the "op/us/cows" "optimization". If the full filename is in the leaves, a lot of operations get simpler (since you don't need to remember the path to the file).

    And if your filesystem ever got corrupted, you'd never be able to recover - you might be left with an orphaned directory full of files named "cows","goats",etc... with no way of knowing that they belong under op/us. With full filenames, you can survive "mid-tree" corruption without issues, assuming fsck rescues the orphaned data.


    Mike

      Thank you for the advice on not storing '27f1f49c9d06b5725abff58587d68b05' as '27/f1/f49c9d06b5725abff58587d68b05' - It was a cute and clever idea- but then so was seeing what would happen if I stuck two copper wires into an electric socket when I was 3 years old.

      Very helpful insight- really kept me from doing something stupid! Thank you!

Re: (OT) should i limit number of files in a directory
by tilly (Archbishop) on Sep 11, 2008 at 16:15 UTC
    The reason that PAUSE does that is that many filesystems use some variation on scanning a linked list for the directory entries. Therefore you really want to avoid having a single directory with hundreds of thousands of files in it.

    However you say you are on ext3. That filesystem uses an HTree (a hashed B-tree) for large directories, so internally it is already doing what you'd be trying to do.

    That said, merlyn is right. There is a lot of hidden overhead to having a small file in the filesystem. If you just want to record the existence of an MD5 hex digest, that is a perfect application for a database, BerkeleyDB, or DBM::Deep.

      I do have a database keeping track of sums and using ids.

      I am not using this system merely to check existence. The files actually hold something. Data that does not belong in a database, as it is.

      It makes sense what merlyn and others said about storing in a database.
      Let's not forget that the filesystem *is* a form of database system. It's a data storage discipline.
      Some things are more appropriate on a fs than on a db server.

      A million text files ranging in size from 1k to 486k etc.. would probably cripple a db system - it's too much variation.. maybe I'm wrong about that.

      There's no searching, no comparing, the size of each element is wildly varied... It feels like a fs thing..

        In your original post you said that you were just using the filename to check existence. If it has data, then a file is more reasonable. However I would still suggest looking at something like DB_File's interface to Berkeley DB.

        That's designed to store data of exactly this type. Its data limits are 4 GB per entry, and 256 terabytes for the entire dataset.
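
        A minimal sketch of the DB_File route - the database file path is illustrative; DB_File wraps Berkeley DB:

            use strict;
            use warnings;
            use DB_File;
            use Fcntl qw(O_RDWR O_CREAT);

            # Tie a hash to a Berkeley DB btree file; keys are the md5 digests.
            tie my %docs, 'DB_File', '/var/docstore/docs.db',
                O_RDWR | O_CREAT, 0644, $DB_BTREE
                or die "Cannot open docs.db: $!";

            my $sum  = '27f1f49c9d06b5725abff58587d68b05';
            my $text = "...extracted document text...";

            $docs{$sum} = $text unless exists $docs{$sum};   # store once
            my $fetched = $docs{$sum};                       # read back

            untie %docs;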

        If you want to store the data on one system and use it on another, then you might want to move up to a database. Sure, there are things like NFS. But if someone goes innocently looking at a directory like that using standard tools over a networked filesystem, you'll be putting everything through an "interesting" stress test. Plus, even though it works today on ext3, that's no guarantee that in 2 years someone won't migrate the data to another system without understanding that that directory really, really needs to be on a specific filesystem.

        While I agree that there are things that belong on filesystems, this feels to me like something that would be happier not living on a filesystem. But if you put it there, then I'm going to suggest that your disks will be happier if you turn off maintenance of last access time in that directory. That information is almost never used, and maintaining it causes every read of a file to trigger a write to disk. If you're under load this can be a significant source of overhead.

        I still think merlyn is right -- a db is the way to go. blobs are not the most elegant/efficient mechanisms, but they are very easy to find based on a key. As long as your blobs stay below about 1MB, mysql or postgres should be fine. Trying to find a single file in a directory hierarchy of millions of entries is going to suffer significantly worse performance.
Re: (OT) should i limit number of files in a directory
by dsheroh (Monsignor) on Sep 11, 2008 at 16:42 UTC
    You seem fairly convinced that a database is not the right answer here. I'll take your word for it.

    You'll definitely need to go multilevel for that many entries, even if no (human) user ever enters or does an 'ls' on the directory. Opening a file in the directory still requires finding the file, and doing four searches through a few thousand entries each is going to be a hell of a lot faster than doing one search through a couple million, unless the names are indexed specifically for searches in a way that filesystems generally don't do.

    Also, since it hasn't been mentioned yet, and I realize you may have thought of this already, but... inodes. Unless you've specifically tuned your fs to have a higher-than-default inode density, it may not be able to support 3 million files regardless of how large or small those files may be or how they're organized. 'df -i' will tell you how many inodes the filesystem has. (Why, yes, I have had a print server grind to a halt, claiming the fs was full when 2/3 of the space was unused. How did you guess? CUPS had been forgetting to clean up after itself and consumed all available inodes with 0-byte files.)

      Holy cow... I'm seeing things like .. 8 million inodes, 4 million.. that's not a lot..

      I'm starting to reconsider my db/fs stance. It *would* make a bunch of other stuff easier to use mysql- like, querying across the network. It seemed like a low class thing to do- storing all those text files in a db... hmm. I think I could keep them under 1 meg each.

      Eek.. If you'll excuse me.. I think I'm gonna go ask about the print server...

Re: (OT) should i limit number of files in a directory
by tirwhan (Abbot) on Sep 11, 2008 at 16:54 UTC

    Lots of valid points have been brought up above (by merlyn et al); I'll just add my two cents on something no-one has touched on yet.

    When creating an ext3 filesystem, mkfs.ext3 reserves a fixed number of inodes (calculated from the filesystem size and block size). That is the limit on the number of files you can have on this filesystem. If the partition you are on is very large this won't be an issue for you, but you should probably check whether you're in danger of reaching that limit; execute "df -i" for that. The only way to increase the number of available inodes is to recreate the filesystem (with mkfs.ext3 -N <number-of-inodes>); you can't change it on an existing filesystem.

    ReiserFS (v3) is a lot better than ext3 at handling small files (wastes less space and is a lot faster), and it also does not have the inode limit problem, so you might want to try using a Reiser partition if you want to go on with the system as it is.


    All dogma is stupid.
Re: (OT) should i limit number of files in a directory
by mr_mischief (Monsignor) on Sep 11, 2008 at 16:31 UTC
    If you need to do this in the filesystem, which merlyn makes good points against, then, as RMGir says, you definitely don't want to steal characters from the filenames.

    You probably want to use "/op/opus/opusco/opuscows" rather than just "/op/us/co/opuscows", for the same reasons. It's possible to rebuild the whole directory system based on just the file names as long as the file names are intact, but the directory names being salvageable as well will help when something goes wrong.

    Is there any data in these files? Are these hash-named files the files you're checking for existence, or are you using them to track the existence of other files? 16**32 is much larger than 3 million, so I'm guessing these are hashes of other files.

    Is this a tracking system to see if files have been inserted into a document management system? If so, you'll have issues if the documents are editable because the MD5 sum will change. You'd have to delete the hash for the old version before the edit starts and recreate it after the edit every time. It might be easier to store both the document and the hash for it in a database if you're doing something like that.

      There is data in these files, yes. Some have a little bit (1k), some a lot (up to maybe 400k.. not much more).

      I am not using them to keep track of the existence of other files.

      The data itself is of interest, in regard to the hash / the digest / the filename of this metadata (with no structure).

      "Is this a tracking sys..." Yes and no. I am *expecting* for the document names (locations), hosts, and data to change. If that happens, then that document is no longer the same document- it's irrelevant. ( I know.. that's a very concise summary of what's up- That discussion is a very large and involved one. )

Re: (OT) should i limit number of files in a directory
by BrowserUk (Patriarch) on Sep 11, 2008 at 19:00 UTC

    Storing large lumps of binary data in an RDBMS makes no sense. You cannot use relational logic on it; it just takes more space on the filesystem; and takes longer to access.

    From the experience of a project a few years ago, using a 3-level-deep filesystem hierarchy keyed on MD5 checksums distributes the files very evenly. That is almost guaranteed by the very nature of MD5. For your project, you would end up with 4096 directories with roughly 750 files in each.

    We had tens of millions of files in a 4-deep hierarchy (on a Linux system) and lookup was fast and reliable. Loading the data was the same. And the time it takes to produce the path from a given MD5 is negligible.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: (OT) should i limit number of files in a directory
by jdrago_999 (Hermit) on Sep 11, 2008 at 17:47 UTC

    I've done exactly what you are looking to do (several times).

    Say your filename is 45624a44b89793087e9ef4d076018adb. Under /var/media (or whatever) make a folder like 4/5/6 and place 45624a44b89793087e9ef4d076018adb under that.

    You end up with lots of folders.

    Even better, make a folder named 45/62/4a and place your file in that. URI-to-Disk resolution is a snap.

        my $ROOT_PATH = '/var/media/';
        # Take the first three character pairs of the digest, skipping any leading '/'
        my $folder = $ROOT_PATH . join '/', $r->uri =~ m{^/?(..)(..)(..)};
        # Folder is '/var/media/45/62/4a'

    Taken further, you end up with 256^3 folders. That should be plenty (16.7 Million different folders in which to place your files!).

Re: (OT) should i limit number of files in a directory
by shmem (Chancellor) on Sep 11, 2008 at 20:12 UTC
    Is there a limit to how many files I should have in such a directory?

    There certainly is a limit, but before you ever hit the limit you will hit a slowdown in filename lookup caused by too many levels of indirect blocks in the directory structure which holds the file names. But that's not the main point.

    You definitely want to keep the count much lower, just in case you ever want to get rid of that directory. See Re^5: greater efficiency required (ls, glob, or readdir?)

Re: (OT) should i limit number of files in a directory
by MidLifeXis (Monsignor) on Sep 11, 2008 at 21:07 UTC
Re: (OT) should i limit number of files in a directory
by graff (Chancellor) on Sep 12, 2008 at 05:47 UTC
    Given that you have a database keeping track of things (one record per file, right?), and you have data that belongs in files, another idea to consider is: just concatenate your current "file-unit" data chunks into a smaller number of larger files. The database record for each unit can take a couple extra columns to hold the byte offset info (start and end, or start and length).

    When you are storing each data unit to disk, just append to an existing file until that file reaches some maximum reasonable size, and keep track of the file name and byte offsets for that unit. Once one file gets big enough (one or two gigs would be good), start a new one. This could get a bit less simple if you have multiple processes or threads writing different data units at the same time, but it won't be that much harder -- just set up a way of apportioning or assigning output files to each process (that's where directory trees would be handy).

    You'll be using a lot fewer inodes and your directories will be smaller. When you go to fetch data back from the files, there will be less filesystem navigation, fewer file open/close operations on average, and more use of seek(), which would be a Good Thing.
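
    A rough sketch of the append-and-remember-offsets idea - the sub names are made up, and a real version would also record which archive file each chunk went into:

        use strict;
        use warnings;
        use Fcntl qw(:flock :seek);

        sub append_chunk {
            my ($archive, $data) = @_;
            open my $fh, '>>', $archive or die "append $archive: $!";
            binmode $fh;
            flock $fh, LOCK_EX or die "lock $archive: $!";
            seek $fh, 0, SEEK_END;
            my $offset = tell $fh;              # where this chunk starts
            print {$fh} $data;
            close $fh or die "close $archive: $!";
            return ($offset, length $data);     # store these in the database record
        }

        sub read_chunk {
            my ($archive, $offset, $length) = @_;
            open my $fh, '<', $archive or die "read $archive: $!";
            binmode $fh;
            seek $fh, $offset, SEEK_SET or die "seek $archive: $!";
            read $fh, my $buf, $length;
            return $buf;
        }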

    (Update: it would be prudent to worry about the risk of lost or corrupted byte offset info, so you might want to supplement that with some sort of distinctive record delimiter as part of the concatenation process -- but this would depend on your data: how confident can you be about coming up with some sort of pattern that you know will never occur as data within a given record (leading to a "false-alarm" boundary)? If you can be completely confident about that, then there won't be any problem. It could be something as simple/silly as a 128-byte record of the even values 00-FE in ascending order.)

    (Second and final update: Bear in mind that the above idea really just amounts to implementing your own little BLOB attachment on your existing database. If you are already using a database that doesn't have good BLOB support -- and if there's inertia that disfavors changing the DB server -- then concatenating files is not such a bad fall-back approach. But actually, I'd go with merlyn's advice on this, if you happen to have the DB for it.)

    When you say:

    I will not be searching for files, or doing a dir listing operation.

    Well, maybe you personally won't be doing that, but what about everybody else? (Like maybe the nightly backup job? Your system has one of those, doesn't it?) There tend to be a fair number of routine sysadmin tasks that involve traversing whatever directory tree you assemble, and this will usually involve "find" and other tools that are surprisingly bad at scaling up beyond a certain order of magnitude, especially when it comes to the number of file entries in a single directory. I've seen it happen, and I assure you, you do not want to go there.

Re: (OT) should i limit number of files in a directory
by ohcamacj (Beadle) on Sep 13, 2008 at 04:08 UTC
    Regardless of how easy or hard it would be to store 3 million files in one directory on an ext3 filesystem, it would certainly be possible on an XFS filesystem.

    XFS has been designed from the ground up for massive scalability, and has been in the mainline linux kernel since 2.4.25 or so.

    XFS has dynamically allocated inodes, so running out of inodes is never a problem.

    I have personally used XFS as the root filesystem on my home computers for years, and never had a single problem with it.
Re: (OT) should i limit number of files in a directory
by DrHyde (Prior) on Sep 12, 2008 at 10:13 UTC

    Does it matter if you have 3 million files in the directory? Have you actually tested to see if it'll be a problem?

    With three million files, ls will be slow, but on a reasonable filesystem I would expect file access speed to be barely distinguishable from that in a directory with only a thousand files.

    As an example, I have a directory here which has tappity-tap 110,000 files. It takes a few seconds to ls. But statting a particular file is as near as damnit instantaneous.
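
    If you want numbers for your own setup, a crude test is easy to knock together - this writes a lot of files, so point it at a disposable directory (and note that a freshly created file will be in the dentry cache, so a fairer test would stat after a remount or cache drop):

        use strict;
        use warnings;
        use Time::HiRes qw(time);

        my $dir = '/tmp/manyfiles';          # disposable test directory
        mkdir $dir unless -d $dir;

        for my $i (1 .. 100_000) {
            open my $fh, '>', "$dir/file_$i" or die "create: $!";
            close $fh;
        }

        my $t0 = time;
        my @st = stat "$dir/file_54321" or die "stat: $!";
        printf "stat of one file among 100,000 took %.6f seconds\n", time - $t0;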

Re: (OT) should i limit number of files in a directory
by sgt (Deacon) on Sep 12, 2008 at 12:21 UTC

    Hi Leo,

    I would use a simple n-level-deep scheme which consists of a rootdir with subdirs a-z, where each contains dirs a-z (repeated n times), plus an index (plain file or dbm). As many have noted, such a scheme transforms the usual linear search in a directory into something closer to a binary search.

    An iterator would give the dir-part a/a/b, a/a/c, ..., a/a/z, a/b/a, ... and start over when the list is exhausted; this way it is easier to keep the entries (almost) equally distributed, especially if a few processes are writing concurrently to your (virtual) filesystem.
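
    A sketch of such an iterator (the sub name is made up; depth is a parameter, and the counters wrap around when the last directory is reached):

        use strict;
        use warnings;

        sub make_dir_iterator {
            my ($depth) = @_;
            my @letters = ('a' .. 'z');
            my @idx     = (0) x $depth;
            return sub {
                my $dir = join '/', @letters[@idx];
                # increment the rightmost counter, carrying leftwards
                for (my $i = $depth - 1; $i >= 0; $i--) {
                    last if ++$idx[$i] < @letters;
                    $idx[$i] = 0;
                }
                return $dir;
            };
        }

        my $next = make_dir_iterator(3);
        print $next->(), "\n" for 1 .. 3;    # a/a/a, a/a/b, a/a/c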

    The index key would be the name of the file. Additional meta-info can be attached easily.

    For example:

  • key 7644ebf125065a6c220dcd35b5190e57 => a/a/b/7644ebf125065a6c220dcd35b5190e57, a/a/b/7644ebf125065a6c220dcd35b5190e57.1st_pass, etc.

    cheers --stephan
