Marcello has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a directory which can contain over 10,000 files, and files are added constantly. What I want is to get the oldest file (first created), which I then process and remove. For performance reasons, I do NOT want to use @files = readdir(); because this slows down my program enormously with 10,000 files to process every time. I now use readdir to read the first file and process it. It appears that this is always the latest created file, so LIFO (Last In, First Out). But I want FIFO (First In, First Out). Is there a fast way to get the oldest file in a directory?

TIA

Grtz Marcello


Replies are listed 'Best First'.
Re: How to get the oldest file in a directory without reading all files?
by grinder (Bishop) on Oct 17, 2001 at 00:31 UTC

    In the general case, you have no choice but to read (and stat) all the files in the directory, remembering the oldest one you have seen, in order to find the oldest file.

    Without relying on a lot of implicit assumptions on how directory slots are allocated and freed in the face of creating and deleting files, you are not going to be able to produce anything robust.

    How often do files get created? How often do you need to fetch the oldest file? Maybe you could get away with scanning the whole mess once every minute, and create a symbolic link from the oldest file to a file of a fixed name. That way you just have to open 'oldest.file'. If you are on a braindead operating system that does not implement symbolic links, you can emulate them by opening 'oldest.file' and writing the name of the oldest file that the scan turned up.
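    A minimal sketch of that scan-and-link idea, assuming the directory and link names shown (both are placeholders for illustration):

    ```perl
    use strict;
    use warnings;
    use File::Spec;

    # Scan a directory once and return the path of the oldest entry,
    # judged by mtime. Called periodically (e.g. once a minute).
    sub find_oldest {
        my ($dir) = @_;
        opendir my $dh, $dir or die "opendir $dir: $!";
        my ($oldest_path, $oldest_mtime);
        while (defined(my $entry = readdir $dh)) {
            next if $entry eq '.' or $entry eq '..';
            my $path  = File::Spec->catfile($dir, $entry);
            my $mtime = (stat $path)[9];
            next unless defined $mtime;    # entry may vanish mid-scan
            if (!defined $oldest_mtime or $mtime < $oldest_mtime) {
                ($oldest_path, $oldest_mtime) = ($path, $mtime);
            }
        }
        closedir $dh;
        return $oldest_path;
    }

    # Point a fixed-name symlink at whatever the scan turned up,
    # so consumers only ever open 'oldest.file'.
    my $spool = '/var/spool/incoming';     # example path, an assumption
    if (-d $spool) {
        my $oldest = find_oldest($spool);
        if (defined $oldest) {
            my $link = File::Spec->catfile($spool, 'oldest.file');
            unlink $link;                  # replace the previous link
            symlink $oldest, $link or die "symlink: $!";
        }
    }
    ```

    Consumers then resolve 'oldest.file' instead of scanning, at the cost of the link being up to one scan interval stale.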

    Another idea would be to cache the work. Once you have read the 10 000 files, write out the epoch time and the file name of each into a file named 'age.cache'. Each time new files come along, check whether the oldest files are still around, drop them from the cache if they are not, and append the newest files onto the end. That way, when you need to find the oldest file, it's the first record in the cache. On second thoughts, this would be a nightmare to get to run reliably.

    <update> thinking some more about this question last night led to the following point: Unix and NT systems will update the last-modified date of a directory each time a file is created or deleted. This allows you to have a dirty-bit flag, so you at least know whether anything has changed since the last time you looked at the files. But note that under NT (and I'm talking NT 4 here), this behaviour is configurable in the kernel. You can choose to turn this off if you want. I have an NT server at work that runs under a crushing load, and this is one of the speed optimisations I made. But you probably know if you did such a thing, and it's easy to test whether that is the case.
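    The dirty-bit check can be sketched in a few lines: compare the directory's own mtime with the value remembered from the previous scan (how you persist that value between runs is up to you).

    ```perl
    use strict;
    use warnings;

    # A directory's mtime changes whenever a file is created or deleted
    # in it (on Unix, and on NT unless that update has been disabled),
    # so this tells you whether a full rescan is needed at all.
    sub needs_rescan {
        my ($dir, $last_seen_mtime) = @_;
        my $dir_mtime = (stat $dir)[9];
        die "stat $dir: $!" unless defined $dir_mtime;
        return $dir_mtime > $last_seen_mtime;
    }
    ```

    Note this only says *something* changed; it cannot tell you whether the oldest file is still the oldest.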

    Also know that you'll have less of a performance hit (read: memory spike) if instead of doing my @files = readdir(DIR) you do something like:

        my $oldest      = time;
        my $oldest_file = undef;
        while ( defined( my $file = readdir(DIR) ) ) {
            next if $file eq '.' or $file eq '..';
            # NB: stat resolves names relative to the cwd, so chdir to the
            # directory first, or prefix the directory name
            # (use File::Spec for extra portability)
            my $age = (stat $file)[9];
            if ( $age < $oldest ) {
                $oldest      = $age;
                $oldest_file = $file;
            }
        }

    That is, loop through entry by entry rather than sucking the 10 000 entries in one hit, to reduce your memory footprint.</update>

    Above all, note that having 10 000 files in a single directory is a pretty bad idea, and one that should be avoided at all costs. You should try saving files out into separate directories, based on the age of the file. If you divided the epoch time by 21600, you would be adding four new directories per day (one every six hours). Right now, the directory name would be 46447. You would then only have to go to the lowest-numbered directory and search within it, thereby drastically reducing the number of files you would have to stat.
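    The bucketing arithmetic above can be sketched as a one-line helper (the base path and the idea of sorting bucket names numerically are how I'd wire it up, not something from the original post):

    ```perl
    use strict;
    use warnings;

    # Six-hour buckets: the directory name is floor(epoch / 21600),
    # so four new bucket directories appear per day.
    sub bucket_name {
        my ($epoch) = @_;
        return int($epoch / 21600);
    }

    # Writers put new files into bucket_name(time()). To find the oldest
    # file, list the bucket directories, sort them numerically, and scan
    # only the lowest-numbered non-empty one.
    ```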

    Otherwise, get a database.

    --
    g r i n d e r
Re: How to get the oldest file in a directory without reading all files?
by stefp (Vicar) on Oct 17, 2001 at 02:33 UTC
    First, most filesystems don't handle large numbers of files in a directory well. Some do, like reiserfs on Linux.

    If you have control over the process that adds files to your directory, you can name the files so that the first in lexicographic order is the oldest. With a smart filesystem like reiserfs, the time spent finding a file will be O(log(n)) rather than O(n).
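    A minimal sketch of such a naming scheme, assuming you control the writer (the '.job' suffix and the tie-breaking sequence number are illustrative choices):

    ```perl
    use strict;
    use warnings;

    # Zero-pad the epoch time, plus a per-process sequence number to
    # break ties within the same second, so that plain lexicographic
    # order on the names equals age order.
    my $seq = 0;
    sub next_name {
        my ($epoch) = @_;
        return sprintf '%010d-%06d.job', $epoch, $seq++;
    }

    # The oldest file is then simply the first name in sorted order:
    #   my ($oldest) = sort grep { !/^\./ } readdir $dh;
    ```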

    I stress it again: if you don't have a smart filesystem, it is madness to shove 10000 files into a directory.

    With a dumb filesystem, you must do what the filesystem should have done for you transparently (by internally implementing a directory as a tree). For example, you create a hierarchy and create each file in the right place: if it arrives at 3:04, you create the file in /var/spool/whatever/03/04. Then finding the oldest file is just a matter of walking the tree.
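    The hour/minute hierarchy can be sketched as follows; the base path matches the example above, and everything else (helper name, create-on-demand) is an assumption:

    ```perl
    use strict;
    use warnings;
    use File::Path qw(make_path);
    use File::Spec;

    # A file arriving at 03:04 lands in <base>/03/04. A reader then
    # walks hour directories, then minute directories, in ascending
    # order to reach the oldest file without stat-ing everything.
    sub spool_dir_for {
        my ($base, $epoch) = @_;
        my ($min, $hour) = (localtime $epoch)[1, 2];
        return File::Spec->catdir($base,
                                  sprintf('%02d', $hour),
                                  sprintf('%02d', $min));
    }

    if (-d '/var/spool/whatever') {            # only on a real spool host
        my $dir = spool_dir_for('/var/spool/whatever', time());
        make_path($dir);                       # create bucket on demand
        # ... then write the incoming file into $dir
    }
    ```

    Note that hour/minute names wrap around daily, so the reader must drain each directory before its slot comes round again.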

    -- stefp