Spida has asked for the wisdom of the Perl Monks concerning the following question:

How do I get the newest file out of a directory and its subdirectories?
I'm trying to write an automatic update tool for the freedb cddb-server, and that's the only way I can think of to find out which db version is installed.
Because of the number of files (about 700,000), this is very time-critical.

Replies are listed 'Best First'.
Re: Get newest file
by Zaxo (Archbishop) on Oct 27, 2002 at 20:02 UTC

    That's a lot of files, and a lot of stat calls if you do this by brute force. How about changing the organization to put new files in a holding directory? You can then check dates, process, and move the files from there.
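
    A minimal sketch of that holding-directory layout (the paths here are made-up placeholders, not anything freedb actually uses):

        #!/usr/bin/perl
        use strict;
        use File::Copy qw(move);
        use File::Basename qw(basename);

        # hypothetical paths -- adjust to your own layout
        my $holding = '/var/spool/freedb/incoming';
        my $live    = '/var/lib/freedb';

        for my $file (glob "$holding/*") {
            next unless -f $file;
            # ... process $file here ...
            move($file, "$live/" . basename($file))
                or warn "could not move $file: $!";
        }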

    If you must do this, check File::Find, and note that the mtime of a directory matches that of its newest file, provided existing files are only ever added, never modified in place.

    After Compline,
    Zaxo

Re: Get newest file
by Rich36 (Chaplain) on Oct 27, 2002 at 20:09 UTC

    You can use the module File::Find to traverse the directory structure and stat to find the age of each file. The example below builds a hash mapping each file name to its last-modified time. The application then compares the values in the hash and prints out the newest file.

    #!/usr/bin/perl
    use strict;
    use File::Find;
    use vars qw/%files/;

    sub findNewestFiles {
        my $element = $File::Find::name;
        return if (!-f $element);
        $files{$element} = (stat($element))[9];
    }

    #######################################################
    # MAIN
    #######################################################
    my $dir = '/home/users/rich36';
    find(\&findNewestFiles, $dir);

    my $newestfile;
    my $time = 0;
    while (my ($k, $v) = each(%files)) {
        if ($v > $time) {
            $newestfile = $k;
            $time = $v;
        }
    }
    $time = localtime($time);
    print "The newest file is $newestfile : $time\n";
    exit;

    «Rich36»

      Actually, after considering Zaxo and tadman's responses, you might be better off altering the code above to find the directory with the most recent modified time (Zaxo's suggestion), then use glob on that directory (tadman's suggestion) to find the most recent file, as sketched below. That might be more efficient given the number of files and directories you are searching through, because you would only be storing directories in the hash and the application would not be calling stat on all those files.
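
      A rough sketch of that combined approach (assuming directory mtimes are a usable proxy here; they change when entries are added or removed, not when a file is modified in place):

          use File::Find;

          my %dirs;
          # stat only the directories, not the ~700,000 files
          find(sub { $dirs{$File::Find::name} = (stat $_)[9] if -d $_ }, $dir);

          # directory with the most recent mtime
          my ($newest_dir) = sort { $dirs{$b} <=> $dirs{$a} } keys %dirs;

          # newest file inside it, stat-ing each file only once
          my ($newestfile) =
              map  { $_->[0] }
              sort { $b->[1] <=> $a->[1] }
              map  { [ $_, (stat $_)[9] ] }
              grep { -f }
              glob("$newest_dir/*");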


      «Rich36»
        Half of the files are in two directories, with more than 200,000 each. Going for the directory with the newest modified time would save me from checking more than 65% of the total...
        As a work-around, I could just ask the user which version he installed...
        AFAIK, freedb includes no other way to check the db version...
Re: Get newest file
by tadman (Prior) on Oct 27, 2002 at 20:01 UTC
    If you mean newest as in time of modification, then you can do something like this:
    my ($newest) = sort { -M $a <=> -M $b } glob("*");
    Although untested, this is basically how you approach this sort of thing.

      That will stat all files more than once (remember, sort is N*log(N)), and will, particularly for large directories, break down horribly in performance. The least you can do here is a Schwartzian Transform or variant thereof.
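
      For instance, a Schwartzian Transform caches each file's mtime so stat runs once per file (an untested sketch in the same spirit as the parent's one-liner):

          my ($newest) =
              map  { $_->[0] }
              sort { $b->[1] <=> $a->[1] }      # largest mtime (newest) first
              map  { [ $_, (stat $_)[9] ] }     # stat each file exactly once
              glob("*");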

      Also remember that globbing for * will miss any files starting with a dot, if that is of significance.

      I'd do something like this:

      my ($age, $name) = (0, "");
      opendir DIR, $directory;
      while (my $curname = readdir DIR) {
          my $curage = (stat "$directory/$curname")[9];
          next unless -f _;
          ($name, $age) = ($curname, $curage) if $curage > $age;
      }
      closedir DIR;
      But that still doesn't address the recursive nature of the original poster's task. Zaxo makes the right points.

      Makeshifts last the longest.

Re: Get newest file
by graff (Chancellor) on Oct 27, 2002 at 23:27 UTC
    Please consider looking at this node of mine from a few months ago (and at the thread that it's part of, too). The original question there was similar to yours, I think: someone wanted to be able to check a directory tree at regular intervals to locate all the paths and files that had been added since the preceding check.

    One of the points made there was that using the GNU "find" command with backticks or system() was a lot faster than File::Find on big file spaces covering many thousands of files.
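
    For example, something along these lines (a sketch assuming GNU find with its -printf extension; the path is a placeholder):

        # mtime-in-seconds and path for every file, newest first
        my ($newest) = `find /some/dir -type f -printf '%T@ %p\n' | sort -rn | head -1`;
        chomp $newest;
        my ($mtime, $path) = split ' ', $newest, 2;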

    Another point was that, if you're doing this sort of checking at intervals, you can save listings of files (and mod times, if necessary) in each directory; then, on the next check, use "-newer list.file" to locate files more recent than the listing created during the last check.
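
    A sketch of that incremental scheme (the timestamp-file name and path are made up for the example):

        # files changed since the last check
        my @changed = `find /some/dir -newer /some/dir/.last_check -type f`;
        chomp @changed;

        # reset the marker so the next run only sees newer files
        system('touch', '/some/dir/.last_check') == 0
            or warn "could not update timestamp file: $?";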