in reply to how to list the files in dir with respect to time

I appreciate that swaroop prefers not to use Unix commands, and I recognize that using stat should be platform agnostic on any platform Perl supports, whereas a Unix command is tied to a Unix variant and would likely not work under, say, Windows (without Cygwin or similar).

However, given a constraint that says the code will only ever run under Unix, is the overhead of shelling out really worse than stat'ing each and every file, which must at some point be translated into system calls as well? I would think that the `ls -lrt` option would be the more CPU-efficient operation.

Yeah, I should run a benchmark. I've resisted this for a long time, but there's no time like the present... I'll post an update when I'm done, unless someone beats me to it.

-Scott

Update:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'reduce';
use Benchmark qw(cmpthese);

our $path = shift || '/usr/bin/*';

sub a {
    my $newest = (
        reduce { $a->[1] < $b->[1] ? $b : $a }
        map { [ $_, (stat)[9] ] } glob "$path"
    )->[0];
}

sub b {
    my %file_date;
    for ( glob "$path" ) {
        $file_date{$_} = (stat)[9];
    }
    my $newest = ( sort { $file_date{$b} <=> $file_date{$a} } keys %file_date )[0];
}

sub c {
    use List::Util 'reduce';
    my $newest = reduce { (stat $a)[9] < (stat $b)[9] ? $b : $a } glob "$path";
}

sub d {
    my %file_date;
    my $max = -99999999;   # setting it to the first file's mtime would be better,
                           # but this is just for demonstration
    my $mtime;
    my $file;
    for ( glob "$path" ) {
        $mtime = (stat)[9];
        if ( $max <= $mtime ) {
            $file = $_;
            $max  = $mtime;
        }
    }
    my $newest = $file;
}

sub e {
    my @array  = `ls -lrt $path`;
    my $newest = $array[-1];
}

cmpthese(250, {
    zaxo_first => \&a,
    graff_hash => \&b,
    many_stats => \&c,
    sk_maxfile => \&d,
    the_ls_lrt => \&e,
});

             Rate graff_hash zaxo_first many_stats sk_maxfile the_ls_lrt
graff_hash 25.3/s         --       -14%       -21%       -40%       -84%
zaxo_first 29.3/s        16%         --        -9%       -31%       -81%
many_stats 32.3/s        27%        10%         --       -24%       -79%
sk_maxfile 42.4/s        67%        45%        31%         --       -73%
the_ls_lrt  155/s       513%       430%       381%       266%         --
Ok, I think something's off: I'd expect "many_stats" to be the worst-performing option, but it turns out not to be so bad. This was run against our /usr/bin directory, which holds 1643 files...

Re^2: how to list the files in dir with respect to time - benchmarks
by 5mi11er (Deacon) on Aug 03, 2005 at 22:29 UTC
    Ok, after sleeping on it, and re-examining the code, I've got a better handle on the benchmarks.

    My first mistake was not fully reading the "many_stats" routine; it uses List::Util::reduce, not sort as I'd assumed when I glossed over it. So I added a test ("most_stats") that does use the worst non-contrived combination of stat and sort, and THAT gave me the results I expected.

                 Rate   most_stats  graff_hash  zaxo_first  more_stats  sk_maxfile  the_ls_lrt
    most_stats   4.64/s       --         -82%        -84%        -86%        -89%        -97%
    graff_hash  25.6 /s      452%         --         -14%        -22%        -41%        -82%
    zaxo_first  29.8 /s      543%         16%         --          -9%        -31%        -79%
    more_stats  32.9 /s      608%         28%         10%         --         -24%        -76%
    sk_maxfile  43.3 /s      832%         69%         45%         32%         --         -69%
    the_ls_lrt 140   /s     2909%        445%        368%        325%        223%         --
    
    Doing two stats for every comparison in the sort routine is REALLY bad, and the next-worst option is graff's approach of caching the file dates in a hash and then sorting.

    Then we have the two attempts using List::Util::reduce; I don't quite understand how zaxo's caching of the dates can be slower than stat'ing during each comparison. My only guess is that setting up the two-dimensional array and all the dereferencing involved creates enough of a penalty to outweigh the extra stat calls.

    Then we see that sk's function, which tracks the newest file in a single pass over the list, is a bit better than the reduce options. And finally, letting the system's 'ls' command do most of the work for us is far and away the best option (ignoring portability issues).
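
    One note on the ls-based routine as benchmarked: it keeps the whole `ls -l` line, not just the name. If only the filename is wanted, the long listing can be dropped entirely; a minimal sketch (the sub name is mine, and it assumes filenames without embedded newlines):

    sub newest_via_ls {
        my ($pattern) = @_;
        # -rt sorts oldest-first by mtime; output is one name per line when piped
        my @names = `ls -rt $pattern`;
        chomp @names;
        return $names[-1];    # last entry is the newest
    }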

    Code follows:
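
    A hedged sketch of what the added most_stats routine might have looked like (the name comes from the table above; the body is a reconstruction, not the original code), with more_stats presumably being the parent node's many_stats reduce routine renamed:

    sub most_stats {
        # worst case: sort by mtime, calling stat twice per comparison
        my $newest = ( sort { (stat $a)[9] <=> (stat $b)[9] } glob "$path" )[-1];
    }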

    -Scott