opendir slower than ls on large dirs?

by Anonymous Monk
on Jul 05, 2005 at 09:43 UTC ( [id://472398] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to check whether a file is in a directory with a lot of files (about 15 000). This is done often, and I can't keep my filehandle open or cache the result, so I need to check the directory from scratch every time.
The script wasn't originally written by me and uses `ls ...`, which I figured should be slower than opendir(), but it seems it's not. opendir beats ls on directories containing a small number of files but is slower on directories with many files. Can anyone explain to me how shelling out can be quicker than opendir and grep?

What I find really confusing is that reading/rewinding an already open directory isn't much faster than open/close. The output I get is:
    jmo@foo:~> ls /some/small/dir|wc -l
    322
    jmo@foo:~> ls /some/large/dir|wc -l
    12337
    jmo@foo:~> perl ls.pl
    Benchmark: timing 5000 iterations of allready opendir on small dir, ls on small dir, opening dir on small dir...
    allready opendir on small dir:  3 wallclock secs ( 2.83 usr +  0.50 sys =  3.33 CPU) @ 1501.50/s (n=5000)
    ls on small dir: 30 wallclock secs ( 0.60 usr  2.48 sys + 11.90 cusr 16.85 csys = 31.83 CPU) @ 1623.38/s (n=5000)
    opening dir on small dir:  4 wallclock secs ( 3.31 usr +  0.61 sys =  3.92 CPU) @ 1275.51/s (n=5000)
    Benchmark: timing 5000 iterations of allready opendir on large dir, ls on large dir, opening dir on large dir...
    allready opendir on large dir: 102 wallclock secs (87.82 usr + 13.04 sys = 100.86 CPU) @ 49.57/s (n=5000)
    ls on large dir: 57 wallclock secs ( 0.51 usr  1.99 sys + 27.62 cusr 27.09 csys = 57.21 CPU) @ 2000.00/s (n=5000)
    opening dir on large dir: 101 wallclock secs (88.11 usr + 12.37 sys = 100.48 CPU) @ 49.76/s (n=5000)
The code I run looks like this:
    use Benchmark;
    use strict;

    my $dir  = '/some/small/dir';
    my $file = 'somefile*';

    opendir (F, $dir);
    timethese(5000, {
        'ls on small dir'               => \&foo,
        'allready opendir on small dir' => \&bar,
        'opening dir on small dir'      => \&baz,
    });

    $dir  = '/some/large/dir';
    $file = 'somefile*';
    print "\n\n\n";

    closedir F;
    opendir (F, $dir);
    timethese(5000, {
        'ls on large dir'               => \&foo,
        'allready opendir on large dir' => \&bar,
        'opening dir on large dir'      => \&baz,
    });
    closedir F;

    sub foo {
        my $a = `ls $dir/$file &>/dev/null`;
    }

    sub bar {
        my (@files) = (grep (/$file/, readdir(F)));
        rewinddir (F);
    }

    sub baz {
        opendir (DIR, $dir);
        my (@files) = (grep (/$file/, readdir(DIR)));
        closedir F;
    }

Replies are listed 'Best First'.
Re: opendir slower than ls on large dirs?
by BrowserUk (Patriarch) on Jul 05, 2005 at 10:40 UTC

    Have you tried using glob?

    It seems to run quicker than grep /regex/, readdir on large dirs on my system when using a wildcard.
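
    For instance, something along these lines (a minimal sketch; $dir and the somefile* pattern are just placeholders):

        # Let glob do the wildcard matching in a single call:
        my @files = glob( "$dir/somefile*" );
        printf "%d match(es)\n", scalar @files;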


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Glob was the solution! It took less than 1s with the same tests.

      It's still kinda scary that opendir is so slow.

        It's not that opendir is slow; it's calling readdir, building the list, passing that list through to grep (which starts the regex engine once for every file in the list), building the output list, and assigning it to the array.

        With large lists, much of that time would be spent expanding memory in stages (powers of 2?) to accommodate the lists as they grow.

        With glob, everything is done within the C code which is entered and exited once.
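
        To make the contrast concrete, here is a rough sketch of the two code paths ($dir and the somefile pattern are assumed placeholders):

            # Perl-space route: readdir hands back every entry, grep starts
            # the regex engine once per entry, and the intermediate and
            # result lists are built and grown in Perl.
            opendir my $dh, $dir or die "opendir $dir: $!";
            my @via_readdir = grep { /^somefile/ } readdir $dh;
            closedir $dh;

            # glob route: the matching and list-building happen inside the
            # one call to the C-level code.
            my @via_glob = glob "$dir/somefile*";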


Re: opendir slower than ls on large dirs?
by anonymized user 468275 (Curate) on Jul 05, 2005 at 10:07 UTC
    Wouldn't (-f $dir/$file) be a faster test for the existence of an (ordinary) file in a directory than either of the other methods?

    To explain the comparative behaviour roughly: shelling out has a fixed overhead, and the ls command will have been optimised for the platform. opendir/readdir, on the other hand, has variable overheads and is probably not optimised for a particular platform, so it starts out cheaper but its cost grows with the file count until it eventually overtakes the fixed overhead of shelling out.
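
    A minimal sketch of that test (assuming $dir and $file hold a literal filename rather than a wildcard):

        # -f is true only for an existing plain file:
        print "found\n" if -f "$dir/$file";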

    One world, one people

      It is by gazillions faster than the things I tried, but I need a wildcard and -f doesn't support that, AFAIK.

      Thanks for the explanation of why ls is quicker on large dirs, though.
Re: opendir slower than ls on large dirs?
by merlyn (Sage) on Jul 05, 2005 at 13:58 UTC
        my $file = 'somefile*';
        ...
        my (@files) = (grep (/$file/, readdir(F)));
    I'm not exactly sure how it flaws your benchmark, but that's definitely not going to do what you intend. You're looking for any filename that contains "somefil" followed by zero or more "e" characters. Do you want that, or do you instead want /^somefile/?
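
    To make the difference concrete (a small illustration; the filenames are invented):

        my $file = 'somefile*';
        # As a regex, 'somefile*' means "somefil" followed by zero or more
        # "e"s, anywhere in the string, so this unrelated name matches:
        print "regex matches\n"  if 'not-a-somefil-at-all.bak' =~ /$file/;
        # An anchored prefix match is probably what was intended:
        print "prefix matches\n" if 'somefile001.dat' =~ /^somefile/;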

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Heh, you're right in that I want /^somefile/, not somefile*. Is glob the fastest way to find files in a directory? I took the opendir, grep ... readdir idiom from "perldoc -f readdir" and it seemed like a nice solution, but I'm dropping it in favour of glob. What I was trying to figure out in the first place was "what's the cheapest way of listing a bunch of files in a directory", and it seems to me that glob is the way to go. Thanks everyone for all the answers.
Re: opendir slower than ls on large dirs?
by neniro (Priest) on Jul 05, 2005 at 09:48 UTC
    In the case of foo() you just slurp the output of ls into a single variable; in both other cases you get arrays as the result. You're comparing apples to pears.
      Switching @files for $files doesn't change the speed.

      Also, opendir with or without pre-open is 6-10 times quicker on a small dir but twice as slow on a large dir.

        Er, no, because an intermediate list is still being built. I think a fairer comparison would be to assign to an array in the foo subroutine.

        Of course, a readdir and a grep is not at all equivalent to ls <file>, which will immediately see that $file exists (or not) and will not do any further searching; in the pure Perl code you are going through every directory entry regardless of whether you have already seen the file. You would see a somewhat similar performance hit if you did `ls $dir | grep $file`.
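
        For a single literal name, an early-exit scan avoids grepping the whole listing (a sketch; $dir and $name are assumed to be set):

            # Stop at the first hit instead of filtering every entry:
            opendir my $dh, $dir or die "opendir $dir: $!";
            my $found = 0;
            while ( defined( my $entry = readdir $dh ) ) {
                if ( $entry eq $name ) { $found = 1; last }
            }
            closedir $dh;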

        /J\

Re: opendir slower than ls on large dirs?
by Anonymous Monk on Jul 05, 2005 at 13:10 UTC
    It's the grep using a regex that's slowing you down. You would expect that, for a large directory, and assuming what you're looking for isn't cached, the bottleneck is going to be the disk overhead, and whether or not you call ls is going to make little difference. However, in your readdir solutions you do work in Perl space for each file returned, because of your grep. grep $_ eq $file would be better than grep /$file/, but you're still doing Perl work for each file returned.
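
    A quick sketch of that change, using the handle and variable names from the original benchmark (and assuming $file holds a literal name, not a wildcard):

        # String comparison per entry instead of starting the regex engine:
        my @files = grep { $_ eq $file } readdir(F);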

    If all you want to know is whether a certain file exists, use -e. That's going to be the fastest, and it will tell you exactly that. On Unix operating systems, -e on a large directory will still be slower than on a small directory, and that's due to the Unix decision to store filenames unsorted in a directory (storing files sorted makes operations in a large directory faster, but then those operations would be slower in a small directory). Now, I was a bad boy and generalized about Unix, which is not a smart thing to do, because Unix means a gazillion ways of doing the same thing, each slightly different from the others, so no doubt there are a few file systems out there that store files differently. I think Windows directories store files unsorted as well, but I could be mistaken.

    Lesson to be learned: do not create large directories! (Directories are like drawers: the more stuff you have in them, the harder it is to find something.)

Re: opendir slower than ls on large dirs?
by TedPride (Priest) on Jul 05, 2005 at 12:30 UTC
    I'm trying to check if a file is in a directory with a lot of files (about 15 000).

    Can't you just do this?

    if (-e $filepath) { ... }
