opendir slower than ls on large dirs?

by Anonymous Monk
on Jul 05, 2005 at 09:43 UTC ( [id://472398] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to check whether a file is in a directory with a lot of files (about 15 000). This is done often, and I can't keep my filehandle open or cache the result, so I need to check the directory from scratch every time.
The script wasn't originally written by me and uses `ls ...`, which I figured should be slower than opendir(), but it seems it's not. opendir beats ls on directories containing a small number of files but is slower on directories with many files. Can anyone explain to me how shelling out can be quicker than opendir and grep?

What I find really confusing is that reading/rewinding an already open directory isn't much faster than open/close. The output I get is:
    jmo@foo:~> ls /some/small/dir|wc -l
    322
    jmo@foo:~> ls /some/large/dir|wc -l
    12337
    jmo@foo:~> perl ls.pl
    Benchmark: timing 5000 iterations of allready opendir on small dir, ls on small dir, opening dir on small dir...
    allready opendir on small dir:  3 wallclock secs ( 2.83 usr +  0.50 sys =  3.33 CPU) @ 1501.50/s (n=5000)
    ls on small dir: 30 wallclock secs ( 0.60 usr  2.48 sys + 11.90 cusr 16.85 csys = 31.83 CPU) @ 1623.38/s (n=5000)
    opening dir on small dir:  4 wallclock secs ( 3.31 usr +  0.61 sys =  3.92 CPU) @ 1275.51/s (n=5000)
    Benchmark: timing 5000 iterations of allready opendir on large dir, ls on large dir, opening dir on large dir...
    allready opendir on large dir: 102 wallclock secs (87.82 usr + 13.04 sys = 100.86 CPU) @ 49.57/s (n=5000)
    ls on large dir: 57 wallclock secs ( 0.51 usr  1.99 sys + 27.62 cusr 27.09 csys = 57.21 CPU) @ 2000.00/s (n=5000)
    opening dir on large dir: 101 wallclock secs (88.11 usr + 12.37 sys = 100.48 CPU) @ 49.76/s (n=5000)
The code I run looks like this:
    use Benchmark;
    use strict;

    my $dir  = '/some/small/dir';
    my $file = 'somefile*';

    opendir (F, $dir);
    timethese(5000, {
        'ls on small dir'               => \&foo,
        'allready opendir on small dir' => \&bar,
        'opening dir on small dir'      => \&baz,
    });

    $dir  = '/some/large/dir';
    $file = 'somefile*';
    print "\n\n\n";

    closedir F;
    opendir (F, $dir);
    timethese(5000, {
        'ls on large dir'               => \&foo,
        'allready opendir on large dir' => \&bar,
        'opening dir on large dir'      => \&baz,
    });
    closedir F;

    sub foo {
        my $a = `ls $dir/$file &>/dev/null`;
    }

    sub bar {
        my (@files) = (grep (/$file/, readdir(F)));
        rewinddir (F);
    }

    sub baz {
        opendir (DIR, $dir);
        my (@files) = (grep (/$file/, readdir(DIR)));
        closedir F;
    }

Replies are listed 'Best First'.
Re: opendir slower than ls on large dirs?
by BrowserUk (Patriarch) on Jul 05, 2005 at 10:40 UTC

    Have you tried using glob?

    It seems to run quicker than grep /regex/, readdir on large dirs on my system when using a wildcard.
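
    For instance, something along these lines (a minimal sketch; $dir and the somefile* pattern are just placeholders):

        # Let glob do the wildcard matching in a single call:
        my @files = glob( "$dir/somefile*" );
        printf "%d match(es)\n", scalar @files;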


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Glob was the solution! It took less than 1s with the same tests.

      It's still kinda scary that opendir is so slow.

        It's not that opendir is slow; it's calling readdir, building the list, passing that list through to grep (which starts the regex engine once for every file in the list), building the output list, and assigning it to the array.

        With large lists, much of that time would be spent expanding memory in stages (powers of 2?) to accommodate the lists as they grow.

        With glob, everything is done within the C code which is entered and exited once.
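
        To make the contrast concrete, here is a rough sketch of the two code paths ($dir and the somefile pattern are assumed placeholders):

            # Perl-space route: readdir hands back every entry, grep starts
            # the regex engine once per entry, and the intermediate and
            # result lists are built and grown in Perl.
            opendir my $dh, $dir or die "opendir $dir: $!";
            my @via_readdir = grep { /^somefile/ } readdir $dh;
            closedir $dh;

            # glob route: the matching and list-building happen inside the
            # one call to the C-level code.
            my @via_glob = glob "$dir/somefile*";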


Re: opendir slower than ls on large dirs?
by anonymized user 468275 (Curate) on Jul 05, 2005 at 10:07 UTC
    Wouldn't (-f $dir/$file) be a faster test for the existence of an (ordinary) file in a directory than either of the other methods?

    To explain the comparative behaviour roughly: shelling out has a fixed overhead, and the ls command will have been optimised for the platform. opendir/readdir, on the other hand, has variable overheads and is probably not optimised for a particular platform, so it starts out cheaper but its cost grows with the file count until it eventually overtakes the fixed overhead of shelling out.
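
    A minimal sketch of that test (assuming $dir and $file hold a literal filename rather than a wildcard):

        # -f is true only for an existing plain file:
        print "found\n" if -f "$dir/$file";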

    One world, one people

      It is by gazillions faster than the things I tried, but I need a wildcard and -f doesn't support that, AFAIK.

      Thanks for the explanation of why ls is quicker on large dirs, though.
Re: opendir slower than ls on large dirs?
by merlyn (Sage) on Jul 05, 2005 at 13:58 UTC
        my $file = 'somefile*';
        ...
        my (@files) = (grep (/$file/, readdir(F)));
    I'm not exactly sure how it flaws your benchmark, but that's definitely not going to do what you intend. You're looking for any filename that contains "somefil" followed by zero or more "e" characters. Do you want that, or do you instead want /^somefile/?
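
    To make the difference concrete (a small illustration; the filenames are invented):

        my $file = 'somefile*';
        # As a regex, 'somefile*' means "somefil" followed by zero or more
        # "e"s, anywhere in the string, so this unrelated name matches:
        print "regex matches\n"  if 'not-a-somefil-at-all.bak' =~ /$file/;
        # An anchored prefix match is probably what was intended:
        print "prefix matches\n" if 'somefile001.dat' =~ /^somefile/;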

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Heh, you're right in that I want /^somefile/, not somefile*. Is glob the fastest way to find files in a directory? I took the opendir, grep ... readdir idiom from "perldoc -f readdir" and it seemed like a nice solution, but I'm dropping it in favour of glob. What I was trying to figure out in the first place was "what's the cheapest way of listing a bunch of files in a directory", and it seems to me that glob is the way to go. Thanks everyone for all the answers.
Re: opendir slower than ls on large dirs?
by neniro (Priest) on Jul 05, 2005 at 09:48 UTC
    In the case of foo() you just slurp the output of ls into a single variable; in both other cases you get arrays as the result. You're comparing apples to pears.
      Switching @files for $files doesn't change the speed.

      Also, opendir with or without pre-open is 6-10 times quicker on a small dir but twice as slow on a large dir.

        Er, no, because an intermediate list is still being built. I think a fairer comparison would be to assign to an array in the foo subroutine.

        Of course, a readdir and a grep is not at all equivalent to ls <file>, which will immediately see that $file exists (or not) and will not do any further searching; in the pure Perl code you are going through every directory entry regardless of whether you have already seen the file. You would see a somewhat similar performance hit if you did `ls $dir | grep $file`.
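
        For a single literal name, an early-exit scan avoids grepping the whole listing (a sketch; $dir and $name are assumed to be set):

            # Stop at the first hit instead of filtering every entry:
            opendir my $dh, $dir or die "opendir $dir: $!";
            my $found = 0;
            while ( defined( my $entry = readdir $dh ) ) {
                if ( $entry eq $name ) { $found = 1; last }
            }
            closedir $dh;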

        /J\

Re: opendir slower than ls on large dirs?
by Anonymous Monk on Jul 05, 2005 at 13:10 UTC
    It's the grep using a regex that's slowing you down. You would expect that, for a large directory, and assuming what you're looking for isn't cached, the bottleneck is going to be the disk overhead, and whether or not you call ls is going to make little difference. However, in your readdir solutions you do work in Perl space for each file returned, because of your grep. grep $_ eq $file would be better than grep /$file/, but you're still doing Perl work for each file returned.
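
    A quick sketch of that change, using the handle and variable names from the original benchmark (and assuming $file holds a literal name, not a wildcard):

        # String comparison per entry instead of starting the regex engine:
        my @files = grep { $_ eq $file } readdir(F);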

    If all you want to know is whether a certain file exists, use -e. That's going to be the fastest, and it will tell you exactly that. On Unix operating systems, -e on a large directory will still be slower than on a small directory, and that's due to the Unix decision to store filenames unsorted in a directory (storing files sorted makes operations in a large directory faster, but then those operations would be slower in a small directory). Now, I was a bad boy and generalized about Unix, which is not a smart thing to do, because Unix means a gazillion ways of doing the same thing, each slightly different from the others, so no doubt there are a few file systems out there that store files differently. I think Windows directories store files unsorted as well, but I could be mistaken.

    Lesson to be learned: do not create large directories! (Directories are like drawers: the more stuff you have in them, the harder it is to find something.)

Re: opendir slower than ls on large dirs?
by TedPride (Priest) on Jul 05, 2005 at 12:30 UTC
    I'm trying to check if a file is in a directory with a lot of files (about 15 000).

    Can't you just do this?

    if (-e $filepath) { ... }
