leocharre has asked for the wisdom of the Perl Monks concerning the following question:

I have a directory with 100,000 file entries. I want to pull out 5 filenames in no particular order. What's a quick way to do this?

One thing I can think of is using readdir in scalar context. By comparison, reading a directory in list context can take a long time (a few seconds).
Maybe something like..

my $abs_d = './';
my @f;
opendir(DIR,$abs_d) or die;
for my $filename ( readdir(DIR) ){
    next if $filename=~/^\.+$/;
    push @f, $filename;
    last if $#f == 4;
}
Maybe there's some other interesting way to do this?

Replies are listed 'Best First'.
Re: quick way to read a few directory entries of many
by ikegami (Patriarch) on Jun 05, 2008 at 14:40 UTC
    That snippet uses readdir in list context. You'd want
    my $dn = '.';
    my @f;
    opendir(my $dh, $dn) or die;
    while ( @f < 5 && defined( my $filename = readdir($dh) ) ){
        next if $filename =~ /^\.\.?\z/;
        push @f, $filename;
    }

    Other changes:

    • I renamed $abs_d cause it didn't contain an absolute path.
    • I used @f<5 rather than $#f<4 cause there's no need to hide the actual number of interest.
    • I used a lexical instead of a global variable for the handle. Needless use of globals is bad.
    • Your check for "." and ".." was flawed. It could skip legitimate file names.

    Note that you'll quite likely get the same files in the same order every time. But that's how it's going to be unless you read the whole dir.

Re: quick way to read a few directory entries of many
by swampyankee (Parson) on Jun 05, 2008 at 17:16 UTC

    With 10^5 files in one directory, my first thought would be to reorganize the directory structure. My second thought would be "what's special about the first five files returned by readdir that you need to look at their names?"

    I think that explaining why you need to do this would be useful, as it may give some of the very knowledgeable Monks a better way to look at your problem.



      I have an 'incoming' directory that may or may not contain files that have to be 'processed'.
      There are, off and on, a ton of files, sometimes indeed in the tens of thousands.

      The procedure that each file undergoes may be expensive- thus I have a daemon sort of thing.. that will run x times during the day and maybe a lot during the night.. or maybe if it detects that the cpu has been "idle" for x minutes.

      So I take a few files, maybe ten, and do something with them, sleep or check for cpu usage.. then iterate.
      My frustration is that sometimes a third of each iteration's time goes just to picking the files.

      I am aware that I can cache the directory read data, etc. I am not seeking a way to change what I am doing; I am seeking to pick some files out of many, quickly. I figure it's something that would be worth setting a precedent for, for the future.

      Maybe you are suggesting I could pipe in the file data directly from pointers to the dir struct or something funny like that? (ext3)

        Maybe you could somehow take advantage of Linux's inotify mechanism (and Linux::Inotify2) — if you're on Linux, that is, of course... but your mention of ext3 sounds like you might be.  In other words, you could scan the entire directory once, and then update the resulting data structure as you receive events about individual files being added, removed, etc.  Or something like that.  Just an idea...
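
A minimal sketch of that idea, assuming Linux and the Linux::Inotify2 module (the incoming/ directory name and the batch size of five are made up for illustration):

```perl
use strict;
use warnings;
use Linux::Inotify2;

my $dir = 'incoming';   # hypothetical queue directory
my %pending;            # filename => 1; the cached directory view

# One full scan to seed the cache.
opendir(my $dh, $dir) or die "opendir $dir: $!";
$pending{$_} = 1 for grep { !/^\.\.?\z/ } readdir($dh);
closedir($dh);

# From here on, only update the cache as events arrive.
my $inotify = Linux::Inotify2->new or die "inotify: $!";
$inotify->watch($dir, IN_CREATE | IN_MOVED_TO | IN_DELETE | IN_MOVED_FROM, sub {
    my $e = shift;
    if    ($e->IN_CREATE or $e->IN_MOVED_TO)   { $pending{$e->name} = 1 }
    elsif ($e->IN_DELETE or $e->IN_MOVED_FROM) { delete $pending{$e->name} }
});

while (1) {
    $inotify->poll;   # blocks until the directory changes
    my @batch = grep defined, (keys %pending)[0 .. 4];   # picking five is now cheap
    # ... hand @batch to the expensive processing step ...
}
```

The upside is that after the initial scan you never pay the cost of rereading the whole directory; the cache is kept current for you.
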

        Depends on your definition of 'processed'. I've had similar jobs, but the files only needed to be processed once. More accurately, each instance of a file only needed to be processed once, so I always mv the 'done' files to an arc dir and gzip them as I process them.
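
That move-and-gzip pattern is only a few lines in Perl. A minimal sketch using only core modules (the arc/ directory name is illustrative):

```perl
use strict;
use warnings;
use File::Copy qw(move);
use IO::Compress::Gzip qw(gzip $GzipError);

# Move a processed file into an archive dir and gzip it there.
sub archive_done {
    my ($file, $arc_dir) = @_;
    mkdir $arc_dir unless -d $arc_dir;
    my $dest = "$arc_dir/$file";
    move($file, $dest)          or die "move $file: $!";
    gzip($dest => "$dest.gz")   or die "gzip failed: $GzipError";
    unlink $dest                or die "unlink $dest: $!";
}
```

Because the 'done' files leave the incoming directory as they are processed, the directory stays small and the pick-a-few problem mostly goes away on its own.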
Re: quick way to read a few directory entries of many
by Fletch (Bishop) on Jun 05, 2008 at 14:54 UTC

    Depending on the underlying filesystem there may not be anything you can do to speed it up. For example, on Linux ext2 is going to start to chug on directories over around a thousand entries (at least as a rule of thumb, that's where you'll start to feel the performance hit). If this is the case you'd be better served reorganizing the contents, if you can, so that the directory structure itself does the filtering by virtue of how you've organized things.
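
One common way to do that reorganization is to shard files into subdirectories keyed on a prefix of a hash of the filename, so no single directory grows large. A minimal sketch with core modules (the two-character shard depth, giving 256 buckets, is an arbitrary choice):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path qw(make_path);
use File::Copy qw(move);

# Map a filename to a bucket directory like "incoming/ac" based on its name hash.
sub shard_dir {
    my ($root, $file) = @_;
    my $bucket = substr(md5_hex($file), 0, 2);   # 256 possible buckets
    return "$root/$bucket";
}

# Move a file into its shard, creating the bucket directory on demand.
sub file_into_shard {
    my ($root, $file) = @_;
    my $dir = shard_dir($root, $file);
    make_path($dir) unless -d $dir;
    move($file, "$dir/$file") or die "move $file: $!";
}
```

With 100,000 files spread over 256 buckets, each directory holds only a few hundred entries, which keeps readdir fast everywhere.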

    The cake is a lie.

Re: quick way to read a few directory entries of many
by kgraff (Monk) on Jun 05, 2008 at 16:55 UTC

    How often do the files in the directory change?

    If it is not very often, you could run a cron job to put the list of files or a subset of the list of files into a separate file, then obtain your 5 file names from that.
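
A sketch of the reading side, assuming the cron job has written one filename per line to a cache file (the filelist.txt name is made up; the cron side could be as simple as `ls incoming > filelist.txt`):

```perl
use strict;
use warnings;

# Grab the first N names from the pre-built list, stopping as soon as we have enough.
sub first_n_from_cache {
    my ($cache, $n) = @_;
    open my $fh, '<', $cache or die "open $cache: $!";
    my @names;
    while (my $line = <$fh>) {
        chomp $line;
        push @names, $line;
        last if @names >= $n;
    }
    close $fh;
    return @names;
}
```

Reading five lines from a flat file is effectively instant, no matter how many entries the directory itself holds.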

      This is a very good hack. Using Cache::File would be useful with that.

      Still, there could be a few lines of code that could help speed this up, live.

      In my case, the files change often. I guess some caching is in order- or some kind of shared memory thing- a singleton, who knows- that would hold a list readily altered.. hmm.. Still looking for a small hack though..