perlpreben has asked for the wisdom of the Perl Monks concerning the following question:

Heyah!

I need to get the folder names on my NAS (not the files inside). The problem is that I have well over 20,000 folders, and a simple ls is taking too long (hours upon hours).

I've tried parsing system("ls") and tried File::Find::name ... but they are all too slow. Have the experienced monks ever tackled this sort of problem before?

/path_to_nas/data/<20000+++_folders_in_here>/<few files>

Replies are listed 'Best First'.
Re: Get folders in a directory with 20k directories?
by toolic (Bishop) on Aug 25, 2011 at 12:26 UTC
    There may be an issue with your file system. I just created a directory with 20,000 subdirectories, then used the following code to get the names of all the directories. It took no noticeable time to run:
        use warnings;
        use strict;
        use autodie;

        my $dir = '/tmp';
        opendir my ($dh), $dir;
        my @dirs = grep { -d "$dir/$_" } readdir $dh;
        print scalar(@dirs), "\n";
Re: Get folders in a directory with 20k directories?
by moritz (Cardinal) on Aug 25, 2011 at 12:34 UTC
    The traditional Unix solution to slow directory listings over slow connections is to execute ls -l on the server, and compress the result:
    ls -l | gzip > ls-l.gz

    Then you just need to unzip and open that file. Of course that only works if you have shell access to the NAS, and if the directory listing doesn't change very often.

    (Updated to fix command line, hbm++)
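
    Back in Perl, a minimal sketch for reading that compressed listing (assuming it was saved as ls-l.gz as above, and that only the directory lines are wanted):

        use strict;
        use warnings;
        use IO::Uncompress::Gunzip qw($GunzipError);

        my $z = IO::Uncompress::Gunzip->new('ls-l.gz')
            or die "gunzip failed: $GunzipError\n";

        my @dirs;
        while (my $line = $z->getline) {
            next unless $line =~ /^d/;        # 'ls -l' marks directories with a leading 'd'
            my @fields = split ' ', $line, 9; # the name is the 9th field
            push @dirs, $fields[8] if defined $fields[8];
        }
        chomp @dirs;
        print scalar(@dirs), " directories\n";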

      Shouldn't that be the following?

      ls -l | gzip > ls-l.gz
Re: Get folders in a directory with 20k directories?
by RMGir (Prior) on Aug 25, 2011 at 12:41 UTC
    Depending on how far over 20000 files you are, this article may be relevant - it explains how readdir can fail for VERY large directories, and how you can get around it.
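
    One general pattern worth trying in the meantime (a sketch; not necessarily the workaround the article describes): read entries one at a time instead of slurping them all into a list, so memory use stays flat however large the directory is.

        use strict;
        use warnings;

        my $dir = '/path_to_nas/data';
        opendir my $dh, $dir or die "opendir $dir: $!";

        my $count = 0;
        # scalar-context readdir returns one entry per call
        while (defined(my $entry = readdir $dh)) {
            next if $entry eq '.' or $entry eq '..';
            $count++ if -d "$dir/$entry";
        }
        closedir $dh;
        print "$count directories\n";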

    Mike
Re: Get folders in a directory with 20k directories?
by blue_cowdawg (Monsignor) on Aug 25, 2011 at 12:53 UTC
        I need to get the folder names on my NAS (not the files inside).

    First off, define "slow."

    Secondly, there are a lot of factors here that could affect the timing of what you are trying to do. My immediate solution in Perl would be something along these lines:

        # ... stuff before ...
        my $path = "/some/path/where/the/folders/live";
        opendir(DIR, $path) or die $!;
        my @folders = ();
        while (defined(my $entry = readdir(DIR))) {
            next if ($entry eq ".") || ($entry eq "..");
            next unless -d "$path/$entry";   # -d must test the full path, not the bare name
            push @folders, $entry;
        }
        closedir(DIR);
        # ... rest of code ...
        # untested software, do not use to aim phaser banks.

    But wait: if you are having performance issues with the "ls" command, then we have to consider one or more of the following factors:

    • Network latency issues to/from NAS
    • NAS device architecture
    • Filesystem type and efficiencies
    It would seem to me that if "ls" has performance problems acting on that collection of folders, then anything you might use to emulate "ls" is going to have the same problems.

    In my mind this is less of a Perl or scripting issue and more of a system issue.


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: Get folders in a directory with 20k directories?
by norbert.csongradi (Beadle) on Aug 25, 2011 at 12:45 UTC
    There could be problems with automounting, AFS volumes, or slow attribute access.

    Does "echo *" run fast in this dir?

Re: Get folders in a directory with 20k directories?
by osbosb (Monk) on Aug 25, 2011 at 14:27 UTC
    Here's a recursive approach using a sub.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $dir = "/path_to_nas/data/";
        my @directories;

        find_dirs($dir);

        sub find_dirs {
            my $dir = $_[0];
            for my $worker (glob("$dir*")) {
                if (-d $worker) {
                    push @directories, $worker;   # record it, then recurse
                    find_dirs("$worker/");
                }
                # plain files: don't care
            }
        }

        print scalar(@directories), "\n";
Re: Get folders in a directory with 20k directories?
by locked_user sundialsvc4 (Abbot) on Aug 25, 2011 at 12:55 UTC

    I agree that something does not “smell right” about this.   Is there a way to monitor the network traffic that is passing between the machines, and/or to observe the behavior of the processes on other machines that are responsible for producing and/or for delivering the list?

    If such a directory structure was known to perform that egregiously, no one in their right mind would have designed and built such a thing.   Ergo, it didn’t.   Ergo, something else must be wrong (too) ... something unrelated to the file/directory counts.

      If such a directory structure was known to perform that egregiously, no one in their right mind would have designed and built such a thing. Ergo, it didn’t. Ergo, something else must be wrong (too) ... something unrelated to the file/directory counts.

      Not entirely true.

      The classic Unix file system would store the directory entries in a chained list of blocks. If you were doing a long listing (ls -l, for example), you would also need to access the main inode for each file to pull file permissions, ownership, etc. This would cause degraded performance if you had a large, flat directory structure. Also, ls tends to sort its output.

      A common solution to this problem is to make the directory more tree-like by hashing the file names into a number of buckets (N), where N is no larger than the number of directory entries that fit into a single directory block on disk, and creating as many levels of directories as you need. For example, if you have X files (20K in this case) and N directory entries fit in a single block (say 10, to keep the math easy; the real number is larger), you need a structure of depth ceil(log(X) / log(N)), in this case 5. Qmail used this format for its queues.
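
      A minimal sketch of that bucketing scheme in Perl (the parameters are made up for illustration: 16 buckets per level, depth 2, MD5 as the hash):

          use strict;
          use warnings;
          use Digest::MD5 qw(md5_hex);
          use File::Path qw(make_path);

          my $buckets = 16;   # entries per level (N)
          my $depth   = 2;    # ceil(log(X)/log(N)) levels for X files

          # map a name to its bucket path, e.g. "11/4"
          sub bucket_path {
              my ($name) = @_;
              my $hash = md5_hex($name);
              my @parts;
              for my $level (0 .. $depth - 1) {
                  # one hex digit of the digest per level: 0..15
                  push @parts, hex(substr($hash, $level, 1)) % $buckets;
              }
              return join('/', @parts);
          }

          my $sub = bucket_path("some_folder_name");
          make_path("/path_to_nas/data/$sub");   # create the buckets on demand
          print "$sub\n";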

      --MidLifeXis

Re: Get folders in a directory with 20k directories?
by i5513 (Pilgrim) on Aug 25, 2011 at 14:46 UTC

    Try:

        ls -f

    Does that make a difference? The -f flag tells ls not to sort its output; I have heard it helps, but I have never tested whether that is true.

    Another option could be

        find . -type d

    and redirect the output to a file, because writing to the console is slower than a redirect. (Since you only want the top-level folders, GNU find's -maxdepth 1 would also keep it from descending into every one of them.)

Re: Get folders in a directory with 20k directories?
by onelesd (Pilgrim) on Aug 25, 2011 at 18:46 UTC

    Use find. Then you can do something else with the directories using xargs; plain xargs runs one batch at a time, but with GNU or BSD xargs the -n and -P options get you parallel processing. Just have perlscript.pl process @ARGV.

        find /root/directory -type d | xargs -n 100 -P 4 perlscript.pl
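
    A stub of the hypothetical perlscript.pl: xargs passes the directory names as arguments, so it only has to walk @ARGV.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # each argument is one directory name from find
        for my $dir (@ARGV) {
            print "processing $dir\n";   # replace with the real work
        }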