Re: Automatically distributing and finding files in subdirectories

by rodion (Chaplain)
on Jul 18, 2006 at 02:05 UTC


in reply to Automatically distributing and finding files in subdirectories

This is a case where it's a good idea to use the shell, which is well optimized for scanning directories quickly and has little overhead if you only invoke it once. The shell command "ls -1" can give you a list of files to work with (or "ls -1t" if you want them in chronological order, as you probably know). You can do this outside perl, or inside perl with a piped file open or a system() call.
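
For instance, a minimal sketch of the piped-open approach (the path is just a placeholder):

open my $ls, '-|', 'ls', '-1t', '/var/local/path/to/files'
    or die "cannot run ls: $!";
while (my $name = <$ls>) {
    chomp $name;
    # ... process $name here ...
}
close $ls;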

Then you process the text output of ls with perl, which is good at handling text, and also has the tools to make directories and move files.

Here's an example which moves the files to sub-directories, putting a fixed number of files (100) in each sub-directory.

my $path     = '/var/local/path/to/files';
my $subdir   = 'lower0';
my $file_cnt = 0;

# backticks capture the output; system() only returns an exit status
my @files = `find $path` or die "find failed with $?";

while (my $name = shift @files) {
    chomp $name;
    next if $name eq $path;                 # find lists the directory itself first
    $name =~ s{^.*/}{};                     # find prints full paths; keep the basename
    mkdir "$path/$subdir" unless $file_cnt % 100;   # new sub-directory every 100 files
    rename "$path/$name", "$path/$subdir/$name"
        or warn "could not move $name: $!";
    $subdir++ unless ++$file_cnt % 100;     # perl's magic string increment: lower0 -> lower1
}
# untested (I'm not near a unix box)
This is pretty transparent in my book (as you requested). (Assuming you're familiar with the perl increment magic used in "$subdir++". If not, it's easy to split that into a string and a separate $dir_cnt++, and concatenate them.)
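
For reference, a quick demonstration of that string auto-increment; note the gotcha once the digit wraps, because the carry propagates into the letters:

my $s = 'lower0';
$s++;            # now 'lower1'
$s = 'lower9';
$s++;            # 'lowes0', not 'lower10'!

So if you expect more than ten sub-directories, the explicit $dir_cnt++ version sidesteps that surprise.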

Update: Changed 'ls -1 $path' to 'find $path' in the code above, thanks to shmem's reminder of the lstat overhead in "ls".

Now that I have access to Linux and BSDI boxes again, I did some timings on each. 'find' is about the same speed as opendir(). 'ls -1' takes 60-100% longer, depending on the system.

I prefer the 'find $path' version for clarity, but I was totally wrong about the speed advantages. Thanks to graff and shmem for catching what I forgot.


Replies are listed 'Best First'.
Re^2: Automatically distributing and finding files in subdirectories
by graff (Chancellor) on Jul 18, 2006 at 03:13 UTC
    This is a case where it's a good idea to use the shell, which is well optimized...

    Not really. I've done a simple benchmark on a pretty big directory, and running a perl one-liner with opendir/readdir compares favorably to running "ls".

    In addition, using perl's built-in "rename" function on a long list of files in order to relocate them within a directory tree will certainly be faster than running "mv" repeatedly in any kind of script, because with perl you have one process making lots of calls to an OS-level library function, whereas each "mv" command is a separate process being started and shut down.
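
    To illustrate the contrast (a rough sketch; $dir, @files, and the sub-directory name are placeholders):

    # one process issuing many cheap rename() calls
    rename "$dir/$_", "$dir/sub/$_" for @files;

    # versus one fork-and-exec of /bin/mv per file
    system 'mv', "$dir/$_", "$dir/sub/$_" for @files;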

    Here's my simple-minded benchmark for two ways of listing files in a directory, on a path I happened to find in my local nfs space with over 100K files:

    $ time ls /long/path/to/bigdir | wc -l
    101645

    real    0m0.463s
    user    0m0.421s
    sys     0m0.078s

    $ time perl -e 'opendir(D,"/long/path/to/bigdir"); @f=readdir D; print "$_\n" for (@f)' | wc -l
    101647

    real    0m0.156s
    user    0m0.124s
    sys     0m0.070s
    (The two extra files found by perl are "." and "..", of course. When I change it to @f=grep/\w/,readdir D to get the same list as "ls", the perl one-liner still shows up as faster. Go figure.)

    I ran those a few times each, varying the order of the two command lines to eliminate OS caching as a factor, and the results shown above were consistent. This was actually kind of surprising to me, because I always thought that basic "ls" on a directory was pretty quick ("ls -l" is a lot slower, and there's no point using the "-l" option in the OP's case).

    Note that the results would be seriously reversed if the perl script used File::Find instead of opendir/readdir.
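
    A rough sketch of the two approaches being compared (same placeholder path as above):

    # flat listing: one opendir and one readdir sweep, no per-file stat
    opendir my $dh, '/long/path/to/bigdir' or die "opendir: $!";
    my @names = readdir $dh;
    closedir $dh;

    # File::Find: walks the whole tree, running the wanted callback (and a stat) per entry
    use File::Find;
    my @found;
    find(sub { push @found, $_ }, '/long/path/to/bigdir');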

    In terms of ease and flexibility of scripting, ease of reading the resulting script, and overall speed of execution, I believe pure Perl wins over the shell for cases like this.

    UPDATE: Thanks to a comment from Hue-Bond, I see now that there was a flaw in my benchmark: "ls" sorts its output by default. If I run it as "ls -U" to skip the sort, it wins easily over the perl one-liner. It's interesting that if I add "sort" to the perl script, to make it match the default "ls" output, it's still faster than "ls".
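
    For reference, the two variants compared in this update look like this (timings omitted here):

    $ time ls -U /long/path/to/bigdir | wc -l
    $ time perl -e 'opendir(D,"/long/path/to/bigdir"); print "$_\n" for sort readdir D' | wc -l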

Re^2: Automatically distributing and finding files in subdirectories
by Fletch (Bishop) on Jul 18, 2006 at 02:32 UTC

    ls calls the same underlying system calls that perl's builtin opendir and readdir use, so you actually incur more overhead from the fork-and-exec than from just calling them directly.
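
    That is, the direct calls look like this (a minimal sketch, with $path a placeholder, filtering the dot entries to match what ls prints):

    opendir my $dh, $path or die "opendir: $!";
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;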

Re^2: Automatically distributing and finding files in subdirectories
by shmem (Chancellor) on Jul 18, 2006 at 07:42 UTC
    The big problem with linux and directory listing is the expensive lstat call done by ls. I had a case where listing took minutes; the directory contained somewhat over 300,000 files.

    The cheapest way to get at the files at the system level is find $dir, which doesn't call lstat.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
  Dohhh! Thanks for bringing up the lstat in ls. I'll update the previous post so no one gets misled.
