PerlMonks  

Automatically distributing and finding files in subdirectories

by myuserid7 (Scribe)
on Jul 18, 2006 at 00:05 UTC (#561893=perlquestion)

myuserid7 has asked for the wisdom of the Perl Monks concerning the following question:

I made a bad architecture choice and have been cramming tens of thousands of files into one directory. FreeBSD was fine, but the unintended host platform of Linux/NFS locks up completely when I traverse the files with Perl.

I'm assuming I'll need to split into directories based on some criteria (probably the last few digits of each file).

My question: what is the most transparent way to do this? I don't suppose there is any module out there that creates directories and distributes files while sitting underneath Perl's native file functions?


Replies are listed 'Best First'.
Re: Automatically distributing and finding files in subdirectories
by xdg (Monsignor) on Jul 18, 2006 at 01:25 UTC

    I don't know of a module that does it and didn't see one in a quick search at search.cpan.org, but it's not that hard to split them up manually.

    This fragment moves a file in the current directory to a two-level subdirectory based on the first and first-and-second letters. E.g.: README.txt goes to R/RE/README.txt.

    use strict;
    use warnings;
    use File::Spec;
    use File::Path qw/mkpath/;

    my $file   = shift;
    my $first  = substr( $file, 0, 1 );
    my $second = substr( $file, 0, 2 );
    my $path   = File::Spec->catdir( $first, $second );
    mkpath( $path );
    rename $file, File::Spec->catfile( $path, $file )
        or die "Can't move $file: $!";

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Automatically distributing and finding files in subdirectories
by Fletch (Bishop) on Jul 18, 2006 at 02:40 UTC

    When we hit a similar problem at work what I did was implement a module that took a base directory name and created two layers of 00..ff subdirectories (so 00/00, 00/01, ..., ff/fe, ff/ff). It then had a function that would take a "path" and return a real OS path for that file generated by running the filename through Digest::MD5 and splitting off the first pairs of hex digits. So given the base directory /tmp/hashed, the file fooble would get located in /tmp/hashed/03/63/03638a39d7858a61a982a1f21b33c215.

    All of your calls to open should pass through the hashing function, or you can make a hashed_open that you call instead. If that still leaves too many files in any one subdirectory (although it should split them so there's only 1/64k files in any one directory) you can add another layer of subdirs.
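    Fletch's module itself isn't shown; here is a minimal sketch of the scheme he describes, assuming Digest::MD5 (the names hashed_path and hashed_open are made up for illustration, not from his code):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path qw(mkpath);
use File::Spec;

# Map a logical "path" to a real OS path under $base, using the first
# two pairs of hex digits of its MD5 as the two subdirectory levels.
sub hashed_path {
    my ($base, $name) = @_;
    my $digest = md5_hex($name);
    my ($d1, $d2) = ( substr($digest, 0, 2), substr($digest, 2, 2) );
    return File::Spec->catfile($base, $d1, $d2, $digest);
}

# Open a file via its hashed location, creating the subdirs on demand.
sub hashed_open {
    my ($base, $name, $mode) = @_;
    my $real = hashed_path($base, $name);
    my ($vol, $dir, undef) = File::Spec->splitpath($real);
    mkpath($dir);
    open my $fh, $mode, $real or die "Can't open $real: $!";
    return $fh;
}
```

    With the base directory /tmp/hashed, hashed_path('/tmp/hashed', 'fooble') gives /tmp/hashed/03/63/03638a39d7858a61a982a1f21b33c215, matching the example above. Note this sketch stores the file under its digest rather than its original name, as Fletch's example path suggests.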

      If I understand your process correctly, wouldn't there be at least the slightest little worry that two different (original) file names would generate the same MD5 hash?

      I suppose that if you just make a list of the file names and their md5 sigs first, you could spot collisions before actually moving stuff into the new directory structure. But if you have to add files to the structure over time, you need to check for the existence of a given md5 "path/name" before storing a new file there (and then figure out a proper way to avoid collisions while maintaining correct mappings between original and hashed names).

        Correct. If that 1 in 2^128 possibility bothers you there's always Digest::SHA1 for 1 in 2^160. Or keep a DBM mapping of "path" to hash and check for collisions when adding a new "path" entry.

        Update: Left out the chance of collision for SHA1.
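        The DBM bookkeeping mentioned above might look something like this sketch (SDBM_File is used here because it ships with Perl; the function name register_path is made up):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Fcntl;
use SDBM_File;

# Tie a DBM file mapping digest => original "path", so a collision is
# caught before a new file is stored under an already-used digest.
tie my %seen, 'SDBM_File', 'digests', O_RDWR | O_CREAT, 0644
    or die "Can't tie digests: $!";

sub register_path {
    my ($path) = @_;
    my $digest = md5_hex($path);
    if ( exists $seen{$digest} && $seen{$digest} ne $path ) {
        die "MD5 collision: '$path' vs '$seen{$digest}'";
    }
    $seen{$digest} = $path;
    return $digest;
}
```

        Re-registering the same path is harmless (same digest, same entry); only two *different* paths hashing to one digest triggers the die.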

Re: Automatically distributing and finding files in subdirectories
by rodion (Chaplain) on Jul 18, 2006 at 02:05 UTC
    This is a case where it's a good idea to use the shell, which is well optimized at handling directories quickly, and has little overhead if you only invoke it once. The shell command "ls -1" can give you a list of files to work with (or "ls -1t" if you want them in chronological order, as you probably know). You can do this outside Perl, or inside Perl with a piped file open or with backticks.

    Then you process the text output of ls with perl, which is good at handling text, and also has the tools to make directories and move files.

    Here's an example which moves the files to sub-directories, putting a fixed number of files (100) in each sub-directory.

    my $path     = '/var/local/path/to/files';
    my $subdir   = 'lower0';
    my $file_cnt = 0;
    my @files = `find $path`;     # backticks capture output; system() can't
    die "find failed with $?" if $?;
    while ( my $name = shift @files ) {
        chomp $name;
        $name =~ s{^\Q$path\E/?}{};   # find prints full paths; strip the prefix
        next unless length $name;     # skip the top directory entry itself
        unless ( $file_cnt++ % 100 ) {    # start a new subdir every 100 files
            $subdir++;                    # magic string increment: lower0 -> lower1
            mkdir "$path/$subdir" or die "mkdir $path/$subdir failed: $!";
        }
        rename "$path/$name", "$path/$subdir/$name"
            or die "Can't move $name: $!";
    }
    # untested (I'm not near a unix box)
    This is pretty transparent in my book (as you requested). (Assuming you're familiar with the Perl string-increment magic used in "$subdir++". If not, it's easy to split that into a string and a $dir_cnt++, and concatenate them.)

    Update: Changed 'system "ls -l $path"' to 'system "find $path"', thanks to shmem's reminder of the lstat overhead in "ls".

    Now that I have access to Linux and BSDI boxes again, I did some timings on each. 'find' is about the same speed as opendir(). 'ls -1' takes 60-100% longer, depending on the system.

    I prefer the 'system "find $path"' version for clarity, but I was totally wrong about the speed advantages. Thanks to graff and shmem for catching what I forgot.

      This is a case where it's a good idea to use the shell, which is well optimized...

      Not really. I've done a simple benchmark on a pretty big directory, and running a perl one-liner with opendir/readdir compares favorably to running "ls".

      In addition, using Perl's built-in "rename" function on a long list of files in order to relocate them within a directory tree will certainly be faster than running "mv" repeatedly in any kind of script: with Perl, you have one process making lots of calls to an OS-level library function, whereas each "mv" command is a separate process being started and shut down.

      Here's my simple-minded benchmark for two ways of listing files in a directory, on a path I happened to find in my local nfs space with over 100 K files:

      $ time ls /long/path/to/bigdir | wc -l
      101645

      real    0m0.463s
      user    0m0.421s
      sys     0m0.078s

      $ time perl -e 'opendir(D,"/long/path/to/bigdir"); @f=readdir D; print "$_\n" for (@f)' | wc -l
      101647

      real    0m0.156s
      user    0m0.124s
      sys     0m0.070s
      (The two extra files found by perl are "." and "..", of course. When I change it to @f=grep/\w/,readdir D to get the same list as "ls", the perl one-liner still shows up as faster. Go figure.)

      I ran those a few times each, varying the order of the two command lines to eliminate OS caching as a factor, and the results shown above were consistent. This was actually kind of surprising to me, because I always thought that basic "ls" on a directory was pretty quick ("ls -l" is a lot slower, and there's no point using the "-l" option in the OP's case).

      Note that the results would be seriously reversed if the perl script used File::Find instead of opendir/readdir.

      In terms of ease and flexibility of scripting, ease of reading the resulting script, and overall speed of execution, I believe pure Perl wins over the shell for cases like this.

      UPDATE: Thanks to a comment from Hue-Bond, I see now that there was a flaw in my benchmark: "ls" sorts its output by default. If I run it as "ls -U" to skip the sort, it wins easily over the perl one-liner. It's interesting that if I add "sort" to the perl script, to make it match the default "ls" output, it's still faster than "ls".

      ls is going to be calling the same underlying system calls that perl's builtin opendir and readdir are going to call. You're actually going to incur more overhead in the fork-and-exec than just calling them directly.

      The big problem with Linux and directory listing is the expensive lstat call done by ls. I had a case where listing took minutes; the directory contained somewhat over 300,000 files.

      The cheapest way to get at the files at the system level is find $dir which doesn't call lstat.

      --shmem

        Dohhh! Thanks for bringing up the lstat in ls. I'll update the previous post so no one gets misled.
