I don't know of a module that does it and didn't see one in a quick search at search.cpan.org, but it's not that hard to split them up manually.
This fragment moves a file in the current directory to a two-level subdirectory based on the first and first-and-second letters. E.g.: README.txt goes to R/RE/README.txt.
use strict;
use warnings;
use File::Spec;
use File::Path qw/mkpath/;
my $file = shift;
my $first = substr( $file, 0, 1 );
my $second = substr( $file, 0, 2 );
my $path = File::Spec->catdir( $first, $second );
mkpath( $path );
rename $file, File::Spec->catfile( $path, $file )
    or die "Can't move $file: $!";
-xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
When we hit a similar problem at work what I did was implement a module that took a base directory name and created two layers of 00..ff subdirectories (so 00/00, 00/01, ..., ff/fe, ff/ff). It then had a function that would take a "path" and return a real OS path for that file generated by running the filename through Digest::MD5 and splitting off the first pairs of hex digits. So given the base directory /tmp/hashed, the file fooble would get located in /tmp/hashed/03/63/03638a39d7858a61a982a1f21b33c215.
All of your calls to open should pass through the hashing function, or you can make a hashed_open that you call instead. If that still leaves too many files in any one subdirectory (although it should spread them so that only 1/65536 of the files land in any one directory), you can add another layer of subdirs.
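A minimal sketch of that scheme, assuming two levels of hex-pair buckets; the function name hashed_path and the mkpath call are my additions, not necessarily how the module at work did it:

```perl
use strict;
use warnings;
use Digest::MD5 qw/md5_hex/;
use File::Spec;
use File::Path qw/mkpath/;

# Map a logical "path" to a real OS path under $base, bucketed by the
# first two pairs of hex digits of its MD5 digest.
sub hashed_path {
    my ( $base, $name ) = @_;
    my $digest = md5_hex($name);
    my $dir    = File::Spec->catdir(
        $base,
        substr( $digest, 0, 2 ),
        substr( $digest, 2, 2 ),
    );
    mkpath($dir);    # idempotent; creates the bucket on first use
    return File::Spec->catfile( $dir, $digest );
}
```

Per the post above, given the base directory /tmp/hashed, the file fooble would land somewhere like /tmp/hashed/03/63/03638a39d7858a61a982a1f21b33c215.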
If I understand your process correctly, wouldn't there be at least the slightest little worry that two different (original) file names would generate the same MD5 hash?
I suppose that if you just make a list of the file names and their md5 sigs first, you could spot collisions before actually moving stuff into the new directory structure. But if you have to add files to the structure over time, you need to check for the existence of a given md5 "path/name" before storing a new file there (and then figure out a proper way to avoid collisions while maintaining correct mappings between original and hashed names).
Correct. If that 1 in 2**128 possibility bothers you, there's always Digest::SHA1 for 1 in 2**160. Or keep a DBM mapping of "path" to hash and check for collisions when adding a new "path" entry.
Update: Left out the chance of collision for SHA1.
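A sketch of that DBM mapping, using the core SDBM_File module; the add_path name and the die-on-collision policy are my choices, not something from the post:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use Digest::MD5 qw/md5_hex/;

# Tie a DBM file mapping digest -> original "path", so a collision is
# caught the moment a second, different path hashes to the same digest.
tie my %seen, 'SDBM_File', 'hash_map', O_RDWR | O_CREAT, 0666
    or die "Can't tie DBM file: $!";

sub add_path {
    my ($path) = @_;
    my $digest = md5_hex($path);
    if ( exists $seen{$digest} && $seen{$digest} ne $path ) {
        die "MD5 collision: '$path' vs '$seen{$digest}'";
    }
    $seen{$digest} = $path;
    return $digest;
}
```

Re-adding the same path is harmless; only two distinct paths with the same digest trigger the die.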
This is a case where it's a good idea to use the shell, which is well optimized for handling directories, and has little overhead if you only invoke it once. The shell command "ls -1" can give you a list of files to work with (or "ls -1t" if you want them in chronological order, as you probably know). You can do this outside perl, or inside perl with a piped file open or with a system() call.
Then you process the text output of ls with perl, which is good at handling text, and also has the tools to make directories and move files.
Here's an example which moves the files to sub-directories, putting a fixed number of files (100) in each sub-directory.
use File::Basename qw/basename/;

my $path = '/var/local/path/to/files';
my $subdir = 'lower0';
my $file_cnt = 0;

# backticks (not system) are needed to capture find's output
my @files = `find $path -maxdepth 1 -type f`;
die "find failed with $?" if $?;

while (my $file = shift @files) {
    chomp $file;
    my $name = basename($file);
    unless ($file_cnt % 100) {      # start a fresh subdir every 100 files
        $subdir++ if $file_cnt;     # magic string increment: lower0 -> lower1 ...
        mkdir "$path/$subdir" or die "mkdir $path/$subdir: $!";
    }
    rename $file, "$path/$subdir/$name"
        or warn "rename $name: $!";
    $file_cnt++;
}
# untested (I'm not near a unix box)
This is pretty transparent in my book (as you requested). (Assuming you're familiar with the perl increment magic used in the "$subdir++". If not, it's easy to split that into a string and a $dir_cnt++, and concatenate them.)
Update: Changed 'system "ls -l $path"' to 'system "find $path"', thanks to shmem's reminder of the lstat overhead in "ls".
Now that I have access to Linux and BSDI boxes again, I did some timings on each. 'find' is about the same speed as opendir(). 'ls -1' takes 60-100% longer, depending on the system.
I prefer the 'system "find $path"' version for clarity, but I was totally wrong about the speed advantages. Thanks to graff and shmem for catching what I forgot.
This is a case where it's a good idea to use the shell, which is well optimized...
Not really. I've done a simple benchmark on a pretty big directory, and running a perl one-liner with opendir/readdir compares favorably to running "ls".
In addition, using perl's built-in "rename" function on a long list of files in order to relocate them within a directory tree will certainly be faster than running "mv" repeatedly in any kind of script, because with perl you have one process making lots of calls to an OS-level library function, whereas each "mv" command is a separate process being started and shut down.
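As an illustration of that one-process approach (the directory names and the relocate_all helper are invented for the example, and note that rename only works within a single filesystem; File::Copy's move falls back to copy-and-unlink otherwise):

```perl
use strict;
use warnings;

# Move every plain file from $src into $dest with one long-lived
# process calling rename(), instead of forking an "mv" per file.
sub relocate_all {
    my ( $src, $dest ) = @_;
    opendir my $dh, $src or die "opendir $src: $!";
    my $moved = 0;
    for my $name ( readdir $dh ) {
        next unless -f "$src/$name";    # skips ".", "..", subdirs
        rename "$src/$name", "$dest/$name"
            or warn "rename $name: $!";
        $moved++;
    }
    closedir $dh;
    return $moved;
}
```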
Here's my simple-minded benchmark for two ways of listing files in a directory, on a path I happened to find in my local nfs space with over 100 K files:
$ time ls /long/path/to/bigdir | wc -l
101645
real 0m0.463s
user 0m0.421s
sys 0m0.078s
$ time perl -e 'opendir(D,"/long/path/to/bigdir");
@f=readdir D; print "$_\n" for (@f)' | wc -l
101647
real 0m0.156s
user 0m0.124s
sys 0m0.070s
(The two extra files found by perl are "." and "..", of course. When I change it to @f=grep/\w/,readdir D to get the same list as "ls", the perl one-liner still shows up as faster. Go figure.)
I ran those a few times each, varying the order of the two command lines to eliminate OS caching as a factor, and the results shown above were consistent. This was actually kind of surprising to me, because I always thought that basic "ls" on a directory was pretty quick ("ls -l" is a lot slower, and there's no point using the "-l" option in the OP's case).
Note that the results would be seriously reversed if the perl script used File::Find instead of opendir/readdir.
In terms of ease and flexibility of scripting, ease of reading the resulting script, and overall speed of execution, I believe pure Perl wins over the shell for cases like this.
UPDATE: Thanks to a comment from Hue-Bond, I see now that there was a flaw in my benchmark: "ls" sorts its output by default. If I run it as "ls -U" to skip the sort, it wins easily over the perl one-liner. It's interesting that if I add "sort" to the perl script, to make it match the default "ls" output, it's still faster than "ls".
Dohhh! Thanks for bringing up the lstat in ls. I'll update the previous post so no one gets misled.