Here's a piece of kit from my personal toolbox. I'm a Unix sysadmin, and I sometimes run into directories with millions of files - such as mail spools clogged with spam. The standard system tools regularly choke on these. I've used these scripts with great success; at one site I used them to clean up spam-infested mail queues when the built-in tools of a major commercial MTA weren't fast enough.

Sometimes the only thing of interest is the file count:

use strict;
use warnings;

my $dir = shift;
die "Usage: $0 directory" unless defined $dir;

opendir DIR, $dir or die "Could not open $dir: $!\n";
my @files = readdir DIR;
print $#files + 1, "\n";
closedir DIR;

Sometimes you need to search for files matching some criteria and do something to them. Here's a script that, given a directory and a regexp, searches for files by name:
use strict;
use warnings;

my $dir = shift;
my $criteria = shift;
$criteria = "" unless defined $criteria;
die "Usage: $0 directory [regexp]" unless defined $dir;

opendir DIR, $dir or die "Could not open $dir: $!\n";
my @files = grep(/$criteria/, readdir DIR);
closedir DIR;
print $#files + 1, " files\n";

chdir $dir or die "Could not chdir to $dir: $!\n";
foreach my $file (@files) {
    # actions go here
}
The logic here is that this only picks the relevant files out of the directory array, which is a reasonably fast operation even with a million-element array, and then only touches the files in the result set. So if you have a directory with a million files and there's one file you want to know about, you don't have to stat or otherwise poke at the other files on disk going "no, that's not it..."
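For illustration, here's the sort of thing that tends to end up in the actions block - in the spam-cleanup case it was essentially an unlink. The one-day age cutoff below is just an example, not part of the script above; adjust to taste:

foreach my $file (@files) {
    next unless -f $file;     # skip anything that isn't a plain file
    next unless -M _ > 1;     # -M: age in days; _ reuses the stat from -f
    unlink $file or warn "Could not unlink $file: $!\n";
}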

You'll notice a certain tool-like air about these scripts, as if they were covered in grease, scratched and dinged from being thrown about in a toolbox. That is exactly the case. They get copied or rewritten every so often onto random systems that don't necessarily have any CPAN modules or any practical way to install them in a reasonable amount of time - hence the Spartan interface and simple structure. I've written a fancier version with options, lots of features, and nicely formatted output, but I apparently left that behind on a previous employer's server. Dang.

Re: Managing a directory with millions of files
by ruzam (Curate) on Jan 28, 2008 at 00:39 UTC
    Filling an array with millions of files just to get a total count doesn't seem very resource friendly to me. Maybe something more like:
    use strict;
    use warnings;

    my $dir = shift;
    die "Usage: $0 directory" unless defined $dir;

    opendir DIR, $dir or die "Could not open $dir: $!\n";
    my $count = 0;
    $count++ while defined readdir DIR;
    closedir DIR;

    print "$count\n";
    A good tool shouldn't put a strain on low memory environments.
      Touché - but in defense of the OP, his use of a built-in Perl slurpish function in the first algorithm is the foundation upon which the second is built. He also implied he'd hand-typed these in a few times, so the smaller and simpler the better, even if only by a line or a few keywords. And finally, he didn't say the resource being taxed was RAM, but the filesystem: the disk subsystem is generally the weak point in modern servers with multi-gigabyte RAM and fast CPUs.
Re: Managing a directory with millions of files
by KurtSchwind (Chaplain) on Jan 28, 2008 at 14:22 UTC

    I can see some real usefulness in the 2nd script, but not so much in the first.

    I guess I think that perl isn't the right tool to get a file count in a dir.

    ls /dir/path | wc -l
    will accomplish that. I also use find a lot in those situations. For the 2nd application you describe, you get a far more robust regex handler than find can offer by itself, so that's nice.

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
      The trouble with your approach is in the title: millions of files. On most Linux systems I've used, ls will allocate some amount of memory for each file (presumably so it can sort the listing). Perl could be used as a lightweight wrapper for the opendir library function.

      However, as ruzam pointed out, the first script isn't much better than ls in that regard. His approach uses O(1) memory in the number of files, and I've personally used a variation of it in situations similar to jsiren's.
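      (For what it's worth, the variation I've used boils down to a one-liner along these lines - the directory path here is only a placeholder:)

      perl -e 'opendir(my $dh, $ARGV[0]) or die "$!\n"; my $n = 0; $n++ while defined readdir $dh; print "$n\n"' /some/big/directory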

        Hrm. That's an interesting hypothesis.

        Essentially you are saying that the overhead of calling perl and the opendir library is less than the overhead of ls+sort for a large number of files.

        I'm not sure I'm buying that, and I'm not sure how to test the memory usage either. Give me a day or two and I might just benchmark the memory usage of the two techniques. You could be right, but my gut says no; the inode table is pretty efficient at this kind of thing.
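        If I do, the Perl side of the measurement will probably start from a sketch like the one below (Linux-only, since it peeks at /proc/self/status; the script name and mode names are made up). The ls side can be watched separately with something like GNU time or top. Run it twice against the same big directory, once per mode, and compare the Vm lines:

        #!/usr/bin/perl
        # memcheck.pl (hypothetical name) - how much memory does counting a directory cost?
        use strict;
        use warnings;

        my ($mode, $dir) = @ARGV;
        die "Usage: $0 slurp|stream directory\n"
            unless defined $dir and ($mode eq 'slurp' or $mode eq 'stream');

        opendir my $dh, $dir or die "Could not open $dir: $!\n";
        my $count = 0;
        if ($mode eq 'slurp') {
            my @entries = readdir $dh;              # whole listing held in memory
            $count = @entries;
        } else {
            $count++ while defined readdir $dh;     # one entry at a time
        }
        closedir $dh;

        # VmPeak and VmRSS show what this process actually used
        open my $status, '<', '/proc/self/status' or die "No /proc/self/status: $!\n";
        print "$count entries\n";
        print grep { /^Vm(Peak|RSS)/ } <$status>;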

        --
        I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
Re: Managing a directory with millions of files
by jthalhammer (Friar) on Jan 29, 2008 at 05:24 UTC

    Take a peek at ack. For example:

    $> ack -ag file_name_regex some_directory
    This will recursively find all files beneath some_directory/ where the full path =~ /file_name_regex/.
Re: Managing a directory with millions of files
by ohcamacj (Beadle) on Jan 30, 2008 at 10:52 UTC
    I've found ls -f (-f disables sorting and color highlighting) to be much faster than the default ls options under some conditions.
    Some Linux distributions put 'alias ls="ls --some --options"' in /etc/profile (or equivalent), which can cause ls to stat() each file to determine the correct highlighting before displaying anything.