Re (tilly) 1: I'm falling asleep here
by tilly (Archbishop) on Oct 21, 2001 at 05:02 UTC
Welcome to a real world demonstration of why algorithm efficiency matters. :-)
The likely reason stat is so slow is that each stat has to traverse the entire directory linearly to find the file you are interested in. That means you are going back to the directory structure 10,000 times, each time scanning on average about 5,000 entries to find the file of interest. The resulting 50,000,000 entry lookups are what take the time.
You can verify this by printing every 100th filename (a rough sketch follows). You should see a progressive slowdown as each stat has to scan past more and more earlier entries.
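Assuming the directory path lives in a hypothetical $dir (Time::HiRes supplies the timer), something like this would show it:

use Time::HiRes qw(time);

my $dir = 'C:/logs';            # hypothetical path; substitute your own
opendir my $dh, $dir or die "Can't opendir $dir: $!";
my ($n, $t0) = (0, time);
while (defined(my $f = readdir $dh)) {
    my $size = -s "$dir/$f";    # the per-file stat under test
    printf "%6d files, %7.2f s elapsed (%s)\n", $n, time - $t0, $f
        unless ++$n % 100;      # report every 100th file
}
closedir $dh;

If the linear-scan theory is right, the gap between reports should keep growing.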
You would speed this up by a factor of 10 if you arranged to have 10 directories of 1000 files each. You would speed up by a much larger factor with a filesystem which was designed to handle directories with many small files. (Think ReiserFS on Linux.)
The ideal answer, of course, would be to have your direct pass through the directory structure pull back not just the name, but also the associated metadata. That is what ls and dir do, and it is why they are so much faster. Unfortunately Perl's API doesn't give you direct access to that information.
An incidental note. Using Perl does not guarantee portable code. For instance your use of lc will cause you porting problems on operating systems with case-sensitive filesystems (like Unix). (You will be statting a different file than you saw, in fact likely one that isn't there.)
$ truss ls -l |& grep stat
...
lstat("Makefile",0x809d24c) = 0 (0x0)
lstat("cmp.c",0x809d348) = 0 (0x0)
lstat("extern.h",0x809d44c) = 0 (0x0)
lstat("ls.1",0x809d548) = 0 (0x0)
...
The output will look very similar on other OSen as well. Depending on the implementation of ls, it's either calling opendir and stat itself, or it's using a library (fts(3) on FreeBSD) to traverse things. But underneath they're all doing the same thing and making the same system calls.
You are correct that many Unix filesystems don't cope well with large numbers of entries, since locating something within a given directory is a linear scan (more than a few hundred entries may show performance problems; more than a thousand almost certainly will).
A solution to this (if you can make changes to the way things are structured on disk but can't change to a fancier filesystem) is to use the initial characters of each entry as a key and have a second-level hierarchy of subdirectories for each key. For example, if your files are named with hexadecimal numbers, you might have a top-level directory with subdirectories '00' and '01' through 'fe', 'ff'. A file named 'ace0beef' would be located in 'toplevel/ac/ace0beef'. If those first directories start getting crowded, just add another layer of subdirectories underneath each of those. As long as you abstract out the mapping, you can change the underlying storage hierarchy without having to alter your programs.
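A minimal sketch of that mapping (path_for is a hypothetical helper; the two-character key and the 'toplevel' prefix are just placeholders):

# Map a name to its bucketed location in the two-level hierarchy.
sub path_for {
    my ($name) = @_;
    my $key = substr($name, 0, 2);   # first two characters pick the subdir
    return "toplevel/$key/$name";
}

print path_for('ace0beef'), "\n";    # prints "toplevel/ac/ace0beef"

Since callers only ever go through path_for, you can later deepen the hierarchy without touching them.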
The example code was running on Windows, not Unix.
However your other points are true. I admit that I am guessing as to how the Windows dir function is running so much faster than the simple Perl shown.
But one note. It may be that your filenames don't divide well based on the first few characters (so one directory gets a ton of files while the rest do not). In that case the above scheme can be improved by first taking an MD5 hash of the filename, and then placing files into directory locations depending on the characters in the MD5 hash.
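A sketch of that variant, assuming the Digest::MD5 module is available (the helper name and the two-level split are arbitrary choices):

use Digest::MD5 qw(md5_hex);

# Bucket by a hash of the name so files spread evenly even when the
# names themselves cluster around a few prefixes.
sub hashed_path_for {
    my ($name) = @_;
    my $hex = md5_hex($name);        # 32 hex digits, well distributed
    return join '/', 'toplevel', substr($hex, 0, 2), substr($hex, 2, 2), $name;
}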
(-:At which point your on-disk data storage is starting to be a frozen form of efficient data structures you might learn about in an algorithms course...:-)
I would have realised that lc was out of place when my code would fail at university tomorrow morning. =)
So it looks like parsing until I can write an XS module for it. Thanks very much for the explanation.
Re: I'm falling asleep here
by VSarkiss (Monsignor) on Oct 21, 2001 at 04:24 UTC
A couple of things to try. First, it's not clear from your snippet whether you're concatenating the directory to each file or not. Given a large directory, you're better off doing a chdir there, then opening and reading from ".", because each stat doesn't have to traverse the path. (It's probably cached, depending on your kernel, but this way it won't even matter.) As I said, maybe you're already doing this, but it's not clear from the code.
Another thing is to separate reading the directory from stat'ing it. That is, slurp the directory contents in one go, then traverse the array. This will eat up a lot more memory, but it may be a worthwhile tradeoff (you'll have to make that call). Combining those, your loop would look something like this:
my $dir = '/some/where/';
chdir $dir or die "Can't chdir $dir\n";
opendir D, '.' or die "Can't opendir '.'\n";
my @files = readdir(D);
closedir D;
my %filesize;
foreach my $f (@files)
{
    $filesize{$f} = -s $f;
}
None of these is guaranteed to make it faster; that depends more on your underlying operating system. But they may be worth trying.
HTH
Re: I'm falling asleep here
by demerphq (Chancellor) on Oct 21, 2001 at 17:15 UTC
It's the -s bit that slows down the program, the stat takes about 5 mins to execute on 10,000 files, while 'ls' throws a listing up in less than 10 secs, and similar performance for Windows 'dir'.
There is something weird about all of this. I did a few tests on a 700MHz W2K box and averaged around 3 seconds to stat 20,000 files in a local directory. Doing the readdir into an array had a negligible effect on performance. I also compared against a backticked dir command (I didn't parse it), which took 1 second.
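For anyone who wants to repeat the comparison, a rough sketch (the directory path is a placeholder, and the backticked dir is captured but not parsed, as above):

use Benchmark qw(timethese);

my $dir = 'C:/test/logs';    # placeholder for a local directory of test files

timethese(5, {
    stat_loop => sub {
        opendir my $dh, $dir or die "Can't opendir $dir: $!";
        my %size;
        $size{$_} = -s "$dir/$_" for readdir $dh;
        closedir $dh;
    },
    dir_cmd => sub {
        my @lines = `dir "$dir"`;    # captured, not parsed
    },
});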
My only idea as to why you are seeing the performance you are is that you are going over a network, but you already said the directories were local, so there must be something else going on.
Incidentally I wrote an MD5 dupe checker once and it usually ran over 40k small files spread through a flat directory tree in a matter of minutes, including unlinks.
Yves
--
You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)
Re: I'm falling asleep here
by virtualsue (Vicar) on Oct 21, 2001 at 19:21 UTC
VSarkiss's suggestion to separate the directory read from the file stat allows you to be more perlish, though it won't solve your ultimate problem (molasses-like run times on your PC). The following idiom (adapted from an example in the Camel) puts the names of all files called '*.log' or '*.LOG'¹ in DIR into the @logfiles array:
my @logfiles = grep { $_ =~ /\.log$/i } readdir DIR;
You should use the /i switch in cases like this rather than 'lc', since lc changes the filename that you read, and that's not really what you want (as tilly pointed out).
What's the spec of the PC that you're using? On our oldest, slowest PC (P233/64MB, Win98) this operation takes 7 minutes, compared to just under 5 minutes on a newer but not super slick PC (Celeron 500/32MB, WinMe), accessing the files over our LAN (3 minutes when the files were moved to a local directory). By contrast, this runs in less than a second on a Sun Ultra 5...
¹(Update) It will also pick up files called *.Log, *.LOg, *.lOg, etc. I assume that's probably not a big deal.
Re: I'm falling asleep here
by uwevoelker (Pilgrim) on Oct 21, 2001 at 21:37 UTC
foreach my $file (glob('*.log')) {
...
}
Maybe it is faster.
Unfortunately, it isn't. It's actually a little slower on my desktop Sun and horrendous on my older Win98 PC. On the latter it processed 10,000 files in 13 minutes, which is nearly twice as slow as the original program (7 mins).
The similar construct while (<*.log>) was even slower on both platforms (Sun/old PC).
And here I was thinking that this wasn't a Perl problem. It isn't, really, but it is apparently always possible to make a bad situation worse. :)
Re: -s takes too long on 15,000 files
by Aristotle (Chancellor) on Oct 23, 2001 at 06:51 UTC
local $_;
for (readdir DIR) {
    $filesize{$_} = -s if /\.log$/i;
}
But that buys you nothing.
I can't imagine any reason for the stat to be so slow; that performance sounds more like it's open()ing the files than stat()ing them. (Imagine a 10 minute break here as I'm posting.) Running out of straws to grasp... have you tried explicitly stat()ing the files and picking up the size value instead? As in:
local $_;
for (readdir DIR) {
    $filesize{$_} = (stat($_))[7] if /\.log$/i;
}
Chances are (99.9%) it won't make a difference... but...
Nope, it says in the docs that nearly all the -X tests call stat anyway, and they're damn right too! I tried it both ways.
Re: -s takes too long on 15,000 files
by demerphq (Chancellor) on Feb 27, 2006 at 17:15 UTC
The answer to this turns out to be that support for some rarely required functionality meant that Perl's stat had to open and then close each one of those files. Later perls will/should support ${^WIN32_SLOPPY_STAT} = 1;, which disables this behaviour.
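A sketch of how that would look in a script, once you have a perl that knows about the variable (on builds that don't, setting it should simply have no effect):

# Skip the extra open/close that Win32's stat otherwise performs.
${^WIN32_SLOPPY_STAT} = 1;

opendir my $dh, '.' or die "Can't opendir '.': $!";
my %filesize;
$filesize{$_} = -s $_ for grep { /\.log$/i } readdir $dh;
closedir $dh;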
Sorry about how long it took to figure it out. :-)
---
$world=~s/war/peace/g
The POSIX subsystem on NT/Win2k allows the creation of hardlinks to files if the user doing the creation has Backup/Restore privs. Jan Dubois put a patch into Win32 perl that allows perlfunc:link to create hard links using this feature. (That this is possible is poorly and confusingly documented, both by Microsoft and by ActiveState/Perl. For the latter you can find it documented in perlport.)
One drawback of this, however, is that various attributes somehow don't propagate to the linked file unless it's opened first (a caching bug, apparently). So in order to cater for the 0.00000000000001 percent of the files out there that are hardlinks on Win32, perl was made to open and close every file it stats on that OS. :-(
Anyway, 5.10 will support the new var (which I had the honor of naming), which will disable this behaviour. And so, hopefully, will the next 5.8.x release. Along with the newish sitecustomize.pl support, you can set this variable to true for your perl and bypass this behaviour if you know it won't be a problem.
This should mean some significant speedups. According to Jan (who quite naturally beat me to an implementation for this):
The STRICT test uses 30-50% more wall clock time on a local NTFS system, 100% more against a Samba server, and about 200% more against a Windows file server. This is very approximate, as times even on the local disk vary widely.
There are also JUNCTIONS on later NT/Win32 versions, but I don't think they cause stat a problem.
---
$world=~s/war/peace/g