ishk0 has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a log file parser that handles a bunch of small files. I have a few sets of files from various sources and I need to verify that they match, starting by checking the file sizes. The thing is, there are about 15,000 of them, stored locally. Here's a snippet:
while ($file = lc(readdir DIR)) {
    if ($file =~ /\.log$/) {
        $filesize{$file} = -s $file;
    }
}
It's the -s bit that slows down the program: the stats take about 5 minutes to execute on 10,000 files, while 'ls' throws a listing up in less than 10 seconds, and Windows 'dir' performs similarly. I need this Perl to be portable, though, so is there a faster way to get the sizes without having to parse shell commands? And can someone explain why stat is so slow?

Replies are listed 'Best First'.
Re (tilly) 1: I'm falling asleep here
by tilly (Archbishop) on Oct 21, 2001 at 05:02 UTC
    Welcome to a real world demonstration of why algorithm efficiency matters. :-)

    The likely reason why stat is so slow is that each stat has to traverse the entire directory linearly to find the file that you are interested in. That means that you are going back to the directory structure 10,000 times, each time scanning on average about 5,000 entries to find the file of interest. The resulting 50,000,000 fetches of information is what is taking the time.

    You can verify this by printing every 100th filename. You should see the output slow down progressively, because each successive stat has to scan past more and more preceding entries to find its file.
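
    A minimal diagnostic along those lines (a sketch, not code from this thread; it assumes the .log files are in the current directory and that Time::HiRes is available) might be:

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Print elapsed time every 100 stats to watch the per-file cost grow.
    opendir my $dh, '.' or die "Can't opendir '.': $!";
    my $start = time;
    my $count = 0;
    while (defined(my $file = readdir $dh)) {
        next unless $file =~ /\.log$/i;
        my $size = -s $file;    # the call under suspicion
        printf "%6d files statted, %.1f s elapsed\n", $count, time - $start
            unless ++$count % 100;
    }
    closedir $dh;

    If each block of 100 takes noticeably longer than the one before it, the linear directory scan is the likely culprit.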

    You would speed this up by a factor of 10 if you arranged to have 10 directories of 1000 files each. You would speed up by a much larger factor with a filesystem which was designed to handle directories with many small files. (Think ReiserFS on Linux.)

    The ideal answer, of course, would be to have your direct pass through the directory structure pull back not just the name, but also the associated metadata. That is what ls and dir do, and it is why they are so much faster. Unfortunately Perl's API doesn't give you direct access to that information.

    An incidental note. Using Perl does not guarantee portable code. For instance your use of lc will cause you porting problems on operating systems with case-sensitive filesystems (like Unix). (You will be statting a different file than you saw, in fact likely one that isn't there.)

      No, ls does do calls to stat (and/or lstat). Here's a chunk of the output from running truss ls -l . on FreeBSD:

      $ truss ls -l |& grep stat
      ...
      lstat("Makefile",0x809d24c) = 0 (0x0)
      lstat("cmp.c",0x809d348)    = 0 (0x0)
      lstat("extern.h",0x809d44c) = 0 (0x0)
      lstat("ls.1",0x809d548)     = 0 (0x0)
      ...

      The output will look very similar on other OSen as well. Depending on the implementation of ls it's either calling opendir and stat itself, or it's using a library (fts(3) on FreeBSD) to traverse things. But underneath they're all doing the same thing and making the same system calls.

      You are correct that many Unix filesystems don't cope well with large numbers of entries since it is a linear scan to locate something within a given directory (more than a few hundred may show performance problems, something over 1k will more than likely do so).

      A solution to this (if you can make changes to the way things are structured on disk but can't change to a fancier filesystem) is to use the initial characters of each entry as a key and have a second level hierarchy of subdirs for each key. For example if your files are named with hexadecimal numbers, you might have a top level directory with dirs `00' and `01' through `fe', `ff'. A file named `ace0beef' would be located in `toplevel/ac/ace0beef'. If those first directories start getting crowded, just add another layer of subdirs underneath each of those. As long as you abstract out the mapping you can change the underlying storage hierarchy without having to alter your programs.
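
      As a rough illustration of that mapping (a sketch with a made-up helper name; it uses only core modules):

      use strict;
      use warnings;
      use File::Basename qw(dirname);
      use File::Path qw(mkpath);
      use File::Spec;

      # Hypothetical helper: map a flat filename to a two-level bucketed
      # path, e.g. 'ace0beef' -> 'toplevel/ac/ace0beef'.
      sub bucketed_path {
          my ($top, $name) = @_;
          my $bucket = lc substr($name, 0, 2);
          return File::Spec->catfile($top, $bucket, $name);
      }

      my $path = bucketed_path('toplevel', 'ace0beef');
      mkpath(dirname($path));    # create the bucket directory if needed
      print "$path\n";           # toplevel/ac/ace0beef

      As long as callers only ever go through bucketed_path (or whatever you call it), you can add another level of subdirs later without touching the rest of the program.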

        The example code was running on Windows, not Unix.

        However your other points are true. I admit that I am guessing as to how the Windows dir function is running so much faster than the simple Perl shown.

        But one note. It may be that your filenames don't divide well based on the first few characters. (So one bucket directory gets a ton of files while the rest stay nearly empty.) In that case the above scheme can be improved by first taking an MD5 hash of the filename, and then placing files into directory locations depending on the characters in the MD5 hash.

        (-:At which point your on-disk data storage is starting to be a frozen form of efficient data structures you might learn about in an algorithms course...:-)
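
        A hedged sketch of that MD5 variant (again with a made-up helper name; Digest::MD5 is assumed to be available):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);
        use File::Spec;

        # Hypothetical helper: bucket on the first two hex digits of the MD5
        # of the *name*, so skewed filenames still spread over 256 dirs.
        sub hashed_path {
            my ($top, $name) = @_;
            my $bucket = substr(md5_hex($name), 0, 2);
            return File::Spec->catfile($top, $bucket, $name);
        }

        print hashed_path('toplevel', 'access-20011021.log'), "\n";
        # e.g. toplevel/7f/access-20011021.log (the bucket depends on the hash)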

      I would have realised that lc was out of place when my code failed at university tomorrow morning. =)

      So it looks like parsing shell output until I can write an XS module for it. Thanks very much for the explanation.

Re: I'm falling asleep here
by VSarkiss (Monsignor) on Oct 21, 2001 at 04:24 UTC

    A couple of things to try. First, it's not clear from your snippet whether you're concatenating the directory to each file or not. Given a large directory, you're better off doing a chdir there, then opening and reading from ".", because each stat doesn't have to traverse the path. (It's probably cached, depending on your kernel, but this way it won't even matter.) As I said, maybe you're already doing this, but it's not clear from the code.

    Another thing is to separate reading the directory from stat'ing the files. That is, slurp the directory contents in one go, then traverse the array. This will eat up a lot more memory, but it may be a worthwhile tradeoff (you'll have to make that call). Combining those, your loop would look something like this:

    my $dir = '/some/where/';
    chdir $dir or die "Can't chdir $dir\n";
    opendir D, '.' or die "Can't opendir '.'\n";
    my @files = readdir(D);
    closedir D;
    my %filesize;
    foreach my $f (@files) {
        $filesize{$f} = -s $f;
    }

    None of these are guaranteed to make it faster; that actually depends on your underlying operating system more. But they may be worth trying.

    HTH

Re: I'm falling asleep here
by demerphq (Chancellor) on Oct 21, 2001 at 17:15 UTC
    It's the -s bit that slows down the program, the stat takes about 5 mins to execute on 10,000 files, while 'ls' throws a listing up in less than 10 secs, and similar performance for Windows 'dir'.

    There is something weird about all of this. I did a few tests on a 700MHz W2k box and averaged around 3 seconds to stat 20,000 files in a local directory. Doing the readdir into an array first had a negligible effect on performance. I also compared against a backticked dir command (I didn't parse it), which took 1 second.
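
    For anyone who wants to repeat the comparison, a harness along those lines (a sketch, not my actual test script; it assumes a Win32 box and a test directory full of .log files) could be:

    use strict;
    use warnings;
    use Benchmark qw(timethese);

    my $dir = 'C:/logs';    # assumed test directory
    chdir $dir or die "Can't chdir $dir: $!";

    timethese(5, {
        'readdir + -s'   => sub {
            opendir my $dh, '.' or die "Can't opendir '.': $!";
            my %size;
            while (defined(my $f = readdir $dh)) {
                $size{$f} = -s $f if $f =~ /\.log$/i;
            }
            closedir $dh;
        },
        'backticked dir' => sub {
            my @lines = `dir`;    # not parsed, as in the test above
        },
    });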

    My only idea as to why you are seeing the performance you are is that you are going over a network, but you already said the directories were local, so there must be something else going on.

    Incidentally I wrote an MD5 dupe checker once and it usually ran over 40k small files spread through a flat directory tree in a matter of minutes, including unlinks.
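
    For the curious, a bare-bones dupe checker in that spirit (a sketch only, with an assumed starting directory; it reports duplicates rather than unlinking them) might look like:

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    my %seen;    # hex digest => first path seen with that content
    find(sub {
        return unless -f $_;
        open my $fh, '<', $_ or return;
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        if (exists $seen{$digest}) {
            print "dupe: $File::Find::name == $seen{$digest}\n";
            # unlink $File::Find::name;    # only if you really mean it
        }
        else {
            $seen{$digest} = $File::Find::name;
        }
    }, 'C:/logs');    # assumed starting directory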

    Yves
    --
    You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)

Re: I'm falling asleep here
by virtualsue (Vicar) on Oct 21, 2001 at 19:21 UTC
    VSarkiss's suggestion to separate the directory read from the file stat allows you to be more perlish, though it won't solve your ultimate problem (molasses-like run times on your PC). The following idiom (adapted from an example in the Camel) puts the names of all files called '*.log' or '*.LOG'¹ in DIR into the @logfiles array:
    my @logfiles = grep { $_ =~ /\.log$/i } readdir DIR;
    You should use the /i switch in cases like this rather than 'lc', since lc changes the filename that you read, and that's not really what you want (as tilly pointed out).

    What's the spec of the PC that you're using? On our oldest, slowest PC (P233/64Mb, Win98) this operation takes 7 minutes, compared to just under 5 minutes on a newer but not super slick PC (Celeron 500/32Mb, WinMe), accessing the files over our LAN (3 mins when the files were moved to a local directory). By contrast, this runs in less than a second on a Sun Ultra 5...

    ¹(Update) It will also pick up files called *.Log, *.LOg, *.lOg, etc. I assume that's probably not a big deal.
Re: I'm falling asleep here
by uwevoelker (Pilgrim) on Oct 21, 2001 at 21:37 UTC
    You could use
    foreach $file (glob('*.log')) { ... }
    Maybe it is faster.
      Unfortunately, it isn't. It's actually a little slower on my desktop Sun and horrendous on my older Win98 PC. On the latter it processed 10,000 files in 13 minutes, which is nearly twice as slow as the original program (7 minutes). The similar construct while (<*.log>) was even slower on both platforms (Sun/old PC). And here I was thinking that this wasn't a Perl problem. It isn't, really, but it is apparently always possible to make a bad situation worse. :)
Re: -s takes too long on 15,000 files
by Aristotle (Chancellor) on Oct 23, 2001 at 06:51 UTC

    It just irks me to see that awkward code there..

    local $_;
    for (readdir DIR) {
        $filesize{$_} = -s if /\.log$/i;
    }

    But that buys you nothing.

    I can't imagine any reason for the stat to be so slow; that performance sounds more like it's open()ing the files than stat()ing them. (Imagine a 10-minute break here as I'm posting.) Running out of straws to grasp at.. have you tried explicitly stat()ing the files and picking up the size value instead? As in:

    local $_;
    for (readdir DIR) {
        $filesize{$_} = (stat($_))[7] if /\.log$/i;
    }

    Chances are it's 99.9% likely this won't make a difference.. but..

      Nope, it says in the docs that nearly all the -X tests call stat anyway, and they're damn right too! I tried it both ways.
Re: -s takes too long on 15,000 files
by demerphq (Chancellor) on Feb 27, 2006 at 17:15 UTC

    The answer to this turns out to be that some support for some rarely required functionality meant that Perl's stat had to open and then close each one of those files. Later perls will/should support ${^WIN32_SLOPPY_STAT} = 1, which disables this behaviour.
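
    Usage is just a matter of setting the variable before the stats run; a sketch (it only has an effect on a Win32 perl built with this support):

    use strict;
    use warnings;

    ${^WIN32_SLOPPY_STAT} = 1 if $^O eq 'MSWin32';    # skip the open/close per stat

    opendir my $dh, '.' or die "Can't opendir '.': $!";
    my %filesize;
    while (defined(my $file = readdir $dh)) {
        $filesize{$file} = -s $file if $file =~ /\.log$/i;
    }
    closedir $dh;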

    Sorry about how long it took to figure it out. :-)

    ---
    $world=~s/war/peace/g

      some support for some rarely required functionality

      Care to explain what the functionality was/is?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        The POSIX subsystem on NT/Win2k allows the creation of hard links to files if the user doing the creation has Backup/Restore privs. Jan Dubois put a patch into Win32 perl that allows perlfunc:link to create hard links using this feature. (That this is possible is poorly and confusingly documented, both by Microsoft and by ActiveState/Perl. For the latter you can find it documented in perlport.)

        One drawback of this, however, is that various attributes somehow don't propagate to the linked file unless it's opened first (a caching bug, apparently). So in order to cater for the 0.00000000000001 percent of files out there that are hard links on Win32, perl was made to open and close every file it stats on that OS. :-(

        Anyway, 5.10 will support the new var (which I had the honor of naming), which will disable this behaviour, and hopefully the next 5.8.x release will too. Along with the newish sitecustomize.pl support, you can set this variable to true for your perl and bypass this behaviour if you know it won't be a problem.
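
        For example, a one-line sitecustomize.pl (a sketch; the file is only read by a perl built with -Dusesitecustomize, and it runs before every script) could simply be:

        # sitecustomize.pl -- assumed to live in your site library directory
        ${^WIN32_SLOPPY_STAT} = 1 if $^O eq 'MSWin32';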

        This should mean some significant speedups. According to Jan (who quite naturally beat me to an implementation of this):

        The STRICT test uses 30-50% more wall clock time on a local NTFS system, 100% more against a Samba server and about 200% more against a Windows file server. This is very approximate, as times even on the local disk vary widely.

        There are also JUNCTIONS on later NT/Win32 versions, but I don't think they cause stat a problem.

        ---
        $world=~s/war/peace/g