ammon has asked for the wisdom of the Perl Monks concerning the following question:

The documentation for File::Find states that an lstat() is guaranteed to have been called when the 'follow' or 'follow_fast' options are set. As a result, you can use the underline stat cache (the special _ filehandle) in the wanted() function.

When not using the 'follow' or 'follow_fast' options, however, there is no such guarantee of lstat() being called -- it may be called, or it may not be, and, consequently, you can't rely on the underline stat cache containing the stat info of the current file. You have to do it yourself.
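A minimal sketch of the "do it yourself" case described above: the wanted() callback issues its own lstat() and then reuses _ for the follow-up tests. The temporary tree, the file names, and the %size_of hash are assumptions made up for illustration; they are not part of the original question.

```perl
use strict;
use warnings;
use File::Find ();
use File::Temp ();

# Build a small throwaway tree so the sketch is self-contained.
my $root = File::Temp::tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $fh, '>', "$root/$name" or die "open $root/$name: $!";
    print {$fh} "x" x 10;
    close $fh or die "close $root/$name: $!";
}

my %size_of;    # hypothetical result store: path => size in bytes

File::Find::find(
    {
        no_chdir => 1,    # $_ holds the full path inside wanted()
        wanted   => sub {
            # Without follow/follow_fast there is no guarantee that _
            # describes $_, so populate the cache explicitly...
            lstat $_ or return;

            # ...and reuse it for every further test on the same file.
            return unless -f _;
            $size_of{$_} = -s _;
        },
    },
    $root,
);

printf "%s: %d bytes\n", $_, $size_of{$_} for sort keys %size_of;
```

This costs one lstat() per entry in wanted(), on top of whatever File::Find has already done, which is exactly the duplication the question is trying to avoid.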

In a typical file tree that I'm trawling, I'm finding that File::Find has not done a stat() call on about 80% of the files. The other 20% of files have had a stat() call done prior to the wanted() function call. On one set of test data, that 20% represents 6800+ files. Our filesystem gets too much wear and tear already, so I'd like to avoid duplicating those stat calls when File::Find has already done them.

Edit: How can I know whether File::Find has called lstat() or not?

My first thought is, at the end of my wanted() function, to clear the _ cache. That doesn't seem to be possible, aside from doing something like -l q() (it appears that the lstat() system call simply returns ENOENT when passed an empty string, without going to the filesystem). Is there a better way of invalidating the contents of _? The perlfunc documentation mentions how it gets set, but I haven't yet found anything which indicates that cached data can ever be reset, other than by another call to a stat or filetest.

My second question is a direct consequence of clearing _: is it possible to detect that _ is, or is not, a valid stat cache? My current method is:

my ($dev, $inode) = do { no warnings; lstat _ };
($dev, $inode) = lstat $_ unless $dev;

Without the no warnings, I get a warning stating lstat() on unopened filehandle _. Unfortunately, I can't find any documentation that indicates any way to test if _ is valid. I thought I might be able to check the definedness of *::_{FILEHANDLE}, but was wrong (it generates a warning about being deprecated, and always returns an IO::Handle object). I also tried the simpler

my ($dev, $inode) = _ ? lstat _ : lstat $_;
but that generates an error saying Bareword "_" not allowed while "strict subs" in use.

Any suggestions are welcome.

Cheers,

Replies are listed 'Best First'.
Re: Can the special underline filehandle (stat cache) be cleared?
by Tanktalus (Canon) on Oct 04, 2006 at 03:39 UTC

    Our filesystem gets too much wear and tear

    I call "premature optimisation." I find it extremely unlikely that this actually will save much, if any, speed, and pretty much impossible that consecutive hits to the same directory will actually touch the physical media in any way, shape, or form. Your hard disk has a cache. Your filesystem driver has a cache. Your C library has a cache. I doubt that all of these will be emptied between the time that File::Find calls lstat and the time that your wanted sub calls lstat. If so, you probably have bigger issues than just how hard your perl code is hitting the disk.

    If anything, using _ merely avoids the repeated call to the C library's stat or lstat function, so you can save some function call overhead. But when you're hitting 34,000 files, I somehow doubt that CPU time is your limiting factor in your application's speed.

    Thus, my suggestion: relax. Don't fret the small stuff ;-)

      I'm not trying to save speed -- in fact, my benchmarking actually shows that what I'm trying to do is currently slower than just making the additional stat() calls.

      Yes, we have bigger issues than how hard my perl code is hitting the disk, but when those bigger issues consist of already problematic data throughput off our fileservers and around the network, my perl code does have a measurable negative impact (heck... the sys-admins don't even like us doing a simple du on the filesystem in question), particularly since it's intended to be used as a fundamental module in my department.

Re: Can the special underline filehandle (stat cache) be cleared? (nlinks)
by tye (Sage) on Oct 04, 2006 at 03:45 UTC

    You can get the same guarantee by simply turning off the check-nlinks "optimization":

    $File::Find::dont_use_nlink = 1;

    as I mention from-time-to-time such as in Re: File::Find in a thread safe fashion (speed).

    It is unfortunate that this fact isn't documented. It is also sad that this "fastest for the most common cases" method is not easier to use and that $dont_use_nlink has nearly been eliminated from the documentation -- the maintainers continue their delusion that they have succeeded in this latest repeat of trying yet again to make this "optimization" safe and refuse to realize that this "optimization" actually makes typical uses of File::Find slower (and more complex) than they need to be (it only speeds up very limited cases where you are selecting files solely based on their names, except when it just doesn't work right).

    Update: Doing some quick testing I find that File::Find may have broken this guarantee (as I said in the linked node, the code is now too complex for me to easily see that the guarantee still applies like it did ages ago when I first discovered this trick) and File::Find may no longer give faster results when you use it this way (like it did ages ago). My first suspicion given both of these results (which may be a result of flawed testing on my part, as it was just a quick hack of a test) is that File::Find is doing more than it needs to, but I'm not going to spend more time trying to figure out what's really going on with this module that I never use. Indeed, just rolling my own replacement for File::Find cuts my run-time in half.

    My quick hack of a test also validates Tanktalus' point, though testing eons ago did show a speed-up for me. Tanktalus++
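Since the thread reports that the guarantee may depend on the File::Find version, a sketch like the following measures it rather than assuming it. It sets $File::Find::dont_use_nlink, clears _ after every wanted() call, and counts how often _ is usable on entry. The temporary tree, the %count hash, and the $missing path are assumptions for illustration. Note that a defined _ only shows that some lstat() succeeded since the last clear; the sketch assumes, as the question does, that it was made for the current file.

```perl
use strict;
use warnings;
use File::Find ();
use File::Temp ();

# Throwaway tree: two files at the top, one in a subdirectory.
my $root = File::Temp::tempdir( CLEANUP => 1 );
mkdir "$root/sub" or die "mkdir $root/sub: $!";
for my $path ( "$root/a", "$root/b", "$root/sub/c" ) {
    open my $fh, '>', $path or die "open $path: $!";
    close $fh or die "close $path: $!";
}

# Assumption: a path inside our own temp dir that was never created,
# used only to make the next lstat fail and thereby invalidate _.
my $missing = "$root/no-such-entry";

my %count = ( cached => 0, uncached => 0 );

{
    local $File::Find::dont_use_nlink = 1;
    lstat $missing;    # start from a cleared cache

    File::Find::find(
        {
            no_chdir => 1,
            wanted   => sub {
                # _ counts as "cached" only if the last lstat succeeded.
                $count{ defined( -e _ ) ? 'cached' : 'uncached' }++;

                # Invalidate _ so the next call starts from a known state.
                lstat $missing;
            },
        },
        $root,
    );
}

printf "%d of %d entries had a usable _ on entry to wanted()\n",
    $count{cached}, $count{cached} + $count{uncached};
```

The ratio it prints will vary with the File::Find version and the filesystem, so no particular result is claimed here.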

    - tye        

      Ah, thanks for the suggestion about $File::Find::dont_use_nlink. I'll take a look at it. Unfortunately, I've spent enough time in File::Find that it doesn't look as complex as it used to (I've submitted patches for a couple bugs, as a result). :-}

      Update: yup... it's no longer a guarantee that a stat was done on the file.

      However, the more I look at File::Find, the more I want to follow your lead, and roll my own replacement for File::Find -- it's not exactly a stellar example of maintainable code.
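For comparison, a hand-rolled walker can make the "one lstat() per entry" property explicit. The following is only a sketch of that idea, not the replacement either poster actually wrote: walk_tree, its callback signature, and the temporary tree are all hypothetical, and it skips error reporting and does not follow symlinks.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical walker: calls $visit->($path, \@lstat) once per entry,
# issuing exactly one lstat per entry and never following symlinks.
sub walk_tree {
    my ( $top, $visit ) = @_;
    my @pending = ($top);
    while ( defined( my $path = shift @pending ) ) {
        my @st = lstat $path or next;    # entry vanished or unreadable
        my $is_dir = -d _;               # read from the lstat above
        $visit->( $path, \@st );         # callback may clobber _ freely
        next unless $is_dir;
        opendir my $dh, $path or next;
        push @pending,
            map { "$path/$_" } grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;
    }
    return;
}

# Usage on a throwaway tree.
my $root = File::Temp::tempdir( CLEANUP => 1 );
mkdir "$root/sub" or die "mkdir $root/sub: $!";
for my $path ( "$root/a", "$root/sub/b" ) {
    open my $fh, '>', $path or die "open $path: $!";
    print {$fh} "12345";
    close $fh or die "close $path: $!";
}

my %size_of;
walk_tree( $root, sub { my ( $path, $st ) = @_; $size_of{$path} = $st->[7] } );

printf "%s: %d bytes\n", $_, $size_of{$_} for sort keys %size_of;
```

Passing the lstat() list to the callback sidesteps the question of whether _ is valid, at the cost of giving up File::Find's options and portability handling.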

      Cheers,

        So, I'm curious. When you have $File::Find::dont_use_nlink set, in what case does File::Find not lstat a file ?

        - tye        

Re: Can the special underline filehandle (stat cache) be cleared?
by jwkrahn (Abbot) on Oct 04, 2006 at 02:04 UTC
    $ perl -le'
        print defined -e _ ? "defined" : "undef";
        lstat q!test.txt!;
        print defined -e _ ? "defined" : "undef";
        lstat q!/not_a_real_file!;
        print defined -e _ ? "defined" : "undef";
    '
    undef
    defined
    undef
    It appears that calling stat on a nonexistent file will clear _ and that a file test on _ will be undefined if _ is cleared. HTH
      Yes, as mentioned in the OP, my current method of clearing _ is to do an lstat() on the empty string. I was hoping there was a "more correct" way to clear the stat cache, because, while some implementations of the lstat() system call do optimize a stat of the empty string to not bother going to the file system, I'm skeptical about that in the general case.

      It is interesting to note that the file tests, such as -e do not exhibit the same behaviour, with respect to warnings, as stat() and lstat():

      $ perl -we'-e _'
      $ perl -we'stat _'
      stat() on unopened filehandle _ at -e line 1.

      That's a better solution than what I was using before, but I wonder if, as above, there's a "more correct" way to do it.
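Putting the pieces from this subthread together, a sketch of the check-and-clear approach might look like the following. The helper names (stat_cache_is_valid, clear_stat_cache, lstat_reusing_cache) are hypothetical, and the clear step assumes a path inside a directory we created ourselves so that it is known not to exist. It also assumes the previous call was an lstat(), since lstat _ dies if the cache was filled by stat().

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper: true if the last stat/lstat recorded in _ succeeded.
sub stat_cache_is_valid { return defined( -e _ ) }

# Assumption: a name inside our own scratch directory that was never
# created, so lstat on it fails and leaves _ invalid.
my $scratch = File::Temp::tempdir( CLEANUP => 1 );
my $missing = "$scratch/no-such-entry";

# Hypothetical helper: invalidate _ (this still costs one system call).
sub clear_stat_cache { lstat $missing; return }

# Hypothetical helper: reuse _ when it is valid, otherwise lstat the path.
# Only safe if a valid _ is known to describe $path.
sub lstat_reusing_cache {
    my ($path) = @_;
    return stat_cache_is_valid() ? lstat(_) : lstat($path);
}

# Demonstration on a small file.
my $file = "$scratch/present";
open my $fh, '>', $file or die "open $file: $!";
print {$fh} "hello";
close $fh or die "close $file: $!";

clear_stat_cache();
my $valid_after_clear = stat_cache_is_valid();

lstat $file;
my $valid_after_lstat = stat_cache_is_valid();
my @st = lstat_reusing_cache($file);    # answered from _ in this case

printf "after clear: %s\n", $valid_after_clear ? 'valid' : 'cleared';
printf "after lstat: %s\n", $valid_after_lstat ? 'valid' : 'cleared';
printf "size from cache: %d\n", $st[7];
```

The file test avoids the "unopened filehandle _" warning that a bare lstat _ produces, which is why it is used for the validity check.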

      Cheers,

        Note that calling lstat on an empty string on some systems will give you the same results as lstat("."). So this solution isn't portable. [1]

        - tye

        [1] Luckily, it isn't needed. Unluckily, I didn't notice that you were doing this before I replied with my better (IMO, at least) solution.