learnedbyerror has asked for the wisdom of the Perl Monks concerning the following question:

Hello Wise Monks

The village idiot returns, again asking questions about optimization

I am working with large batches of files (over 1,500,000 files in over 7,000 directory branches). I have a set of complex regexes that I use to modify directory and file names to get them to conform to a standard format. While I am a fan of xdg's most wonderful Path::Iterator::Rule, its performance in this case is insufficient.

I have searched the hallowed halls of this monastery and found other seekers also searching for performant path iteration (Fastest way to recurse through VERY LARGE directory tree, Get useful info about a directory tree and others).

I have also created the benchmark below so that others can run the tests locally and see the performance differences that I see.

While it pains me to use it, by far the fastest option that I have found for iterating over paths with a large number of directories and files is to call the binary find utility and read its results via a pipe. The following table contains the output from running the benchmark on a MacBook Pro (i9, 32 GB RAM, fast SSD). Tests on slower systems favor the binary find even more strongly.

                 Rate  PIR PIR fast File::Find perl readdir find pipe iter find pipe
PIR            1.04/s   --      -7%       -29%         -53%           -78%      -80%
PIR fast       1.11/s   7%       --       -24%         -50%           -76%      -79%
File::Find     1.46/s  41%      32%         --         -34%           -69%      -72%
perl readdir   2.20/s 112%      98%        51%           --           -53%      -58%
find pipe iter 4.71/s 354%     325%       222%         114%             --      -11%
find pipe      5.30/s 411%     377%       262%         141%            12%        --

I am currently using code based on the get_pipe_iter function below (with error handling) and am satisfied with the result; however, I do not like having to call an external binary. My last thought is to create a module that uses the POSIX nftw function to perform the work and see how it performs. The challenge for me is that my C/XS skills are limited. Before I embark on this journey, I am asking whether any of you are aware of prior art that incorporates POSIX nftw with Perl as either XS, Inline C or FFI. I would much prefer to stand on the shoulders of another than build yet another variant.
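
For context, the rough and untested direction I have in mind looks something like the sketch below, using Inline::C to wrap nftw and push regular files onto a Perl array. The nftw_files name, the descriptor count and the flag choices are only placeholders, and the _XOPEN_SOURCE define may need adjusting depending on the libc and on where perl.h pulls in the system headers.

#!/usr/bin/env perl
# Untested sketch: wrap POSIX nftw() with Inline::C and collect regular files.
use strict;
use warnings;

use Inline C => <<'END_C';
#define _XOPEN_SOURCE 700   /* nftw() needs this on some libcs */
#include <ftw.h>

/* nftw's callback takes no user-data pointer, so use a file-scope AV */
static AV *found;

static int collect( const char *fpath, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf ) {
    if ( typeflag == FTW_F )            /* regular files only, like -type f */
        av_push( found, newSVpv( fpath, 0 ) );
    return 0;                           /* keep walking */
}

SV *nftw_files( char *path ) {
    found = newAV();
    /* 32 open fds, default flags (follows symlinks, like find -L) */
    if ( nftw( path, collect, 32, 0 ) != 0 )
        croak( "nftw failed on %s", path );
    return newRV_noinc( (SV *)found );
}
END_C

my $files = nftw_files( shift // '.' );
print "$_\n" for @$files;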

Thanks in advance for your sage guidance

lbe

Code for test-find.pl below:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

use Benchmark qw(:all);
use File::Find;
use PIR;

my $Usage = "$0 some/path\n";
die $Usage unless @ARGV and -d $ARGV[0];

# sink the file names so the cost of printing stays constant across tests
open( my $fh, ">", '/dev/null' );

cmpthese(
    50,
    {   'File::Find'     => \&try_find,
        'find pipe'      => \&try_pipe,
        'find pipe iter' => \&try_pipe_iter,
        'PIR'            => \&try_pir,
        'PIR fast'       => \&try_pir_fast,
        'perl readdir'   => \&try_readdir,
    }
);

sub try_find {
    find(
        {   wanted      => sub { print $fh $_ if ( -f $_ ); },
            follow_fast => 1,    # follow symlinks, comparable to find -L
        },
        $ARGV[0]
    );
}

sub try_pipe {
    open( FIND, "find -L $ARGV[0] -type f |" );
    while (<FIND>) { print $fh $_; }
    close FIND;
}

sub get_pipe_iter {
    open( my $FH, "-|", "find -L $ARGV[0] -type f" );
    return ( sub { return (<$FH>); } );
}

sub try_pipe_iter {
    my $next = get_pipe_iter();
    while ( defined( my $file = $next->() ) ) { print $fh $file; }
}

sub try_pir {
    my $rule = PIR->new;
    my $next = $rule->iter( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) { print $fh $file; }
}

sub try_pir_fast {
    my $rule = PIR->new;
    my $next = $rule->iter_fast( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) { print $fh $file; }
}

# plain recursive readdir, skipping . and .. (and any all-dot names)
sub try_readdir {
    my (@dirs) = @_ ? @_ : @ARGV;
    foreach my $dir (@dirs) {
        opendir DIR, $dir;
        my @entries = readdir DIR;
        closedir DIR;
        foreach my $entry (@entries) {
            next if ( $entry =~ m/^\.+$/ );
            if ( -d "$dir/$entry" ) {
                try_readdir("$dir/$entry");
            }
            else {
                print $fh "$dir/$entry";
            }
        }
    }
}
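
For reference, an error-handled variant of get_pipe_iter might look like the sketch below; the get_pipe_iter_checked name and the chomp are choices for this illustration, not requirements. The list form of open bypasses the shell, so path names with spaces or metacharacters are handled safely.

# Sketch of get_pipe_iter with error handling; the list form of open
# bypasses the shell, so odd path names cannot break the command line.
sub get_pipe_iter_checked {
    my ($dir) = @_;
    open( my $FH, "-|", "find", "-L", $dir, "-type", "f" )
        or die "Cannot run find on $dir: $!";
    return sub {
        my $line = <$FH>;
        return undef unless defined $line;
        chomp $line;    # strip the trailing newline before handing it back
        return $line;
    };
}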

Re: Performant Path Iteration
by karlgoethebier (Abbot) on Apr 22, 2019 at 18:15 UTC

    Mmh, with find you have -type f, but with PIR you don't have $rule->file set, as far as I can see in a hurry. Just an idea. Regards, Karl
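
    In the benchmark that would amount to something like this (untested):

        my $rule = PIR->new->file;            # match regular files, like find -type f
        my $next = $rule->iter( $ARGV[0] );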

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

      Karl, good catch. Given the statistics of the corpus that I use (~350 directories compared to ~65,000 files), I didn't expect much difference, but I did get a surprise. Adding $rule->file made PIR (both iter and iter_fast) much slower than the alternatives! (FYI, this is using the Benchmark::Forking version of the program.)

                        Rate  PIR PIR fast File::Find perl readdir find pipe iter find pipe
      PIR            0.622/s   --     -36%       -67%         -77%           -88%      -89%
      PIR fast       0.975/s  57%       --       -48%         -64%           -82%      -83%
      File::Find      1.89/s 205%      94%         --         -31%           -64%      -67%
      perl readdir    2.74/s 342%     181%        45%           --           -48%      -53%
      find pipe iter  5.32/s 757%     446%       181%          94%             --       -8%
      find pipe       5.80/s 833%     495%       206%         111%             9%        --
      

      Cheers, lbe

        "...much slower than the alternatives..."

        I'm sorry to hear this. Too bad. By instinct I expected the contrary. As salva wrote: "...the bottleneck is going to be the disk I/O...maybe the order used to traverse the file system ...could have some influence." Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re: Performant Path Iteration
by LanX (Saint) on Apr 22, 2019 at 19:16 UTC
    some thoughts
    1. Are you sure that reading from the filesystem is the bottleneck?
    2. Your results are very dependent on the topology; I'd say readdir performs better on "bigger" directories
    3. Do these files always change? Otherwise caching them in a DB and only refreshing them when the update timestamp of the directory has changed might be worth considering (rough sketch below).
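
    A minimal sketch of point 3, assuming an in-memory hash in place of the DB and a hypothetical files_for_dir helper:

        # Refresh a per-directory file list only when the directory mtime changes.
        use File::stat;

        sub files_for_dir {
            my ( $dir, $cache ) = @_;    # $cache: { $dir => { mtime, files } }
            my $mtime = stat($dir)->mtime;
            if ( !$cache->{$dir} or $cache->{$dir}{mtime} != $mtime ) {
                opendir my $dh, $dir or die "$dir: $!";
                my @files = grep { -f "$dir/$_" } readdir $dh;
                closedir $dh;
                $cache->{$dir} = { mtime => $mtime, files => \@files };
            }
            return @{ $cache->{$dir}{files} };
        }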

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      Answer by point

      1. Yes, monitoring the processors with htop shows that CPU utilization is pretty low.
      2. You are probably right. I'll create some test datasets to investigate.
      3. The vast majority of the files changes with each iteration, so there would be limited benefit in this approach. In several parallel workflows, I do a lot of caching, originally with BerkeleyDB, but I moved to LMDB about a year ago and got a nice performance bump.

      Thanks for your response, lbe

Re: Performant Path Iteration
by Anonymous Monk on Apr 22, 2019 at 07:36 UTC

      Thanks for sharing this blog. I modified the script to use Benchmark::Forking and re-ran. It changed the values slightly, but not by too much.

                       Rate PIR fast  PIR File::Find perl readdir find pipe iter find pipe
      PIR fast       1.46/s       --  -1%       -26%         -47%           -69%      -71%
      PIR            1.47/s       1%   --       -25%         -47%           -68%      -70%
      File::Find     1.98/s      35%  34%         --         -29%           -58%      -60%
      perl readdir   2.78/s      90%  88%        40%           --           -41%      -44%
      find pipe iter 4.68/s     219% 217%       136%          68%             --       -6%
      find pipe      5.00/s     241% 239%       153%          80%             7%        --
      

      Cheers, lbe

Re: Performant Path Iteration
by perlancar (Hermit) on Apr 22, 2019 at 07:49 UTC
    Could you perhaps write a code to create the sample dataset (the directory tree) so we all can benchmark the recurse code too?

      I'll be happy to do so. It will probably have to wait for the weekend though.

      lbe

      UPDATE: You can run it against any directory with a bit of heft and will likely see similar relative performance, as long as the number of files is much greater than the number of directory nodes.
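
      In the meantime, a throwaway generator along these lines should produce a comparable corpus (untested sketch; the directory and file counts only roughly mirror the ~350 directories / ~65,000 files mentioned above, and the names are arbitrary):

        #!/usr/bin/env perl
        # Untested sketch: build a test tree of empty files for the benchmark.
        use strict;
        use warnings;
        use File::Path qw(make_path);

        my ( $root, $dirs, $files_per_dir ) = ( shift // 'testtree', 350, 200 );
        for my $d ( 1 .. $dirs ) {
            my $dir = sprintf "%s/branch_%04d", $root, $d;
            make_path($dir);
            for my $f ( 1 .. $files_per_dir ) {
                open my $fh, '>', sprintf( "%s/file_%05d.dat", $dir, $f )
                    or die "$dir: $!";
                close $fh;
            }
        }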