learnedbyerror has asked for the wisdom of the Perl Monks concerning the following question:

Hello Wise Monks

The village idiot returns, again asking questions about optimization

I am working with large batches of files (over 1,500,000 files in over 7,000 directory branches). I have a set of complex regexes that I use to modify directory and file names to get them to conform to a standard format. While I am a fan of xdg's most wonderful Path::Iterator::Rule, its performance in this case is insufficient.

I have searched the hallowed halls of this monastery and found other seekers also searching for performant path iteration (Fastest way to recurse through VERY LARGE directory tree, Get useful info about a directory tree and others).

I have also created the benchmark below so that others can run the tests locally and see the performance differences that I see.

While it pains me to use it, by far the fastest option that I have found for iterating over paths with a large number of directories and files is to call the binary find utility and read its results via a pipe. The following table contains the output from running the benchmark on a MacBook Pro (i9, 32 GB RAM, fast SSD). Tests on slower systems favor the binary find even more strongly.

                 Rate  PIR PIR fast File::Find perl readdir find pipe iter find pipe
PIR            1.04/s   --      -7%       -29%         -53%           -78%      -80%
PIR fast       1.11/s   7%       --       -24%         -50%           -76%      -79%
File::Find     1.46/s  41%      32%         --         -34%           -69%      -72%
perl readdir   2.20/s 112%      98%        51%           --           -53%      -58%
find pipe iter 4.71/s 354%     325%       222%         114%             --      -11%
find pipe      5.30/s 411%     377%       262%         141%            12%        --

I am currently using code based on the get_pipe_iter function below (with error handling) and am satisfied with the result; however, I do not like having to call an external binary. My last thought is to create a module that uses the POSIX nftw function to perform the work and see how it performs. The challenge for me is that my C/XS skills are limited. Before I embark on this journey, I am asking whether any of you are aware of prior art that incorporates POSIX nftw with Perl as either XS, Inline C or FFI. I would much prefer to stand on the shoulders of another than build yet another variant.
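
For context, the rough and untested direction I have in mind looks something like the sketch below, using Inline::C to wrap nftw and push regular files onto a Perl array. The nftw_files name, the descriptor count and the flag choices are only placeholders, and the _XOPEN_SOURCE define may need adjusting depending on the libc and on where perl.h pulls in the system headers.

#!/usr/bin/env perl
# Untested sketch: wrap POSIX nftw() with Inline::C and collect regular files.
use strict;
use warnings;

use Inline C => <<'END_C';
#define _XOPEN_SOURCE 700   /* nftw() needs this on some libcs */
#include <ftw.h>

/* nftw's callback takes no user-data pointer, so use a file-scope AV */
static AV *found;

static int collect( const char *fpath, const struct stat *sb,
                    int typeflag, struct FTW *ftwbuf ) {
    if ( typeflag == FTW_F )            /* regular files only, like -type f */
        av_push( found, newSVpv( fpath, 0 ) );
    return 0;                           /* keep walking */
}

SV *nftw_files( char *path ) {
    found = newAV();
    /* 32 open fds, default flags (follows symlinks, like find -L) */
    if ( nftw( path, collect, 32, 0 ) != 0 )
        croak( "nftw failed on %s", path );
    return newRV_noinc( (SV *)found );
}
END_C

my $files = nftw_files( shift // '.' );
print "$_\n" for @$files;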

Thanks in advance for your sage guidance

lbe

Code for test-find.pl below:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

use Benchmark qw(:all);
use File::Find;
use PIR;

my $Usage = "$0 some/path\n";
die $Usage unless @ARGV and -d $ARGV[0];

# sink the file names so the cost of printing stays constant across tests
open( my $fh, ">", '/dev/null' );

cmpthese(
    50,
    {   'File::Find'     => \&try_find,
        'find pipe'      => \&try_pipe,
        'find pipe iter' => \&try_pipe_iter,
        'PIR'            => \&try_pir,
        'PIR fast'       => \&try_pir_fast,
        'perl readdir'   => \&try_readdir,
    }
);

sub try_find {
    find(
        {   wanted      => sub { print $fh $_ if ( -f $_ ); },
            follow_fast => 1,    # follow symlinks, comparable to find -L
        },
        $ARGV[0]
    );
}

sub try_pipe {
    open( FIND, "find -L $ARGV[0] -type f |" );
    while (<FIND>) { print $fh $_; }
    close FIND;
}

sub get_pipe_iter {
    open( my $FH, "-|", "find -L $ARGV[0] -type f" );
    return ( sub { return (<$FH>); } );
}

sub try_pipe_iter {
    my $next = get_pipe_iter();
    while ( defined( my $file = $next->() ) ) { print $fh $file; }
}

sub try_pir {
    my $rule = PIR->new;
    my $next = $rule->iter( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) { print $fh $file; }
}

sub try_pir_fast {
    my $rule = PIR->new;
    my $next = $rule->iter_fast( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) { print $fh $file; }
}

# plain recursive readdir, skipping . and .. (and any all-dot names)
sub try_readdir {
    my (@dirs) = @_ ? @_ : @ARGV;
    foreach my $dir (@dirs) {
        opendir DIR, $dir;
        my @entries = readdir DIR;
        closedir DIR;
        foreach my $entry (@entries) {
            next if ( $entry =~ m/^\.+$/ );
            if ( -d "$dir/$entry" ) {
                try_readdir("$dir/$entry");
            }
            else {
                print $fh "$dir/$entry";
            }
        }
    }
}
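
For reference, an error-handled variant of get_pipe_iter might look like the sketch below; the get_pipe_iter_checked name and the chomp are choices for this illustration, not requirements. The list form of open bypasses the shell, so path names with spaces or metacharacters are handled safely.

# Sketch of get_pipe_iter with error handling; the list form of open
# bypasses the shell, so odd path names cannot break the command line.
sub get_pipe_iter_checked {
    my ($dir) = @_;
    open( my $FH, "-|", "find", "-L", $dir, "-type", "f" )
        or die "Cannot run find on $dir: $!";
    return sub {
        my $line = <$FH>;
        return undef unless defined $line;
        chomp $line;    # strip the trailing newline before handing it back
        return $line;
    };
}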

Re: Performant Path Iteration
by karlgoethebier (Abbot) on Apr 22, 2019 at 18:15 UTC

    Mmh, with find you have -type f, but with PIR you don't have $rule->file set, as far as I can see in a hurry. Just an idea. Regards, Karl
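
    In the benchmark that would amount to something like this (untested):

        my $rule = PIR->new->file;            # match regular files, like find -type f
        my $next = $rule->iter( $ARGV[0] );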

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

      Karl, good catch. Given the statistics of the corpus that I use (~350 directories compared to ~65,000 files), I didn't expect much difference, but I did get a surprise. Adding $rule->file made PIR (both iter and iter_fast) much slower than the alternatives! (FYI, this is using the Benchmark::Forking version of the program.)

                        Rate  PIR PIR fast File::Find perl readdir find pipe iter find pipe
      PIR            0.622/s   --     -36%       -67%         -77%           -88%      -89%
      PIR fast       0.975/s  57%       --       -48%         -64%           -82%      -83%
      File::Find      1.89/s 205%      94%         --         -31%           -64%      -67%
      perl readdir    2.74/s 342%     181%        45%           --           -48%      -53%
      find pipe iter  5.32/s 757%     446%       181%          94%             --       -8%
      find pipe       5.80/s 833%     495%       206%         111%             9%        --
      

      Cheers, lbe

        "...much slower than the alternatives..."

        I'm sorry to hear this. Too bad. By instinct I expected the contrary. As salva wrote: "...the bottleneck is going to be the disk I/O...maybe the order used to traverse the file system ...could have some influence." Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

Re: Performant Path Iteration
by LanX (Saint) on Apr 22, 2019 at 19:16 UTC
    some thoughts
    1. Are you sure that reading from the filesystem is the bottleneck?
    2. Your results are very dependent on the topology; I'd say readdir performs better on "bigger" directories
    3. Do these files always change? Otherwise caching them in a DB and only refreshing them when the update timestamp of the directory has changed might be worth considering (rough sketch below).
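
    A minimal sketch of point 3, assuming an in-memory hash in place of the DB and a hypothetical files_for_dir helper:

        # Refresh a per-directory file list only when the directory mtime changes.
        use File::stat;

        sub files_for_dir {
            my ( $dir, $cache ) = @_;    # $cache: { $dir => { mtime, files } }
            my $mtime = stat($dir)->mtime;
            if ( !$cache->{$dir} or $cache->{$dir}{mtime} != $mtime ) {
                opendir my $dh, $dir or die "$dir: $!";
                my @files = grep { -f "$dir/$_" } readdir $dh;
                closedir $dh;
                $cache->{$dir} = { mtime => $mtime, files => \@files };
            }
            return @{ $cache->{$dir}{files} };
        }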

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      Answer by point

      1. Yes, monitoring the processors with htop shows that CPU utilization is pretty low.
      2. You are probably right. I'll create some test datasets to investigate.
      3. The vast majority of the files changes with each iteration, so there would be limited benefit in this approach. In several parallel workflows, I do a lot of caching, originally with BerkeleyDB, but I moved to LMDB about a year ago and got a nice performance bump.

      Thanks for your response, lbe

Re: Performant Path Iteration
by Anonymous Monk on Apr 22, 2019 at 07:36 UTC

      Thanks for sharing this blog. I modified the script to use Benchmark::Forking and re-ran. It changed the values slightly, but not by too much.

                       Rate PIR fast  PIR File::Find perl readdir find pipe iter find pipe
      PIR fast       1.46/s       --  -1%       -26%         -47%           -69%      -71%
      PIR            1.47/s       1%   --       -25%         -47%           -68%      -70%
      File::Find     1.98/s      35%  34%         --         -29%           -58%      -60%
      perl readdir   2.78/s      90%  88%        40%           --           -41%      -44%
      find pipe iter 4.68/s     219% 217%       136%          68%             --       -6%
      find pipe      5.00/s     241% 239%       153%          80%             7%        --
      

      Cheers, lbe

Re: Performant Path Iteration
by perlancar (Hermit) on Apr 22, 2019 at 07:49 UTC
    Could you perhaps write a code to create the sample dataset (the directory tree) so we all can benchmark the recurse code too?

      I'll be happy to do so. It will probably have to wait for the weekend though.

      lbe

      UPDATE: You can run it against any directory with a bit of heft and will likely see similar relative performance, as long as the number of files is much greater than the number of directory nodes.
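
      In the meantime, a throwaway generator along these lines should produce a comparable corpus (untested sketch; the directory and file counts only roughly mirror the ~350 directories / ~65,000 files mentioned above, and the names are arbitrary):

        #!/usr/bin/env perl
        # Untested sketch: build a test tree of empty files for the benchmark.
        use strict;
        use warnings;
        use File::Path qw(make_path);

        my ( $root, $dirs, $files_per_dir ) = ( shift // 'testtree', 350, 200 );
        for my $d ( 1 .. $dirs ) {
            my $dir = sprintf "%s/branch_%04d", $root, $d;
            make_path($dir);
            for my $f ( 1 .. $files_per_dir ) {
                open my $fh, '>', sprintf( "%s/file_%05d.dat", $dir, $f )
                    or die "$dir: $!";
                close $fh;
            }
        }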