learnedbyerror has asked for the wisdom of the Perl Monks concerning the following question:
Hello Wise Monks,
The village idiot returns, again asking questions about optimization.
I am working with large batches of files (over 1,500,000 files in over 7,000 directory branches). I have a set of complex regexes that I use to modify directory and file names to get them to conform to a standard format. While I am a fan of xdg's most wonderful Path::Iterator::Rule, its performance in this case is insufficient.
I have searched the hallowed halls of this monastery and found other seekers also searching for performant path iteration ("Fastest way to recurse through VERY LARGE directory tree", "Get useful info about a directory tree" and others).
I have also created the benchmark below so that others can run it locally and check whether they see the same performance differences that I see.
While it pains me to use it, by far the fastest option that I have found for iterating over paths with a large number of directories and files is to call the binary find utility and read its results via a pipe. The following table contains the output from running the benchmark on a MacBook Pro (i9, 32 GB RAM, fast SSD). Tests on slower systems favor the binary find even more strongly.
                     Rate  PIR PIR fast File::Find perl readdir find pipe iter find pipe
    PIR            1.04/s   --      -7%       -29%         -53%           -78%      -80%
    PIR fast       1.11/s   7%       --       -24%         -50%           -76%      -79%
    File::Find     1.46/s  41%      32%         --         -34%           -69%      -72%
    perl readdir   2.20/s 112%      98%        51%           --           -53%      -58%
    find pipe iter 4.71/s 354%     325%       222%         114%             --      -11%
    find pipe      5.30/s 411%     377%       262%         141%            12%        --
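For readers unfamiliar with cmpthese output: each cell is the row's rate relative to the column's rate, expressed as a percentage difference. The sketch below recomputes the cells from the rounded rates in the table. The helper name relative_pct is made up for this illustration, and because it starts from rates rounded to two decimals, some cells can differ by a point from what Benchmark printed (for example, 5.30 against 1.04 gives 410% here rather than 411%).

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the benchmark table.
my @rates = (
    [ 'PIR',            1.04 ],
    [ 'PIR fast',       1.11 ],
    [ 'File::Find',     1.46 ],
    [ 'perl readdir',   2.20 ],
    [ 'find pipe iter', 4.71 ],
    [ 'find pipe',      5.30 ],
);

# Hypothetical helper: percentage by which the row's rate is above (+)
# or below (-) the column's rate.
sub relative_pct {
    my ( $row_rate, $col_rate ) = @_;
    return 100 * ( $row_rate / $col_rate - 1 );
}

for my $row (@rates) {
    my ( $row_name, $row_rate ) = @$row;
    my @cells;
    for my $col (@rates) {
        my ( $col_name, $col_rate ) = @$col;
        push @cells, $row_name eq $col_name
            ? '--'
            : sprintf( '%.0f%%', relative_pct( $row_rate, $col_rate ) );
    }
    printf "%-14s %4.2f/s %s\n", $row_name, $row_rate,
        join( ' ', map { sprintf '%5s', $_ } @cells );
}
```

Read this way, the bottom row says that the plain find pipe ran roughly five times as many iterations per second as PIR on this machine.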
I am currently using code based on the get_pipe_iter function below (with error handling) and am satisfied with the result; however, I do not like having to call an external binary. My last thought is to create a module that uses the POSIX nftw function to perform the work and see how that performs. The challenge for me is that my C/XS skills are limited. Before I embark on this journey, I am asking whether any of you are aware of prior art that combines POSIX nftw with Perl, whether via XS, Inline::C or FFI. I would much prefer to stand on the shoulders of another than build yet another variant.
Thanks in advance for your sage guidance
lbe
Code for test-find.pl below:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use utf8;
    use Benchmark qw(:all);
    use File::Find;
    use PIR;    # short alias for Path::Iterator::Rule

    my $Usage = "$0 some/path\n";
    die $Usage unless @ARGV and -d $ARGV[0];

    # Discard all output so that only the traversal cost is measured.
    open( my $fh, ">", '/dev/null' ) or die "Cannot open /dev/null: $!";

    cmpthese(
        50,
        {
            'File::Find'     => \&try_find,
            'find pipe'      => \&try_pipe,
            'find pipe iter' => \&try_pipe_iter,
            'PIR'            => \&try_pir,
            'PIR fast'       => \&try_pir_fast,
            'perl readdir'   => \&try_readdir,
        }
    );

    # File::Find; inside wanted, $_ is the entry name within its directory.
    sub try_find {
        find(
            {
                wanted      => sub { print $fh $_ if ( -f $_ ); },
                follow_fast => 1
            },
            $ARGV[0]
        );
    }

    # External find read through a pipe. The list form of open bypasses the
    # shell, so a path containing spaces or metacharacters is passed intact.
    sub try_pipe {
        open( my $find, "-|", "find", "-L", $ARGV[0], "-type", "f" )
          or die "Cannot run find: $!";
        while (<$find>) {
            print $fh $_;
        }
        close $find;
    }

    # Same pipe, wrapped in a closure that returns one line per call.
    sub get_pipe_iter {
        open( my $FH, "-|", "find", "-L", $ARGV[0], "-type", "f" )
          or die "Cannot run find: $!";
        return ( sub { return scalar <$FH>; } );
    }

    sub try_pipe_iter {
        my $next = get_pipe_iter();
        while ( defined( my $file = $next->() ) ) {
            print $fh $file;
        }
    }

    sub try_pir {
        my $rule = PIR->new;
        my $next = $rule->iter( $ARGV[0] );
        while ( defined( my $file = $next->() ) ) {
            print $fh $file;
        }
    }

    sub try_pir_fast {
        my $rule = PIR->new;
        my $next = $rule->iter_fast( $ARGV[0] );
        while ( defined( my $file = $next->() ) ) {
            print $fh $file;
        }
    }

    # Hand-rolled recursion with opendir/readdir.
    sub try_readdir {
        my (@dirs) = @_ ? @_ : @ARGV;
        foreach my $dir (@dirs) {
            opendir( my $dh, $dir ) or next;    # skip unreadable directories
            my @entries = readdir $dh;
            closedir $dh;
            foreach my $entry (@entries) {
                next if ( $entry =~ m/^\.\.?$/ );    # skip only "." and ".."
                if ( -d "$dir/$entry" ) {
                    try_readdir("$dir/$entry");
                }
                else {
                    print $fh "$dir/$entry";
                }
            }
        }
    }
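As an illustration of what "get_pipe_iter with error handling" could look like, here is a minimal sketch. It is not the code actually in use. The name get_safe_pipe_iter is hypothetical, and it assumes a find that supports -print0 (GNU and BSD/macOS find both do). NUL-separated output keeps file names that contain newlines intact, and the list form of open avoids the shell, so spaces or metacharacters in the root path are harmless.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns an iterator over
# the regular files below $root, as reported by the external find binary.
sub get_safe_pipe_iter {
    my ($root) = @_;

    # List-form pipe open: no shell is involved, arguments are passed as-is.
    open( my $pipe, '-|', 'find', '-L', $root, '-type', 'f', '-print0' )
        or die "Cannot start find: $!";

    return sub {
        return unless $pipe;    # stream already exhausted

        local $/ = "\0";        # records are NUL-terminated because of -print0
        my $file = <$pipe>;
        if ( defined $file ) {
            chomp $file;        # strips the trailing NUL
            return $file;
        }

        # End of stream: close waits for find and sets $? to its wait status.
        close($pipe)
            or warn "find exited with status " . ( $? >> 8 ) . "\n";
        undef $pipe;
        return;
    };
}

# Usage, mirroring try_pipe_iter:
#   my $next = get_safe_pipe_iter( $ARGV[0] );
#   while ( defined( my $file = $next->() ) ) { print "$file\n" }
```

Note that find typically exits non-zero when it hits unreadable directories even though it still lists everything else, which is why the sketch warns at end of stream instead of dying.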
Replies are listed 'Best First'.

Re: Performant Path Iteration
by karlgoethebier (Abbot) on Apr 22, 2019 at 18:15 UTC
by learnedbyerror (Monk) on Apr 23, 2019 at 04:18 UTC
by karlgoethebier (Abbot) on Apr 23, 2019 at 16:21 UTC

Re: Performant Path Iteration
by LanX (Saint) on Apr 22, 2019 at 19:16 UTC
by learnedbyerror (Monk) on Apr 23, 2019 at 04:24 UTC

Re: Performant Path Iteration
by Anonymous Monk on Apr 22, 2019 at 07:36 UTC
by learnedbyerror (Monk) on Apr 23, 2019 at 04:10 UTC

Re: Performant Path Iteration
by perlancar (Hermit) on Apr 22, 2019 at 07:49 UTC
by learnedbyerror (Monk) on Apr 23, 2019 at 04:02 UTC