Hello Wise Monks

The village idiot returns, again asking questions about optimization.

I am working with large batches of files (over 1,500,000 files in over 7,000 directory branches). I have a set of complex regexes that I use to modify directory and file names to get them to conform to a standard format. While I am a fan of xdg's most wonderful Path::Iterator::Rule, its performance in this case is insufficient.

I have searched the hallowed halls of this monastery and found other seekers also searching for performant path iteration ("Fastest way to recurse through VERY LARGE directory tree", "Get useful info about a directory tree", and others).

I have also created the benchmark below so that others can run it locally and see the performance differences that I see.

While it pains me to use it, by far the fastest option that I have found for iterating over paths with a large number of directories and files is to call the binary find utility and read its results via a pipe. The following table contains the output from running the benchmark on a MacBook Pro (i9, 32GB RAM, fast SSD). Tests on slower systems favor the binary find even more strongly.

                 Rate  PIR PIR fast File::Find perl readdir find pipe iter find pipe
PIR            1.04/s   --      -7%       -29%         -53%           -78%      -80%
PIR fast       1.11/s   7%       --       -24%         -50%           -76%      -79%
File::Find     1.46/s  41%      32%         --         -34%           -69%      -72%
perl readdir   2.20/s 112%      98%        51%           --           -53%      -58%
find pipe iter 4.71/s 354%     325%       222%         114%             --      -11%
find pipe      5.30/s 411%     377%       262%         141%            12%        --

I am currently using code based on the get_pipe_iter function below (with error handling) and am satisfied with the result; however, I do not like having to call an external binary. My last thought is to create a module that uses the POSIX nftw function to perform the work and see how that performs. The challenge for me is that my C/XS skills are limited. Before I embark on this journey, I am asking whether any of you are aware of prior art that incorporates POSIX nftw with Perl as either XS, Inline::C or FFI. I would much prefer to stand on the shoulders of another than build another variant.

Thanks in advance for your sage guidance

lbe

Code for test-find.pl below:

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

use Benchmark qw(:all);
use File::Find;
use PIR;    # short alias for Path::Iterator::Rule

my $Usage = "$0 some/path\n";
die $Usage unless @ARGV and -d $ARGV[0];

open( my $fh, ">", '/dev/null' ) or die "Cannot open /dev/null: $!";

cmpthese(
    50,
    {
        'File::Find' => \&try_find,
        'find pipe' => \&try_pipe,
        'find pipe iter' => \&try_pipe_iter,
        'PIR' => \&try_pir,
        'PIR fast' => \&try_pir_fast,
        'perl readdir' => \&try_readdir,
     }
);

sub try_find {
    find( 
        { 
            wanted => sub {
                print $fh $_ if ( -f $_ );
            }, 
            follow_fast => 1 
        },
        $ARGV[0] );
}

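sub try_pipe {
    # List-form open runs find directly, without a shell, so the path
    # needs no quoting.
    open( my $find, "-|", "find", "-L", $ARGV[0], "-type", "f" )
        or die "Cannot run find: $!";
    while (<$find>) {
        print $fh $_;
    }
    close $find;
}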

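sub get_pipe_iter {
    # List-form open avoids the shell; die if find cannot be started.
    open( my $FH, "-|", "find", "-L", $ARGV[0], "-type", "f" )
        or die "Cannot run find: $!";
    return sub { return scalar <$FH>; };    # one line per call, undef at EOF
}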

sub try_pipe_iter {
    my $next = get_pipe_iter();
    while ( defined( my $file = $next->() ) ) {
        print $fh $file;
    }
}

sub try_pir {
    my $rule = PIR->new;
    my $next = $rule->iter( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) {
        print $fh $file;
    }
}

sub try_pir_fast {
    my $rule = PIR->new;
    my $next = $rule->iter_fast( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) {
        print $fh $file;
    }
}

sub try_readdir {
    my ( @dirs ) = @_ ? @_ : @ARGV;
    foreach my $dir ( @dirs ) {
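        my $dh;
        unless ( opendir $dh, $dir ) {
            warn "Cannot open $dir: $!\n";
            next;
        }
        my @entries = readdir $dh;
        closedir $dh;
        foreach my $entry ( @entries ) {
            # Skip only the self and parent entries; a name such as "..." is a real file.
            next if $entry eq '.' or $entry eq '..';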
            if ( -d "$dir/$entry" ) {
                try_readdir( "$dir/$entry" );
            }
            else {
                print $fh "$dir/$entry";
            }
        }
    }
}
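
Since the post mentions using get_pipe_iter "with error handling", here is a minimal sketch of what that might look like. It is not the author's production code: the name make_find_iter is hypothetical, and it assumes a find that supports -print0 (GNU and BSD find both do). Reading NUL-terminated records keeps file names that contain newlines intact, and the list form of open avoids passing the path through a shell.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns an iterator
# over the regular files under $root, as reported by the external find.
sub make_find_iter {
    my ($root) = @_;

    # List-form open bypasses the shell. -print0 (assumed to be supported)
    # terminates each name with a NUL byte instead of a newline.
    open( my $pipe, '-|', 'find', '-L', $root, '-type', 'f', '-print0' )
        or die "Cannot run find: $!";

    return sub {
        return unless $pipe;    # already exhausted
        local $/ = "\0";        # read one NUL-terminated name per call
        my $path = readline $pipe;
        if ( defined $path ) {
            chomp $path;        # removes the trailing NUL
            return $path;
        }

        # End of output: close waits for find and exposes its exit status.
        close $pipe
            or warn "find exited with status " . ( $? >> 8 ) . "\n";
        undef $pipe;
        return;
    };
}

# Only walk a tree when a path is given on the command line.
if (@ARGV) {
    my $next = make_find_iter( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) {
        print "$file\n";
    }
}
```

The iterator keeps returning undef after the stream ends, so it is safe to call once more than necessary. A root path that starts with a dash would still be read by find as an option; this sketch does not guard against that.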
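
For comparison with the iterator-style entries in the benchmark, the recursive try_readdir can also be written as a lazy iterator that needs no external binary. This is only a sketch under assumptions: make_readdir_iter is a hypothetical name, it has not been benchmarked against the table above, and like try_readdir it follows symlinked directories without guarding against loops.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): lazily yields every
# non-directory entry under the given roots using only opendir/readdir.
sub make_readdir_iter {
    my @pending = @_;    # directories still to be scanned
    my @files;           # files found in the most recently scanned directory

    return sub {
        # Scan directories until at least one file is buffered or no
        # directories remain.
        while ( !@files && @pending ) {
            my $dir = pop @pending;
            my $dh;
            unless ( opendir $dh, $dir ) {
                warn "Cannot open $dir: $!\n";
                next;
            }
            for my $entry ( readdir $dh ) {
                next if $entry eq '.' || $entry eq '..';
                my $path = "$dir/$entry";

                # -d follows symlinks, roughly matching find -L.
                if ( -d $path ) { push @pending, $path }
                else            { push @files,   $path }
            }
            closedir $dh;
        }
        return shift @files;    # undef once everything has been visited
    };
}

# Only walk a tree when a path is given on the command line.
if (@ARGV) {
    my $next = make_readdir_iter( $ARGV[0] );
    while ( defined( my $file = $next->() ) ) {
        print "$file\n";
    }
}
```

Using an explicit stack instead of recursion keeps memory bounded by the pending directory list plus one directory's worth of files, which matters more as the tree grows.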

In reply to Performant Path Iteration by learnedbyerror
