
Hello, monks. With your help I've written a script that efficiently processes a large number of text files. I run this script inside directories containing 1K to 10K files, usually fewer than 5K.

However, I've noticed that when I process a larger number of files, i.e. several directories at once, the script gets exponentially slower. For example, a run on 3.5K files takes around 4.5 seconds, but a run on 35K files takes 90 seconds instead of the expected 45, and on 350K files it runs for hours.
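
To make the arithmetic behind those numbers explicit, here is a minimal sketch that compares each observed timing with a straight linear extrapolation from the 3.5K-file run. The helper name linear_estimate is made up for this illustration; the figures are the ones quoted above, and the 350K run is left out because I only know it takes "hours".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run time expected if cost grew linearly with file count.
sub linear_estimate {
    my ($base_files, $base_secs, $files) = @_;
    return $base_secs * $files / $base_files;
}

# Timings quoted in the post: [file count, observed seconds].
my @runs = ([3_500, 4.5], [35_000, 90]);
my ($base_files, $base_secs) = @{ $runs[0] };

for my $run (@runs) {
    my ($files, $secs) = @$run;
    my $expected = linear_estimate($base_files, $base_secs, $files);
    printf "%6d files: observed %5.1f s, linear estimate %5.1f s, ratio %.1f\n",
        $files, $secs, $expected, $secs / $expected;
}
```

Ten times the files costing twenty times the time is clearly worse than linear, although two data points alone cannot tell exponential growth apart from, say, quadratic growth.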

This has baffled me: I'm using subdirectories to organize the data, so filesystem operations shouldn't hurt performance. Additionally, the data filenames are glob()bed into an array that is looped over, rather than the files being slurped in at once and processed in bulk (although I tried that approach in my tests, and it exhibited the same behavior).

What's very interesting is that when I added a counter to stop processing at 1000 files, processing times grew longer with each subdirectory added to the list, even though only 1000 files were processed. Also, I always copy my data to /tmp, which is mounted as tmpfs, to reduce SSD wear and get maximum read/write performance. Testing:

wget http://www.astro.sunysb.edu/fwalter/AST389/TEXTS/Nightfall.htm
html2text-cpp Nightfall.htm >nightfall.txt
mkdir 00; for i in `seq -w 0 3456`; do head -$((RANDOM/128)) nightfall.txt >00/data-$i; done

This will create a directory ("00") with 3,457 randomly sized files inside (data-0000 through data-3456). Perl script:

#!/usr/bin/perl
use strict;
use warnings;
use 5.36.0;
use Env;
use utf8;
use POSIX ":sys_wait_h"; #for waitpid FLAGS
use Time::HiRes qw(gettimeofday tv_interval);
use open ':std', ':encoding(UTF-8)';

my $benchmark = 1; # print timings for loops
my $TMP='/tmp';
my $HOME = $ENV{HOME};
my $IN;
my $OUT;
my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
die "No input files found\n" if $filecount == 0;

say "Parsing $filecount files";
my $wordfile="data.dat";
truncate $wordfile, 0;
#$|=1;
# substitute whole words
my %whole = qw{
  going go
  getting get
  goes go
  knew know
  trying try
  tried try
  told tell
  coming come
  saying say
  men man
  women woman
  took take
  lying lie
  dying die
};
# substitute on prefix
my %prefix = qw{
  need need
  talk talk
  tak take
  used use
  using use
};
# substitute on substring
my %substring = qw{
  mean mean
  work work
  read read
  allow allow
  gave give
  bought buy
  want want
  hear hear
  came come
  destr destroy
  paid pay
  selve self
  cities city
  fight fight
  creat create
  makin make
  includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}i;

truncate $wordfile, 0;
my $maxforks = 64;
print "maxforks: $maxforks\n";
my $forkcount = 0;
my $infile;
my $subdir = 0;
my $subdircount = 255;
my $tempdir = "temp";
mkdir "$tempdir";
mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
$subdir = 0;
my $i = 0;
my $t0 = [gettimeofday];
my $elapsed;
foreach $infile(@data) {
    $forkcount -= waitpid(-1, WNOHANG) > 0 while $forkcount >= $maxforks;
#    do { $elapsed=tv_interval($t0); print "elapsed: $elapsed\n"; die; } if  $i++ >1000;  # 1000 files test
    $i++; # comment out if you uncomment the above line
    $subdir = 1 if $subdir++ > $subdircount;
    if (my $pid = fork) { # $pid defined and !=0 -->parent
        ++$forkcount;
    } else { # $pid==0 -->child
        open my $IN, '<', $infile or exit(0);
        open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
        while (<$IN>) {
            tr/-!"#%&()*',.\/:;?@\[\\\]”_“{’}><^)(|/ /; # no punct "
            s/^/ /;
            s/\n/ \n/;
            s/[[:digit:]]{1,12}//g;
            s/w(as|ere)/be/gi;
            s{$re2}{ $prefix{lc $1} }g;  # prefix
            s{$re3}{ $substring{lc $1} }g;  # part
            s{$re1}{ $whole{lc $1} }g;  # whole
            print $OUT "$_";
        }
        close $OUT;
        close $IN;
        defined $pid and exit(0); # $pid==0 -->child, must exit itself
    }
}
### now wait for all children to finish, no matter who they are
1 while wait != -1;  # avoid zombies this is a blocking operation

local @ARGV = glob("$tempdir/*/*");
my @text = <>;
unlink glob "$tempdir/*/*";
open $OUT, '>', $wordfile or die "Error opening $wordfile";
print $OUT @text;
close $OUT;

$elapsed = tv_interval($t0);
print "regex: $elapsed\n" if $benchmark;

Add more directories to process:

for dir in $(seq -w 01 10); do cp -a 00 $dir; done
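
To see which stage grows as directories are added, the filename expansion and the fork loop can be timed separately, with no per-file work at all. This is only a diagnostic sketch under my own assumptions: time_stage is a hypothetical helper, the 1000-fork cap mirrors the 1000-file test above, and each child is reaped immediately instead of keeping up to 64 in flight as the real script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code block, return (elapsed seconds, results).
sub time_stage {
    my ($code) = @_;
    my $t0     = [gettimeofday];
    my @result = $code->();
    return (tv_interval($t0), @result);
}

# Stage 1: filename expansion only, same pattern as the script.
my ($glob_secs, @data) = time_stage(sub { glob("data-* ??/data-*") });

# Stage 2: fork and reap children that do no work, capped at 1000.
my $limit = @data < 1000 ? scalar @data : 1000;
my ($fork_secs) = time_stage(sub {
    for (1 .. $limit) {
        my $pid = fork;
        next unless defined $pid;          # fork failed, skip this one
        POSIX::_exit(0) if $pid == 0;      # child exits at once, no cleanup
        waitpid($pid, 0);                  # parent reaps before continuing
    }
    return;
});

printf "files: %d  glob: %.3f s  %d empty forks: %.3f s\n",
    scalar @data, $glob_secs, $limit, $fork_secs;
```

Running it once with only "00" present and again after adding the copies shows whether the glob time, the bare fork time, or neither changes with the length of the file list.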

Any help and insight will be greatly appreciated.

In reply to Script exponentially slower as number of files to process increases by xnous
