http://qs1969.pair.com?node_id=11149910


in reply to Script exponentially slower as number of files to process increases

I think your problem is that fork is an expensive operation. The system has to copy the parent process with all its memory (*), duplicate its file descriptors, and so on. And your script does that for each file in the queue, then throws the process away and forks a new one.

What you want is a worker pool with long-lived worker processes that divide the workload more or less evenly. The threads example earlier in this thread does exactly that. At the moment I can't find an equivalent, ready-made module based on forks (Parallel::ForkManager came to mind, but that one doesn't work for this), but you can make do by preparing the divided workload yourself and then explicitly assigning the parts to the forked workers:

#!/usr/bin/perl
use strict;
use warnings;
use 5.34.0;
use Env;
use utf8;
use POSIX "sys_wait_h"; #for waitpid FLAGS
use Time::HiRes qw(gettimeofday tv_interval);
use open ':std', ':encoding(UTF-8)';

my $benchmark = 1; # print timings for loops
my $TMP='/tmp';
my $HOME = $ENV{HOME};
my $IN;
my $OUT;
my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
die if $filecount < 0;
say "Parsing $filecount files";
my $wordfile="data.dat";
truncate $wordfile, 0;
#$|=1;

# substitute whole words
my %whole = qw{
  going    go
  getting  get
  goes     go
  knew     know
  trying   try
  tried    try
  told     tell
  coming   come
  saying   say
  men      man
  women    woman
  took     take
  lying    lie
  dying    die
};

# substitute on prefix
my %prefix = qw{
  need   need
  talk   talk
  tak    take
  used   use
  using  use
};

# substitute on substring
my %substring = qw{
  mean     mean
  work     work
  read     read
  allow    allow
  gave     give
  bought   buy
  want     want
  hear     hear
  came     come
  destr    destroy
  paid     pay
  selve    self
  cities   city
  fight    fight
  creat    create
  makin    make
  includ   include
};

my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}i;

truncate $wordfile, 0;
my $maxforks = 64;
print "maxforks: $maxforks\n";
my $forkcount = 0;
my $infile;
my $subdir = 0;
my $subdircount = 255;
my $tempdir = "temp";
mkdir "$tempdir";
mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
$subdir = 0;
my $i = 0;
my $t0 = [gettimeofday];
my $elapsed;

my $batch_size = int(@data / $maxforks) + 1;
my @batched_data;
push @batched_data, [splice @data, 0, $batch_size] while @data;

for my $worker_id (0..$maxforks-1) {
    if (my $pid = fork) {
        ++$forkcount;
    } else {
        for my $i (0..$#{$batched_data[$worker_id]}) {
            my $infile = $batched_data[$worker_id][$i];
            my $subdir = $worker_id + 1;
            open my $IN, '<', $infile or exit(0);
            open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
            while (<$IN>) {
                tr/-!"#%&()*',.\/:;?@\[\\\]”_“{’}><^)(|/ /; # no punct "
                s/^/ /;
                s/\n/ \n/;
                s/[[:digit:]]{1,12}//g;
                s/w(as|ere)/be/gi;
                s{$re2}{ $prefix{lc $1} }g;    # prefix
                s{$re3}{ $substring{lc $1} }g; # part
                s{$re1}{ $whole{lc $1} }g;     # whole
                print $OUT "$_";
            }
            close $OUT;
            close $IN;
        }
        defined $pid and exit(0); # $pid==0 --> child, must exit itself
    }
}

### now wait for all children to finish, no matter who they are
1 while wait != -1; # avoid zombies, this is a blocking operation

local @ARGV = glob("$tempdir/*/*");
my @text = <>;
unlink glob "$tempdir/*/*";
open $OUT, '>', $wordfile or die "Error opening $wordfile";
print $OUT @text;
close $OUT;
$elapsed = tv_interval($t0);
print "regex: $elapsed\n" if $benchmark;

Note that this version does not process the files in the same order as yours, but that doesn't appear to matter?
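If the order ever does turn out to matter, one option is sketched below (my own sketch, not part of the original; it assumes $tempdir, $maxforks and @batched_data are still in scope, as they are at that point in the script): instead of slurping the temp files in glob() order, walk the batches in the order they were handed to the workers, which is the original order of @data.

# Hypothetical replacement for the "local @ARGV = glob(...); my @text = <>;"
# part of the script above: read the temp files back in batch order so the
# combined output follows the original file order.
my @text;
for my $worker_id (0 .. $maxforks - 1) {
    my $batch = $batched_data[$worker_id] or next; # there may be fewer batches than workers
    my $subdir = $worker_id + 1;                   # same naming scheme as in the workers
    for my $i (0 .. $#$batch) {
        open my $IN, '<', "$tempdir/$subdir/text-$i"
            or die "Error opening $tempdir/$subdir/text-$i: $!";
        push @text, <$IN>;
        close $IN;
    }
}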

(*) Yes, the manual does say that "Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child", but even on a fast, modern system it is going to take at least a few milliseconds per fork.
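If you want to see what that overhead looks like on your own machine, here is a minimal standalone sketch (not part of the original script; the iteration count is arbitrary) that times a loop of bare fork/exit/waitpid cycles. The children do no work at all, so the result is roughly the floor of what each per-file fork costs:

#!/usr/bin/perl
# Rough measurement of fork+wait overhead: fork children that exit
# immediately and report the average time per cycle.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $iterations = 1000;
my $t0 = [gettimeofday];
for (1 .. $iterations) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    exit 0 unless $pid;   # child: do nothing, exit right away
    waitpid $pid, 0;      # parent: reap the child before forking again
}
my $elapsed = tv_interval($t0);
printf "%d forks in %.3f s (%.3f ms per fork+wait)\n",
    $iterations, $elapsed, 1000 * $elapsed / $iterations;

Multiply whatever that prints by the number of files in your queue and it adds up quickly, which is why a fixed pool of 64 long-lived workers comes out so much cheaper than one fork per file.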