
I think your problem is that fork is an expensive operation. The system has to copy the parent process with all its memory (*), duplicate its file descriptors, and so on. You do that for each file in the queue, then throw the process away and fork a new one.

What you want is a worker pool of long-lived worker processes that divide the workload more or less evenly. The threads example earlier in this thread does exactly that. At the moment I can't find an equivalent, ready-made module based on fork (Parallel::ForkManager came to mind, but that one doesn't work for this), but you can make do by dividing the workload yourself, then explicitly assigning the parts to the forked workers:

#!/usr/bin/perl
use strict;
use warnings;
use 5.34.0;
use Env;
use utf8;
use POSIX "sys_wait_h"; #for waitpid FLAGS
use Time::HiRes qw(gettimeofday tv_interval);
use open ':std', ':encoding(UTF-8)';

my $benchmark = 1; # print timings for loops
my $TMP='/tmp';
my $HOME = $ENV{HOME};
my $IN;
my $OUT;
my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
die "No input files found\n" unless $filecount;

say "Parsing $filecount files";
my $wordfile="data.dat";
truncate $wordfile, 0;
#$|=1;
# substitute whole words
my %whole = qw{
  going go
  getting get
  goes go
  knew know
  trying try
  tried try
  told tell
  coming come
  saying say
  men man
  women woman
  took take
  lying lie
  dying die
};
# substitute on prefix
my %prefix = qw{
  need need
  talk talk
  tak take
  used use
  using use
};
# substitute on substring
my %substring = qw{
  mean mean
  work work
  read read
  allow allow
  gave give
  bought buy
  want want
  hear hear
  came come
  destr destroy
  paid pay
  selve self
  cities city
  fight fight
  creat create
  makin make
  includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}i;

truncate $wordfile, 0;
my $maxforks = 64;
print "maxforks: $maxforks\n";
my $forkcount = 0;
my $infile;
my $subdir = 0;
my $subdircount = 255;
my $tempdir = "temp";
mkdir "$tempdir";
mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
$subdir = 0;
my $i = 0;
my $t0 = [gettimeofday];
my $elapsed;

my $batch_size = int(@data / $maxforks) + 1;
my @batched_data;
push @batched_data, [splice @data, 0, $batch_size] while @data;

# Fork only as many workers as there are batches (can be fewer than $maxforks)
for my $worker_id (0..$#batched_data) {
    if (my $pid = fork) {
        ++$forkcount;
    } else {
        for my $i (0..$#{$batched_data[$worker_id]}) {
            my $infile = $batched_data[$worker_id][$i];
            my $subdir = $worker_id + 1;
            open my $IN, '<', $infile or exit(0);
            open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
            while (<$IN>) {
                tr/-!"#%&()*',.\/:;?@\[\\\]農怒筑><^)(|/ /; # no punct "
                s/^/ /;
                s/\n/ \n/;
                s/[[:digit:]]{1,12}//g;
                s/w(as|ere)/be/gi;
                s{$re2}{ $prefix{lc $1} }g;  # prefix
                s{$re3}{ $substring{lc $1} }g;  # part
                s{$re1}{ $whole{lc $1} }g;  # whole
                print $OUT "$_";
            }
            close $OUT;
            close $IN;
        }
        defined $pid and exit(0); # $pid==0 -->child, must exit itself
    }
}

### now wait for all children to finish, no matter who they are
1 while wait != -1;  # avoid zombies this is a blocking operation

local @ARGV = glob("$tempdir/*/*");
my @text = <>;
unlink glob "$tempdir/*/*";
open $OUT, '>', $wordfile or die "Error opening $wordfile";
print $OUT @text;
close $OUT;

$elapsed = tv_interval($t0);
print "regex: $elapsed\n" if $benchmark;

Note that this version does not process the files in the same order as yours, but that doesn't appear to matter?
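
To illustrate how the splice-based batching in the script divides the file list, here is a minimal sketch. The helper name make_batches and the file names are made up for the example; only the batch-size formula and the splice loop are taken from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): split a list of
# items into batches of at most int(n / workers) + 1 items each.
sub make_batches {
    my ($workers, @items) = @_;
    my $batch_size = int(@items / $workers) + 1;
    my @batches;
    push @batches, [splice @items, 0, $batch_size] while @items;
    return @batches;
}

my @files   = map { "data-$_" } 1 .. 8;   # made-up file names
my @batches = make_batches(4, @files);

# Show which files each batch received
printf "batch %d: %s\n", $_, join ' ', @{ $batches[$_] } for 0 .. $#batches;
```

With 8 files and 4 workers the batch size is 3, so only three batches come out (3 + 3 + 2 files). In other words, the number of batches can be smaller than the number of workers, and any worker beyond the last batch has nothing to do.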

(*) Yes, the manual says that "Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child", but even on a fast, modern system it is going to take at least a few milliseconds per fork.
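
If you want to check the per-fork cost on your own machine, something like the following rough sketch should do. The helper name time_forks and the sample size are arbitrary choices of mine, and the result will vary with the system and with how much memory the parent process has mapped, so a tiny script like this probably underestimates the cost for a larger program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: fork $n children that exit immediately, reap each
# one, and return the average wall-clock seconds per fork/wait cycle.
sub time_forks {
    my ($n) = @_;
    my $t0 = [gettimeofday];
    for (1 .. $n) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        POSIX::_exit(0) if $pid == 0;   # child: leave at once, skip END blocks
        waitpid $pid, 0;                # parent: reap the child
    }
    return tv_interval($t0) / $n;
}

my $n = 200;   # arbitrary sample size
printf "average fork+wait: %.3f ms\n", 1000 * time_forks($n);
```

Multiply that figure by the number of files to get a feel for how much of the run time goes into process creation alone when forking once per file.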

In reply to Re: Script exponentially slower as number of files to process increases by kikuchiyo
in thread Script exponentially slower as number of files to process increases by xnous
