comment on

Devel::NYTProf can handle forking code (use nytprofmerge). I don't have html2text-cpp available, so I can't replicate your experiment. But are you sure your machine has enough resources to run 64 forks all reading the same file and each writing to a different one? Also, waitpid might be non-blocking, but it still takes some time to process, and you're calling it millions of times.

Update: I rewrote your script to use Thread::Queue using 8 threads on my 4 CPU machine. Processing just 00 takes

Parsing 3457 files
regex: 4.126449

real    0m4.194s
user    0m13.465s
sys    0m0.414s
[download]

while processing 00 .. 04

Parsing 17285 files
regex: 20.588749

real    0m20.681s
user    1m7.373s
sys    0m1.873s
[download]

Nothing exponential here, right?

#!/usr/bin/perl
use strict;
use feature qw{ say };
use warnings;
use Env;
use utf8;
use Time::HiRes qw(gettimeofday tv_interval usleep);
use open ':std', ':encoding(UTF-8)';
use threads;
use Thread::Queue;


my $benchmark = 1; # print timings for loops
my $TMP='./tmp';
my $IN;
my $OUT;
my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
die if $filecount < 0;

say "Parsing $filecount files";
my $wordfile="data.dat";
truncate $wordfile, 0;
#$|=1;
# substitute whole words
my %whole = qw{
  going go
  getting get
  goes go
  knew know
  trying try
  tried try
  told tell
  coming come
  saying say
  men man
  women woman
  took take
  lying lie
  dying die
};
# substitute on prefix
my %prefix = qw{
  need need
  talk talk
  tak take
  used use
  using use
};
# substitute on substring
my %substring = qw{
  mean mean
  work work
  read read
  allow allow
  gave give
  bought buy
  want want
  hear hear
  came come
  destr destroy
  paid pay
  selve self
  cities city
  fight fight
  creat create
  makin make
  includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}
+i;

truncate $wordfile, 0;
my $threads = 8;
my $forkcount = 0;
my $infile;
my $subdir = 0;
my $subdircount = 255;
my $tempdir = "temp";
mkdir "$tempdir";
mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
$subdir = 0;
my $i = 0;
my $t0 = [gettimeofday];
my $elapsed;

my $queue = 'Thread::Queue'->new;

sub process_file {
    while (my $task = $queue->dequeue) {
        my ($infile, $subdir, $i) = @$task;
        open my $IN, '<', $infile or exit(0);
        open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
        while (<$IN>) {
            tr/-!"#%&()*',.\/:;?@\[\\\]農怒筑><^)(|/ /; # no punct "
            s/^/ /;
            s/\n/ \n/;
            s/[[:digit:]]{1,12}//g;
            s/w(as|ere)/be/gi;
            s{$re2}{ $prefix{lc $1} }g;  # prefix
            s{$re3}{ $substring{lc $1} }g;  # part
            s{$re1}{ $whole{lc $1} }g;  # whole
            print $OUT "$_";
        }
        close $OUT;
        close $IN;
    }
}
my @workers = map threads->create(\&process_file), 1 .. $threads;

foreach $infile (@data) {
    $subdir = 1 if $subdir++ > $subdircount;
    $queue->enqueue([$infile, $subdir, $i++]);
}
$queue->end;
$_->join for @workers;

local @ARGV = glob("$tempdir/*/*");
open $OUT, '>', $wordfile or die "Error opening $wordfile";
print {$OUT} $_ while <>;
close $OUT;
unlink glob "$tempdir/*/*";

$elapsed = tv_interval($t0);
print "regex: $elapsed\n" if $benchmark;
[download]

Note that I didn't try to understand what the code does, I just replaced fork with threads.

I also changed the final output to work line by line instead of reading the whole contents into memory, but it didn't help when using fork.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

In reply to Re: Script exponentially slower as number of files to process increases by choroba
in thread Script exponentially slower as number of files to process increases by xnous

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.