Re: Script exponentially slower as number of files to process increases

Devel::NYTProf can handle forking code (use nytprofmerge). I don't have html2text-cpp available, so I can't replicate your experiment. But are you sure your machine has enough resources to run 64 forks all reading the same file and each writing to a different one? Also, waitpid might be non-blocking, but it still takes some time to process, and you're calling it millions of times.
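
To illustrate the waitpid point: a minimal sketch, assuming the original loop polls waitpid with WNOHANG to enforce its fork limit (I haven't checked that it does). A blocking wait() sleeps until a child exits, so it runs once per child rather than millions of times. The helper name run_limited and the dummy job are hypothetical, not taken from your script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# Run $job->($task) in a forked child for every task, keeping at most
# $max_forks children alive. run_limited() is a hypothetical helper name.
sub run_limited {
    my ($max_forks, $job, @tasks) = @_;
    my $running = 0;
    my $reaped  = 0;
    for my $task (@tasks) {
        if ($running >= $max_forks) {
            # Blocking wait: sleeps in the kernel until one child exits,
            # so it is called once per child instead of in a polling loop.
            wait();
            $running--;
            $reaped++;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            $job->($task);
            POSIX::_exit(0);    # leave the child without running END blocks
        }
        $running++;
    }
    # Reap whatever is still running; wait() returns -1 when none are left.
    $reaped++ while wait() != -1;
    return $reaped;
}

# Dummy job: sleep 10 ms to stand in for the real per-file work.
my $done = run_limited(4, sub { select(undef, undef, undef, 0.01) }, 1 .. 20);
print "reaped $done children\n";
```

This only shows the waiting pattern. Whether it removes the slowdown in your script is something only a profile can tell.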

Update: I rewrote your script to use Thread::Queue using 8 threads on my 4 CPU machine. Processing just 00 takes

Parsing 3457 files
regex: 4.126449

real    0m4.194s
user    0m13.465s
sys    0m0.414s

while processing 00 .. 04

Parsing 17285 files
regex: 20.588749

real    0m20.681s
user    1m7.373s
sys    0m1.873s

Nothing exponential here, right?
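
A quick way to check the scaling is to divide each run's time by its file count. The sketch below uses only the two timings reported above; the helper name ms_per_file is made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Timings copied from the two runs above: [files, seconds printed as "regex:"].
my @runs = ([3457, 4.126449], [17285, 20.588749]);

# Milliseconds spent per file for one run (hypothetical helper).
sub ms_per_file {
    my ($files, $seconds) = @_;
    return 1000 * $seconds / $files;
}

printf "%6d files: %.2f ms per file\n", $_->[0], ms_per_file(@$_) for @runs;

# Growth factors between the two runs: file count versus elapsed time.
printf "files x%.2f, time x%.2f\n",
    $runs[1][0] / $runs[0][0], $runs[1][1] / $runs[0][1];
```

Both runs come out at about 1.19 ms per file: five times the files took about five times as long, which is what linear scaling looks like.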

#!/usr/bin/perl
use strict;
use feature qw{ say };
use warnings;
use Env;
use utf8;
use Time::HiRes qw(gettimeofday tv_interval usleep);
use open ':std', ':encoding(UTF-8)';
use threads;
use Thread::Queue;


my $benchmark = 1; # print timings for loops
my $TMP='./tmp';
my $IN;
my $OUT;
my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
die "No files to process" if $filecount <= 0; # a count is never negative, so test for zero

say "Parsing $filecount files";
my $wordfile="data.dat";
truncate $wordfile, 0;
#$|=1;
# substitute whole words
my %whole = qw{
  going go
  getting get
  goes go
  knew know
  trying try
  tried try
  told tell
  coming come
  saying say
  men man
  women woman
  took take
  lying lie
  dying die
};
# substitute on prefix
my %prefix = qw{
  need need
  talk talk
  tak take
  used use
  using use
};
# substitute on substring
my %substring = qw{
  mean mean
  work work
  read read
  allow allow
  gave give
  bought buy
  want want
  hear hear
  came come
  destr destroy
  paid pay
  selve self
  cities city
  fight fight
  creat create
  makin make
  includ include
};
my $re1 = qr{\b(@{[ join '|', reverse sort keys %whole ]})\b}i;
my $re2 = qr{\b(@{[ join '|', reverse sort keys %prefix ]})\w*}i;
my $re3 = qr{\b\w*?(@{[ join '|', reverse sort keys %substring ]})\w*}i;

truncate $wordfile, 0;
my $threads = 8;
my $forkcount = 0;
my $infile;
my $subdir = 0;
my $subdircount = 255;
my $tempdir = "temp";
mkdir "$tempdir";
mkdir "$tempdir/$subdir" while ($subdir++ <= $subdircount);
$subdir = 0;
my $i = 0;
my $t0 = [gettimeofday];
my $elapsed;

my $queue = 'Thread::Queue'->new;

sub process_file {
    while (my $task = $queue->dequeue) {
        my ($infile, $subdir, $i) = @$task;
        open my $IN, '<', $infile or exit(0);
        open my $OUT, '>', "$tempdir/$subdir/text-$i" or exit(0);
        while (<$IN>) {
            tr/-!"#%&()*',.\/:;?@\[\\\]農怒筑><^)(|/ /; # no punct "
            s/^/ /;
            s/\n/ \n/;
            s/[[:digit:]]{1,12}//g;
            s/w(as|ere)/be/gi;
            s{$re2}{ $prefix{lc $1} }g;  # prefix
            s{$re3}{ $substring{lc $1} }g;  # part
            s{$re1}{ $whole{lc $1} }g;  # whole
            print $OUT "$_";
        }
        close $OUT;
        close $IN;
    }
}
my @workers = map threads->create(\&process_file), 1 .. $threads;

foreach $infile (@data) {
    $subdir = 1 if $subdir++ > $subdircount;
    $queue->enqueue([$infile, $subdir, $i++]);
}
$queue->end;
$_->join for @workers;

local @ARGV = glob("$tempdir/*/*");
open $OUT, '>', $wordfile or die "Error opening $wordfile";
print {$OUT} $_ while <>;
close $OUT;
unlink glob "$tempdir/*/*";

$elapsed = tv_interval($t0);
print "regex: $elapsed\n" if $benchmark;

Note that I didn't try to understand what the code does, I just replaced fork with threads.

I also changed the final output to work line by line instead of reading the whole contents into memory, but it didn't help when using fork.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Re^2: Script exponentially slower as number of files to process increases by marioroy (Prior) on Jan 27, 2023 at 03:31 UTC
Does your Perl lack threads support? Fortunately, there are MCE::Child and MCE::Channel, which run similarly to threads. The following are the changes to choroba's script. Basically, I replaced threads with MCE::Child and Thread::Queue with MCE::Channel. That's it, no other changes.

9,10c9,10
< use threads;
< use Thread::Queue;
---
> use MCE::Child;
> use MCE::Channel;
88c88
< my $queue = 'Thread::Queue'->new;
---
> my $queue = 'MCE::Channel'->new;
110c110
< my @workers = map threads->create(\&process_file), 1 .. $threads;
---
> my @workers = map MCE::Child->create(\&process_file), 1 .. $threads;

Let's see how they perform in a directory containing 35,841 files. I'm on a Linux box and running from /tmp/. The scripts are configured to spin 8 threads or processes.

# threads, Thread::Queue
Parsing 35841 files
regex: 12.427632

real    0m12.486s
user    1m21.869s
sys     0m1.009s

# MCE::Child, MCE::Channel
Parsing 35841 files
regex: 8.971663

real    0m9.035s
user    0m56.504s
sys     0m1.097s

Another monk, kikuchiyo, posted a parallel demonstration. I'm running it simply for any monk who may like to know how it performs.

Parsing 35841 files
maxforks: 8
regex: 8.622583

real    0m8.953s
user    0m52.559s
sys     0m1.006s

Seeing many cores near 100% simultaneously is magical. There is { threads, Thread::Queue }, { MCE::Child, MCE::Channel }, or roll your own. All three demonstrations work well.

Let's imagine for a moment being the CPU or the OS, facing a directory containing 350K files. Actually, imagine being Perl itself. May I suggest a slight improvement... Try to populate the @data array after spawning threads or processes. This is especially true on the Windows platform. Unix OSes typically benefit from copy-on-write. That did not work for this use case. See below for before and after results. It's quite natural to want to create the data array first, before spinning workers.
The problem is that Perl threads make a copy, including emulated fork on the Windows platform. It's not likely a problem for a few thousand items. But with 350K items, that's an unnecessary copy for each thread.

# threads (same applies to running MCE::Child or parallel module of your choice)

my @workers = map threads->create(\&process_file), 1 .. $threads;

my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);

if ($filecount <= 0) {
    $queue->end;
    $_->join for @workers;
    die "there are no files to process";
}

say "Parsing $filecount files";

foreach $infile (@data) {
    $subdir = 1 if $subdir++ > $subdircount;
    $queue->enqueue([$infile, $subdir, $i++]);
}

$queue->end;
$_->join for @workers;

I created a directory containing 135,842 files. Before: threads consume 178 MB; after the update: threads consume 98 MB. Interestingly, for MCE::Child, each worker process consumes ~ 30 MB before and ~ 10 MB after the update.

Next, I tested before and after for a directory containing 350K files, spawning 32 workers. Threads before and after the update consume 1,122 MB and 240 MB, respectively. Likewise, each MCE::Child process consumes ~ 63 MB before and ~ 10 MB after the update.
Re^2: Script exponentially slower as number of files to process increases by xnous (Sexton) on Jan 26, 2023 at 02:46 UTC
Indeed, your threads rewrite scales linearly and I will adopt it, thank you. However, I'm still intrigued as to why the fork version behaves as it does.

My "production" system is a 10-year-old, 8-thread i7, but with plenty of RAM (32GB) to handle the task at hand. Initially, the script had neither a fork limit nor any waitpid() or wait() call except the final one after the regex loop; I had simply used the solution Marshall gave me on my previous question. Adding a limit didn't make a difference in performance anyway: I experimented with limits ranging from 8 to 512, thinking at first that unconstrained forking was the cause.

I noticed your version only pumps "user" time on my system monitor, while fork shows a significant amount of "system" time, at least 10% of total CPU, whether $maxforks is 8 or unlimited.

All that said, I'd still like to find out why the initial script becomes so slow when the files to process multiply, even when I limit its scope to the first 1000 files. It's almost unreasonable.