Re^2: Script exponentially slower as number of files to process increases

Does your Perl lack threads support? Fortunately, MCE::Child and MCE::Channel behave much like threads and Thread::Queue. The following are the changes to choroba's script: I replaced threads with MCE::Child and Thread::Queue with MCE::Channel. That's it, no other changes.

9,10c9,10
< use threads;
< use Thread::Queue;
---
> use MCE::Child;
> use MCE::Channel;
88c88
< my $queue = 'Thread::Queue'->new;
---
> my $queue = 'MCE::Channel'->new;
110c110
< my @workers = map threads->create(\&process_file), 1 .. $threads;
---
> my @workers = map MCE::Child->create(\&process_file), 1 .. $threads;

Let's see how they perform in a directory containing 35,841 files. I'm on a Linux box and running from /tmp/. The scripts are configured to spin up 8 threads or processes.

# threads, Thread::Queue

Parsing 35841 files
regex: 12.427632

real  0m12.486s
user  1m21.869s
sys   0m1.009s


# MCE::Child, MCE::Channel

Parsing 35841 files
regex: 8.971663

real  0m9.035s
user  0m56.504s
sys   0m1.097s

Another monk, kikuchiyo, posted a parallel demonstration. I ran it as well, for any monk who may like to know how it performs.

Parsing 35841 files
maxforks: 8
regex: 8.622583

real  0m8.953s
user  0m52.559s
sys   0m1.006s

Seeing many cores near 100% simultaneously is magical. There are { threads, Thread::Queue }; { MCE::Child, MCE::Channel }; or roll your own. All three demonstrations work well.
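
For anyone who has not used the MCE pair before, here is a minimal, self-contained sketch of the create / enqueue / end / join pattern. It is not choroba's script: the count_jobs routine, the item names and the worker count of 4 are made up for illustration, and it assumes MCE::Child and MCE::Channel are installed from CPAN.

```perl
use strict;
use warnings;
use MCE::Child;
use MCE::Channel;

# The channel is created before the workers so every child shares it.
my $queue = MCE::Channel->new;

# Hypothetical worker: pull jobs until the channel is ended,
# then hand the number of jobs seen back to the parent via join.
sub count_jobs {
    my $seen = 0;
    while (defined(my $job = $queue->dequeue)) {
        my ($name, $index) = @$job;   # a job is an array ref, as in the script
        $seen++;
    }
    return $seen;
}

# Spawn the workers first, then feed them.
my @workers = map { MCE::Child->create(\&count_jobs) } 1 .. 4;

$queue->enqueue(["item-$_", $_]) for 1 .. 100;
$queue->end;   # no more jobs; dequeue returns undef once the queue drains

my $total = 0;
$total += $_->join for @workers;
print "workers handled $total jobs\n";
```

How the jobs are split between the workers varies from run to run; only the total is stable.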

Let's imagine for a moment being the CPU or the OS, faced with a directory containing 350K files. Actually, imagine being Perl itself. May I suggest a slight improvement: populate the @data array after spawning the threads or processes. This matters especially on the Windows platform. Unix OSes typically benefit from Copy-on-Write, but that did not help for this use-case. See below for before and after results.

It's quite natural to want to create the data array first, before spinning up workers. The problem is that Perl threads copy the array into each new thread, and the same goes for the emulated fork on the Windows platform. That's not likely a problem for a few thousand items, but with 350K items it is an unnecessary copy per thread.

# threads (same applies to running MCE::Child or parallel module of your choice)

my @workers = map threads->create(\&process_file), 1 .. $threads;

my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
if ($filecount <= 0) {
    $queue->end;
    $_->join for @workers;
    die "there are no files to process";
}

say "Parsing $filecount files";
foreach $infile (@data) {
    $subdir = 1 if $subdir++ > $subdircount;
    $queue->enqueue([$infile, $subdir, $i++]);
}
$queue->end;
$_->join for @workers;

I created a directory containing 135,842 files. Before the update, threads consume 178 MB; after the update, threads consume 98 MB. Interestingly, for MCE::Child, each worker process consumes ~ 30 MB before and ~ 10 MB after the update.

Next, I tested before and after for a directory containing 350K files, spawning 32 workers. Threads consume 1,122 MB before and 240 MB after the update. Likewise, each MCE::Child process consumes ~ 63 MB before and ~ 10 MB after the update.
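
One way to look at the before and after effect on your own box is sketched below. This is an assumption about how one might check, not necessarily how the figures above were gathered: it reads VmRSS from /proc, so it is Linux-only, the helper names rss_kb and child_rss are made up, and it uses plain fork with core Perl rather than threads or MCE::Child.

```perl
use strict;
use warnings;
use POSIX ();

# Resident set size in kB for a process ID, or undef when /proc
# is not available (anything other than Linux, typically).
sub rss_kb {
    my ($pid) = @_;
    open my $fh, '<', "/proc/$pid/status" or return;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Fork one child and have it report its own RSS through a pipe.
# Returns -1 when the child could not read /proc.
sub child_rss {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        close $reader;
        my $rss = rss_kb($$);
        print {$writer} (defined($rss) ? $rss : -1), "\n";
        close $writer;        # flushes the pipe before leaving
        POSIX::_exit(0);      # skip END blocks and destructors in the child
    }
    close $writer;
    chomp(my $answer = <$reader>);
    close $reader;
    waitpid($pid, 0);
    return $answer;
}

my $before = child_rss();                        # worker spawned first
my @data   = map { "data-$_" } 1 .. 200_000;     # stand-in for the glob list
my $after  = child_rss();                        # worker spawned afterwards

print "child forked before building the list: $before kB\n";
print "child forked after building the list:  $after kB\n";
```

Keep in mind that RSS also counts pages a forked child still shares with its parent, so treat the two numbers as a rough comparison rather than an exact measure of extra memory.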
