Does your Perl lack threads support? Fortunately, MCE::Child and MCE::Channel run similarly to threads. The following are the changes to choroba's script. Basically, I replaced threads with MCE::Child and Thread::Queue with MCE::Channel. That's it, no other changes.
9,10c9,10
< use threads;
< use Thread::Queue;
---
> use MCE::Child;
> use MCE::Channel;
88c88
< my $queue = 'Thread::Queue'->new;
---
> my $queue = 'MCE::Channel'->new;
110c110
< my @workers = map threads->create(\&process_file), 1 .. $threads;
---
> my @workers = map MCE::Child->create(\&process_file), 1 .. $threads;
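For anyone who has not looked at choroba's script, here is a minimal sketch of the worker/queue pattern that the diff above swaps over. It is not the original script: the worker body, the item list (1 .. 100) and the worker count of 4 are made up for illustration, and it assumes the MCE distribution is installed from CPAN. Each worker just counts the items it dequeues and hands the count back through join.

```perl
use strict;
use warnings;
use MCE::Child;
use MCE::Channel;

# Shared channel, standing in for Thread::Queue (default implementation).
my $queue = MCE::Channel->new;

# Hypothetical worker: counts the items it receives.
sub worker {
    my $count = 0;
    # dequeue returns undef once the channel has been ended and drained.
    while (defined(my $item = $queue->dequeue)) {
        $count++;
    }
    return $count;    # handed back to the parent via join
}

# Spawn the workers first, then feed the channel.
my @workers = map MCE::Child->create(\&worker), 1 .. 4;

$queue->enqueue($_) for 1 .. 100;
$queue->end;          # tell the workers that no more items are coming

my $total = 0;
for my $child (@workers) {
    my ($n) = $child->join;    # join returns the worker's return value
    $total += $n;
}

print "processed $total items\n";
```

The shape mirrors the threads version one to one: create replaces threads->create, and enqueue, dequeue and end carry the same names as in Thread::Queue, which is why the diff stays so small.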
Let's see how they perform in a directory containing 35,841 files. I'm on a Linux box and running from /tmp/. The scripts are configured to spin up 8 threads or processes.
# threads, Thread::Queue
Parsing 35841 files
regex: 12.427632
real 0m12.486s
user 1m21.869s
sys 0m1.009s
# MCE::Child, MCE::Channel
Parsing 35841 files
regex: 8.971663
real 0m9.035s
user 0m56.504s
sys 0m1.097s
Another monk, kikuchiyo, posted a parallel demonstration. I'm running it simply for monks who may like to know how it performs.
Parsing 35841 files
maxforks: 8
regex: 8.622583
real 0m8.953s
user 0m52.559s
sys 0m1.006s
Seeing many cores near 100% simultaneously is magical. There are { threads, Thread::Queue }; { MCE::Child, MCE::Channel }; or roll your own. All three demonstrations work well.
Let's imagine for a moment being the CPU or the OS, facing a directory containing 350K files. Actually, imagine being Perl itself. May I suggest a slight improvement: try to populate the @data array after spawning threads or processes. This matters especially on the Windows platform. Unix OSes typically benefit from Copy-on-Write, but that did not help for this use case. See below for before and after results.
It's quite natural to want to create the data array first, before spinning up workers. The problem is that Perl threads make a copy of it, and so does the emulated fork on the Windows platform. That's not likely a problem for a few thousand items. But with 350K items, that's an unnecessary copy for each thread.
# threads (same applies to running MCE::Child or parallel module of your choice)
my @workers = map threads->create(\&process_file), 1 .. $threads;
my @data = glob("data-* ??/data-*");
my $filecount = scalar(@data);
if ($filecount <= 0) {
    $queue->end;
    $_->join for @workers;
    die "there are no files to process";
}
say "Parsing $filecount files";
foreach $infile (@data) {
    $subdir = 1 if $subdir++ > $subdircount;
    $queue->enqueue([$infile, $subdir, $i++]);
}
$queue->end;
$_->join for @workers;
I created a directory containing 135,842 files. Before: threads consume 178 MB; after the update: threads consume 98 MB. Interestingly, for MCE::Child, each worker process consumes ~ 30 MB before and ~ 10 MB after the update.
Next, I tested before and after for a directory containing 350K files, spawning 32 workers. Threads consume 1,122 MB before and 240 MB after the update. Likewise, each MCE::Child process consumes ~ 63 MB before and ~ 10 MB after the update.