Does your Perl lack threads support? Fortunately, MCE::Child and MCE::Channel run similarly to threads. The following are the changes to choroba's script. Basically, I replaced threads with MCE::Child and Thread::Queue with MCE::Channel. That's it, no other changes.
    9,10c9,10
    < use threads;
    < use Thread::Queue;
    ---
    > use MCE::Child;
    > use MCE::Channel;
    88c88
    < my $queue = 'Thread::Queue'->new;
    ---
    > my $queue = 'MCE::Channel'->new;
    110c110
    < my @workers = map threads->create(\&process_file), 1 .. $threads;
    ---
    > my @workers = map MCE::Child->create(\&process_file), 1 .. $threads;
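For anyone who hasn't used MCE before, here is a minimal, self-contained sketch (not choroba's script; the item list and process_item routine are made up for illustration) showing why the swap is a drop-in: MCE::Channel accepts the same enqueue/dequeue/end calls as Thread::Queue, and MCE::Child workers are created and joined like threads.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature 'say';

    use MCE::Child;
    use MCE::Channel;

    my $nworkers = 8;
    my $queue    = MCE::Channel->new;

    # Worker: pull items until the channel is ended (dequeue then returns undef).
    sub process_item {
        while ( defined( my $item = $queue->dequeue ) ) {
            say "processing $item";
        }
    }

    my @workers = map MCE::Child->create(\&process_item), 1 .. $nworkers;

    $queue->enqueue($_) for 1 .. 100;
    $queue->end;                  # signal workers that no more items are coming
    $_->join for @workers;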
Let's see how they perform in a directory containing 35,841 files. I'm on a Linux box and running from /tmp/. The scripts are configured to spin up 8 threads or processes.
    # threads, Thread::Queue
    Parsing 35841 files
    regex: 12.427632

    real    0m12.486s
    user    1m21.869s
    sys     0m1.009s

    # MCE::Child, MCE::Channel
    Parsing 35841 files
    regex: 8.971663

    real    0m9.035s
    user    0m56.504s
    sys     0m1.097s
Another monk, kikuchiyo, posted a parallel demonstration. I'm running it here simply for any monk who may like to know how it performs.
    Parsing 35841 files
    maxforks: 8
    regex: 8.622583

    real    0m8.953s
    user    0m52.559s
    sys     0m1.006s
Seeing many cores near 100% simultaneously is magical. There are { threads, Thread::Queue }, { MCE::Child, MCE::Channel }, or roll your own. All three demonstrations work well.
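kikuchiyo's actual code is not reproduced here, but a bare-bones roll-your-own version might look something like the sketch below: fork $maxforks children and hand each an interleaved slice of the file list. The glob pattern and the processing body are placeholders.

    use strict;
    use warnings;

    my $maxforks = 8;
    my @data     = glob("data-*");      # illustrative file list

    for my $slot (0 .. $maxforks - 1) {
        defined( my $pid = fork ) or die "fork failed: $!";
        next if $pid;                                   # parent: spawn the next child
        for my $i ( grep { $_ % $maxforks == $slot } 0 .. $#data ) {
            # ... process $data[$i] here ...
        }
        exit 0;                                         # child exits when its slice is done
    }

    1 while waitpid(-1, 0) > 0;                         # parent reaps all children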
Let's imagine for a moment being the CPU or the OS, faced with a directory containing 350K files. Actually, imagine being Perl itself. May I suggest a slight improvement: try to populate the @data array after spawning the threads or processes. This matters especially on the Windows platform. Unix OSes typically benefit from copy-on-write, but that did not help for this use case. See below for before and after results.
It's quite natural to want to create the data array first, before spinning up workers. The problem is that Perl threads make a copy of it, as does the emulated fork on the Windows platform. That's not likely a problem for a few thousand items, but at 350K it's an unnecessary copy per thread.
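For contrast, the problematic ordering looks roughly like this (a two-line sketch of the idea, not the exact script): the array is built first, so every worker spawned afterwards inherits its own copy of the file list.

    # before: @data exists prior to spawning, so each thread copies it
    my @data    = glob("data-* ??/data-*");
    my @workers = map threads->create(\&process_file), 1 .. $threads;

The updated code spawns the workers first and builds @data afterwards: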
    # threads (same applies to running MCE::Child or parallel module of your choice)

    my @workers = map threads->create(\&process_file), 1 .. $threads;

    my @data = glob("data-* ??/data-*");
    my $filecount = scalar(@data);
    if ($filecount <= 0) {
        $queue->end;
        $_->join for @workers;
        die "there are no files to process";
    }

    say "Parsing $filecount files";

    foreach $infile (@data) {
        $subdir = 1 if $subdir++ > $subdircount;
        $queue->enqueue([$infile, $subdir, $i++]);
    }

    $queue->end;
    $_->join for @workers;
I created a directory containing 135,842 files. Before the update, threads consume 178 MB; after the update, 98 MB. Interestingly, for MCE::Child, each worker process consumes ~30 MB before and ~10 MB after the update.
Next, I tested before and after for a directory containing 350K files, spawning 32 workers. Threads consume 1,122 MB before and 240 MB after the update. Likewise, each MCE::Child process consumes ~63 MB before and ~10 MB after the update.
In reply to Re^2: Script exponentially slower as number of files to process increases
by marioroy
in thread Script exponentially slower as number of files to process increases
by xnous