Does your Perl lack threads support? Fortunately, MCE::Child and MCE::Channel run similarly to threads. The following are the changes to choroba's script. Basically, I replaced threads with MCE::Child and Thread::Queue with MCE::Channel. That's it, no other changes.
    9,10c9,10
    < use threads;
    < use Thread::Queue;
    ---
    > use MCE::Child;
    > use MCE::Channel;
    88c88
    < my $queue = 'Thread::Queue'->new;
    ---
    > my $queue = 'MCE::Channel'->new;
    110c110
    < my @workers = map threads->create(\&process_file), 1 .. $threads;
    ---
    > my @workers = map MCE::Child->create(\&process_file), 1 .. $threads;
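For anyone who hasn't used MCE before, here is a minimal, self-contained sketch (not choroba's script; the item list and process_item routine are made up for illustration) showing why the swap is a drop-in: MCE::Channel accepts the same enqueue/dequeue/end calls as Thread::Queue, and MCE::Child workers are created and joined like threads.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature 'say';

    use MCE::Child;
    use MCE::Channel;

    my $nworkers = 8;
    my $queue    = MCE::Channel->new;

    # Worker: pull items until the channel is ended (dequeue then returns undef).
    sub process_item {
        while ( defined( my $item = $queue->dequeue ) ) {
            say "processing $item";
        }
    }

    my @workers = map MCE::Child->create(\&process_item), 1 .. $nworkers;

    $queue->enqueue($_) for 1 .. 100;
    $queue->end;                  # signal workers that no more items are coming
    $_->join for @workers;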
Let's see how they perform in a directory containing 35,841 files. I'm on a Linux box and running from /tmp/. The scripts are configured to spin up 8 threads or processes.
    # threads, Thread::Queue
    Parsing 35841 files
    regex: 12.427632

    real    0m12.486s
    user    1m21.869s
    sys     0m1.009s

    # MCE::Child, MCE::Channel
    Parsing 35841 files
    regex: 8.971663

    real    0m9.035s
    user    0m56.504s
    sys     0m1.097s
Another monk, kikuchiyo, posted a parallel demonstration. I'm running it here simply for any monk who may like to know how it performs.
    Parsing 35841 files
    maxforks: 8
    regex: 8.622583

    real    0m8.953s
    user    0m52.559s
    sys     0m1.006s
Seeing many cores near 100% simultaneously is magical. There are { threads, Thread::Queue }, { MCE::Child, MCE::Channel }, or roll your own. All three demonstrations work well.
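kikuchiyo's actual code is not reproduced here, but a bare-bones roll-your-own version might look something like the sketch below: fork $maxforks children and hand each an interleaved slice of the file list. The glob pattern and the processing body are placeholders.

    use strict;
    use warnings;

    my $maxforks = 8;
    my @data     = glob("data-*");      # illustrative file list

    for my $slot (0 .. $maxforks - 1) {
        defined( my $pid = fork ) or die "fork failed: $!";
        next if $pid;                                   # parent: spawn the next child
        for my $i ( grep { $_ % $maxforks == $slot } 0 .. $#data ) {
            # ... process $data[$i] here ...
        }
        exit 0;                                         # child exits when its slice is done
    }

    1 while waitpid(-1, 0) > 0;                         # parent reaps all children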
Let's imagine for a moment being the CPU or the OS, faced with a directory containing 350K files. Actually, imagine being Perl itself. May I suggest a slight improvement: try to populate the @data array after spawning the threads or processes. This matters especially on the Windows platform. Unix OSes typically benefit from copy-on-write, but that did not help for this use case. See below for before and after results.
It's quite natural to want to create the data array first, before spinning up workers. The problem is that Perl threads make a copy of it, as does the emulated fork on the Windows platform. That's not likely a problem for a few thousand items, but at 350K it's an unnecessary copy per thread.
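For contrast, the problematic ordering looks roughly like this (a two-line sketch of the idea, not the exact script): the array is built first, so every worker spawned afterwards inherits its own copy of the file list.

    # before: @data exists prior to spawning, so each thread copies it
    my @data    = glob("data-* ??/data-*");
    my @workers = map threads->create(\&process_file), 1 .. $threads;

The updated code spawns the workers first and builds @data afterwards: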
    # threads (same applies to running MCE::Child or parallel module of your choice)

    my @workers = map threads->create(\&process_file), 1 .. $threads;

    my @data = glob("data-* ??/data-*");
    my $filecount = scalar(@data);
    if ($filecount <= 0) {
        $queue->end;
        $_->join for @workers;
        die "there are no files to process";
    }

    say "Parsing $filecount files";

    foreach $infile (@data) {
        $subdir = 1 if $subdir++ > $subdircount;
        $queue->enqueue([$infile, $subdir, $i++]);
    }

    $queue->end;
    $_->join for @workers;
I created a directory containing 135,842 files. Before the update, threads consume 178 MB; after the update, 98 MB. Interestingly, for MCE::Child, each worker process consumes ~30 MB before and ~10 MB after the update.
Next, I tested before and after for a directory containing 350K files, spawning 32 workers. Threads consume 1,122 MB before and 240 MB after the update. Likewise, each MCE::Child process consumes ~63 MB before and ~10 MB after the update.
In reply to Re^2: Script exponentially slower as number of files to process increases
by marioroy
in thread Script exponentially slower as number of files to process increases
by xnous