Does your Perl lack threads support? Fortunately, there is MCE::Child and MCE::Channel that run similarly to threads. The following are the changes to choroba's script. Basically, I replaced threads with MCE::Child and Thread::Queue with MCE::Channel. That's it, no other changes.

9,10c9,10 < use threads; < use Thread::Queue; --- > use MCE::Child; > use MCE::Channel; 88c88 < my $queue = 'Thread::Queue'->new; --- > my $queue = 'MCE::Channel'->new; 110c110 < my @workers = map threads->create(\&process_file), 1 .. $threads; --- > my @workers = map MCE::Child->create(\&process_file), 1 .. $threads;

Let's see how they perform in a directory containing 35,841 files. I'm on a Linux box and running from /tmp/. The scripts are configured to spin 8 threads or processes.

# threads, Thread::Queue Parsing 35841 files regex: 12.427632 real 0m12.486s user 1m21.869s sys 0m1.009s # MCE::Child, MCE::Channel Parsing 35841 files regex: 8.971663 real 0m9.035s user 0m56.504s sys 0m1.097s

Another monk, kikuchiyo posted a parallel demonstration. I'm running this simply for the monk whom may like to know how it performs.

Parsing 35841 files maxforks: 8 regex: 8.622583 real 0m8.953s user 0m52.559s sys 0m1.006s

Seeing many cores near 100% simultaneously is magical. There is { threads, Thread::Queue }; { MCE::Child, MCE::Channels }; or roll your own. All three demonstrations work well.

Let's imagine for a moment on becoming a CPU or the OS and a directory containing 350K files in it. Actually, imagine on being Perl itself. May I suggest a slight improvement... Try to populate the @data array after spawning threads or processes. This is especially true on the Windows platform. Unix OS'es benefit from Copy-on-Write, typically. That did not work for this use-case. See below for before and after results.

It's quite natural to want to create the data array first, before spinning workers. The problem is that Perl threads make a copy, including emulated fork on the Windows platform. It's not likely a problem for a few thousand items. But 350K, that's unnecessary copy per each thread.

# threads (same applies to running MCE::Child or parallel module of yo +ur choice) my @workers = map threads->create(\&process_file), 1 .. $threads; my @data = glob("data-* ??/data-*"); my $filecount = scalar(@data); if ($filecount <= 0) { $queue->end; $_->join for @workers; die "there are no files to process"; } say "Parsing $filecount files"; foreach $infile (@data) { $subdir = 1 if $subdir++ > $subdircount; $queue->enqueue([$infile, $subdir, $i++]); } $queue->end; $_->join for @workers;

I created a directory containing 135,842 files. Before: threads consume 178 MB; after update: threads consume 98 MB. Interestingly, for MCE::Child... before and after update: each worker process consume ~ 30 MB and ~ 10 MB, respectively.

Next, I tested before and after for a directory containing 350K files; spawning 32 workers. Threads before and after update consume 1,122 MB and 240 MB, respectively. Likewise, each MCE::Child process consume before and after update ~ 63 MB and ~ 10 MB, distinctively.


In reply to Re^2: Script exponentially slower as number of files to process increases by marioroy
in thread Script exponentially slower as number of files to process increases by xnous

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.