in reply to Re: Need to speed up many regex substitutions and somehow make them a here-doc list (MCE solution)
in thread Need to speed up many regex substitutions and somehow make them a here-doc list

Sometimes I like to know what the overhead for MCE really is. So here is something that measures just the chunking nature of MCE: simply comment out user_begin, user_end, and the call to the process routine. That's it.

#!/usr/bin/env perl

use strict;
use warnings;

use MCE;
use Time::HiRes 'time';

die "usage: $0 infile1.txt\n" unless @ARGV;

my $OUT_FH;   # output file-handle used by workers

# Spawn worker pool.
my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => '64K',
    init_relay  => 0,   # specifying init_relay loads MCE::Relay
    use_slurpio => 1,   # enable slurpio

   # user_begin => sub {
   #     # worker begin routine per each file to be processed
   #     my ($outfile) = @{ MCE->user_args() };
   #     open $OUT_FH, '>>', $outfile;
   # },

   # user_end => sub {
   #     # worker end routine per each file to be processed
   #     close $OUT_FH if defined $OUT_FH;
   # },

    user_func => sub {
        # worker chunk routine
        my ($mce, $chunk_ref, $chunk_id) = @_;
       # process_chunk($chunk_ref);
    }
)->spawn;

my $start = time;

$mce->process($ARGV[0]);

printf "%0.3f seconds\n", time - $start;
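For reference, a filled-in process_chunk might look like the sketch below, assuming user_begin and user_end are uncommented and the output file name is passed along via the user_args option. The two substitutions are placeholders only, not the OP's actual list.

sub process_chunk {
    # With use_slurpio => 1, $chunk_ref is a scalar reference to the
    # raw chunk read from the input file.
    my ($chunk_ref) = @_;

    # Placeholder patterns -- the OP's actual substitutions go here.
    $$chunk_ref =~ s/foo/bar/g;
    $$chunk_ref =~ s/baz/qux/g;

    # init_relay loads MCE::Relay; relay blocks run one at a time in
    # chunk_id order, so output is appended in the original file order.
    MCE::relay { print {$OUT_FH} $$chunk_ref };
}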

I have a big file which is 767 MB. The overhead is a fraction of a second.

$ ls -lh big.txt
-rw-r--r-- 1 mario mario 767M Oct 5 10:07 big.txt

$ perl demo.pl big.txt
0.154 seconds

Edit: That was from the OS-level cache, as I had already read the file during prior testing.

Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list (MCE solution)
by marioroy (Prior) on Oct 08, 2022 at 01:52 UTC

    The OP mentioned a large number of text files (thousands to millions at a time, up to a couple of MB each). I think that parallelization is better broken down at the file level. Basically, create a list of input files and chunk the list instead. Since the list may range from thousands to millions, go with chunk_size 1 or 2.

    Notice that the workers are spawned early, before the large array is created. Create the array afterwards and pass an array reference to MCE so that an extra copy is not made. This is how to tackle a big job while keeping overhead low. Then fasten your seat belt and enjoy the parallelization in top or htop.

    use strict;
    use warnings;

    use MCE;
    use Time::HiRes 'time';

    sub process_file {
        my ($file) = @_;
    }

    my $mce = MCE->new(
        max_workers => MCE::Util::get_ncpu(),
        chunk_size  => 2,
        user_func   => sub {
            my ($mce, $chunk_ref, $chunk_id) = @_;
            process_file($_) for @{ $chunk_ref };
        }
    )->spawn;

    my @file_list = (1 .. 1_000_000);   # simulate a list of 1 million files

    my $start = time;

    $mce->process(\@file_list);

    printf "%0.3f seconds\n", time - $start;

    $mce->shutdown;   # reap workers
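    For illustration, process_file might be filled in like the sketch below. The output naming and the single substitution are placeholders only, not the OP's actual patterns, and the real @file_list would come from the directory (e.g. via glob or File::Find) rather than the simulated numbers.

    sub process_file {
        my ($file) = @_;

        open my $in, '<', $file
            or do { warn "cannot open $file: $!"; return };
        open my $out, '>', "$file.out"
            or do { warn "cannot open $file.out: $!"; return };

        while ( my $line = <$in> ) {
            # Placeholder pattern -- the OP's actual substitutions go here.
            $line =~ s/foo/bar/g;
            print {$out} $line;
        }

        close $in;
        close $out;
    }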

    Let's find out the IPC overhead, something I have wondered about myself.

    chunk_size  1    3.773 seconds    1 million chunks
    chunk_size  2    1.930 seconds    500 thousand chunks
    chunk_size 10    0.423 seconds    100 thousand chunks
    chunk_size 20    0.234 seconds    50 thousand chunks

    That works out to roughly 4 to 5 microseconds of IPC overhead per chunk. It is mind-boggling nonetheless: just a fraction of a second for 50 thousand chunks. Moreover, 2 seconds will not be felt when processing 500 thousand files, nor will 4 seconds when handling 1 million files.