in reply to Re: Need to speed up many regex substitutions and somehow make them a here-doc list (MCE solution)
in thread Need to speed up many regex substitutions and somehow make them a here-doc list

Sometimes I like to know what the overhead for MCE really is. So here is something that measures just the chunking nature of MCE: simply comment out user_begin, user_end, and the call to the process routine. That's it.

#!/usr/bin/env perl

use strict;
use warnings;

use MCE;
use Time::HiRes 'time';

die "usage: $0 infile1.txt\n" unless @ARGV;

my $OUT_FH;   # output file-handle used by workers

# Spawn worker pool.
my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => '64K',
    init_relay  => 0,   # specifying init_relay loads MCE::Relay
    use_slurpio => 1,   # enable slurpio

   # user_begin => sub {
   #     # worker begin routine per each file to be processed
   #     my ($outfile) = @{ MCE->user_args() };
   #     open $OUT_FH, '>>', $outfile;
   # },

   # user_end => sub {
   #     # worker end routine per each file to be processed
   #     close $OUT_FH if defined $OUT_FH;
   # },

    user_func => sub {
        # worker chunk routine
        my ($mce, $chunk_ref, $chunk_id) = @_;
       # process_chunk($chunk_ref);
    }
)->spawn;

my $start = time;

$mce->process($ARGV[0]);

printf "%0.3f seconds\n", time - $start;
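For reference, a filled-in process_chunk might look like the sketch below, assuming user_begin and user_end are uncommented and the output file name is passed along via the user_args option. The two substitutions are placeholders only, not the OP's actual list.

sub process_chunk {
    # With use_slurpio => 1, $chunk_ref is a scalar reference to the
    # raw chunk read from the input file.
    my ($chunk_ref) = @_;

    # Placeholder patterns -- the OP's actual substitutions go here.
    $$chunk_ref =~ s/foo/bar/g;
    $$chunk_ref =~ s/baz/qux/g;

    # init_relay loads MCE::Relay; relay blocks run one at a time in
    # chunk_id order, so output is appended in the original file order.
    MCE::relay { print {$OUT_FH} $$chunk_ref };
}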

I have a big file which is 767 MB. The overhead is a fraction of a second.

$ ls -lh big.txt
-rw-r--r-- 1 mario mario 767M Oct 5 10:07 big.txt

$ perl demo.pl big.txt
0.154 seconds

Edit: That was from the OS-level cache, as I had already read the file during prior testing.

Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list (MCE solution)
by marioroy (Prior) on Oct 08, 2022 at 01:52 UTC

    The OP mentioned a large number of text files (thousands to millions at a time, up to a couple of MB each). I think that parallelization is better broken down at the file level. Basically, create a list of input files and chunk the list instead. Since the list may range from thousands to millions, go with chunk_size 1 or 2.

    Notice that the workers are spawned early, before the large array is created. Create the array afterwards and pass an array reference to MCE so that an extra copy is not made. This is how to tackle a big job while keeping overhead low. Then fasten your seat belt and enjoy the parallelization in top or htop.

    use strict;
    use warnings;

    use MCE;
    use Time::HiRes 'time';

    sub process_file {
        my ($file) = @_;
    }

    my $mce = MCE->new(
        max_workers => MCE::Util::get_ncpu(),
        chunk_size  => 2,
        user_func   => sub {
            my ($mce, $chunk_ref, $chunk_id) = @_;
            process_file($_) for @{ $chunk_ref };
        }
    )->spawn;

    my @file_list = (1 .. 1_000_000);   # simulate a list of 1 million files

    my $start = time;

    $mce->process(\@file_list);

    printf "%0.3f seconds\n", time - $start;

    $mce->shutdown;   # reap workers
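    For illustration, process_file might be filled in like the sketch below. The output naming and the single substitution are placeholders only, not the OP's actual patterns, and the real @file_list would come from the directory (e.g. via glob or File::Find) rather than the simulated numbers.

    sub process_file {
        my ($file) = @_;

        open my $in, '<', $file
            or do { warn "cannot open $file: $!"; return };
        open my $out, '>', "$file.out"
            or do { warn "cannot open $file.out: $!"; return };

        while ( my $line = <$in> ) {
            # Placeholder pattern -- the OP's actual substitutions go here.
            $line =~ s/foo/bar/g;
            print {$out} $line;
        }

        close $in;
        close $out;
    }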

    Let's find out the IPC overhead, something I have wondered about myself.

    chunk_size  1    3.773 seconds    1 million chunks
    chunk_size  2    1.930 seconds    500 thousand chunks
    chunk_size 10    0.423 seconds    100 thousand chunks
    chunk_size 20    0.234 seconds    50 thousand chunks

    That works out to roughly 4 to 5 microseconds of IPC overhead per chunk. It is mind-boggling nonetheless: just a fraction of a second for 50 thousand chunks. Moreover, 2 seconds will not be felt when processing 500 thousand files, nor will 4 seconds when handling 1 million files.