in reply to Re: script optmization
in thread script optmization

For larger data files, when one doesn't want to deal with chunking manually, there is the parallel MCE module. Here is what one might construct using MCE::Flow. We're running 4 workers, so chunking at 24 MB is plenty. Perl and CPAN are amazing for allowing this.

use strict;
use warnings;
use autodie;

use MCE::Flow;

open Newfile, ">", "./Newfile.txt" or die "Cannot create Newfile.txt";
Newfile->autoflush(1);   # important, enable autoflush

my ($f1, $f2, @seq) = ('seq.txt', 'mytext.txt');

open(my $fh, $f1);
foreach (<$fh>) {
    chomp;
    s/^\s+|\s+$//g;
    push @seq, $_;
}
close $fh;

@seq = sort bylen @seq;   # need to sort @seq by length

MCE::Flow::init {
    max_workers => 4,
    chunk_size  => '24m',
    init_relay  => 1,
    use_slurpio => 1,
};

# For best performance, provide MCE the path, e.g. $f2,
# versus a file handle. Workers communicate among themselves
# the next offset without involving the manager process.

mce_flow_f sub {
    my ($mce, $slurp_ref, $chunk_id) = @_;

    foreach my $r (@seq) {
        my $t = $r;
        $t =~ s/\h+/bbb/g;
        $$slurp_ref =~ s/$r/$t/g;
    }

    # Relay capability is useful for running something orderly.
    # For this use case, we've enabled autoflush on the file above.
    # Only one worker is allowed to run when entering the block.

    MCE::relay sub { print Newfile $$slurp_ref };

}, $f2;

MCE::Flow::finish();

close Newfile;
exit 0;

sub bylen { length($b) <=> length($a); }

Re^3: script optmization
by Anonymous Monk on May 14, 2017 at 23:00 UTC

    Time to run against a 200 MB file, made by taking the OP's input file and appending it to itself 884,873 times; a sketch of building such a test file follows the timings below.

    serial:   12.557 seconds
    slurped:   1.644 seconds   7.6x
    parallel:  0.531 seconds  23.6x
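
    A minimal sketch of how such a test file might be built; the file names here (the OP's small input as mytext_small.txt, the enlarged copy as mytext.txt) are assumptions for illustration:

    use strict;
    use warnings;
    use autodie;

    # Hypothetical names: the OP's original input and the enlarged test copy.
    my ($small, $big) = ('mytext_small.txt', 'mytext.txt');

    open my $in, '<', $small;
    my $text = do { local $/; <$in> };    # slurp the original input
    close $in;

    open my $out, '>', $big;
    print {$out} $text for 1 .. 884_873;  # append it 884,873 times (~200 MB)
    close $out;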

      For memory-constrained systems, one may run with a smaller chunk_size; in this case 4 MB does just fine. For this demonstration, the smaller chunk size also decreases the time workers wait before reading again. It completes in 0.476 seconds, a 26.4x speedup.

      chunk_size => '4m',
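
      In context, that is the only change; a sketch of the init block from the example above with the smaller chunk size:

      MCE::Flow::init {
          max_workers => 4,
          chunk_size  => '4m',
          init_relay  => 1,
          use_slurpio => 1,
      };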

      The slurped example is likely fast enough, and parallel is nice if you want it. But serial doesn't take that long either; all three completed in less than 20 seconds.
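
      For reference, a minimal sketch of what a slurped serial version might look like, assuming the same seq.txt and mytext.txt inputs and the same bbb substitution as the parallel example above (the actual slurped script benchmarked here may differ):

      use strict;
      use warnings;
      use autodie;

      my ($f1, $f2) = ('seq.txt', 'mytext.txt');

      # Read the patterns and sort longest first, as in the parallel example.
      open my $fh, '<', $f1;
      chomp(my @seq = <$fh>);
      close $fh;
      s/^\s+|\s+$//g for @seq;
      @seq = sort { length($b) <=> length($a) } @seq;

      # Slurp the entire data file at once instead of reading line by line.
      open my $in, '<', $f2;
      my $text = do { local $/; <$in> };
      close $in;

      foreach my $r (@seq) {
          my $t = $r;
          $t =~ s/\h+/bbb/g;      # replace horizontal whitespace with 'bbb'
          $text =~ s/$r/$t/g;     # apply the substitution across the whole slurp
      }

      open my $out, '>', './Newfile.txt';
      print {$out} $text;
      close $out;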