in reply to Re: script optmization
in thread script optmization

For larger data files, when one doesn't want to deal with chunking manually, there is the parallel MCE module. Here is what one might construct using MCE::Flow. We're running 4 workers, so chunking at 24 MB is plenty. Perl and CPAN are amazing for allowing this.

use strict;
use warnings;
use autodie;

use MCE::Flow;

open Newfile, ">", "./Newfile.txt" or die "Cannot create Newfile.txt";
Newfile->autoflush(1);   # important, enable autoflush

my ($f1, $f2, @seq) = ('seq.txt', 'mytext.txt');

open(my $fh, $f1);
foreach (<$fh>) {
    chomp;
    s/^\s+|\s+$//g;
    push @seq, $_;
}
close $fh;

@seq = sort bylen @seq;   # need to sort @seq by length

MCE::Flow::init {
    max_workers => 4,
    chunk_size  => '24m',
    init_relay  => 1,
    use_slurpio => 1,
};

# For best performance, provide MCE the path, e.g. $f2,
# versus a file handle. Workers communicate among themselves
# the next offset without involving the manager process.

mce_flow_f sub {
    my ($mce, $slurp_ref, $chunk_id) = @_;

    foreach my $r (@seq) {
        my $t = $r;
        $t =~ s/\h+/bbb/g;
        $$slurp_ref =~ s/$r/$t/g;
    }

    # Relay capability is useful for running something orderly.
    # For this use case, we've enabled autoflush on the file above.
    # Only one worker is allowed to run when entering the block.

    MCE::relay sub { print Newfile $$slurp_ref };

}, $f2;

MCE::Flow::finish();

close Newfile;
exit 0;

sub bylen { length($b) <=> length($a); }

Re^3: script optmization
by Anonymous Monk on May 14, 2017 at 23:00 UTC

    Time to run against a 200 MB file, made by taking the OP's input file and appending it to itself 884,873 times; a sketch of building such a test file follows the timings below.

    serial:   12.557 seconds
    slurped:   1.644 seconds   7.6x
    parallel:  0.531 seconds  23.6x
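
    A minimal sketch of how such a test file might be built; the file names here (the OP's small input as mytext_small.txt, the enlarged copy as mytext.txt) are assumptions for illustration:

    use strict;
    use warnings;
    use autodie;

    # Hypothetical names: the OP's original input and the enlarged test copy.
    my ($small, $big) = ('mytext_small.txt', 'mytext.txt');

    open my $in, '<', $small;
    my $text = do { local $/; <$in> };    # slurp the original input
    close $in;

    open my $out, '>', $big;
    print {$out} $text for 1 .. 884_873;  # append it 884,873 times (~200 MB)
    close $out;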

      For memory-constrained systems, one may run with a smaller chunk_size; in this case 4 MB does just fine. For this demonstration, the smaller chunk size also decreases the time workers wait before reading again. It completes in 0.476 seconds, a 26.4x speedup.

      chunk_size => '4m',
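
      In context, that is the only change; a sketch of the init block from the example above with the smaller chunk size:

      MCE::Flow::init {
          max_workers => 4,
          chunk_size  => '4m',
          init_relay  => 1,
          use_slurpio => 1,
      };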

      The slurped example is likely fast enough, and parallel is nice if you want it. But serial doesn't take that long either; all three completed in less than 20 seconds.
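
      For reference, a minimal sketch of what a slurped serial version might look like, assuming the same seq.txt and mytext.txt inputs and the same bbb substitution as the parallel example above (the actual slurped script benchmarked here may differ):

      use strict;
      use warnings;
      use autodie;

      my ($f1, $f2) = ('seq.txt', 'mytext.txt');

      # Read the patterns and sort longest first, as in the parallel example.
      open my $fh, '<', $f1;
      chomp(my @seq = <$fh>);
      close $fh;
      s/^\s+|\s+$//g for @seq;
      @seq = sort { length($b) <=> length($a) } @seq;

      # Slurp the entire data file at once instead of reading line by line.
      open my $in, '<', $f2;
      my $text = do { local $/; <$in> };
      close $in;

      foreach my $r (@seq) {
          my $t = $r;
          $t =~ s/\h+/bbb/g;      # replace horizontal whitespace with 'bbb'
          $text =~ s/$r/$t/g;     # apply the substitution across the whole slurp
      }

      open my $out, '>', './Newfile.txt';
      print {$out} $text;
      close $out;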